Nemotron 3 Nano 4B: A Compact Hybrid Model for Efficient Local AI

NVIDIA has introduced Nemotron 3 Nano 4B, a compact 4-billion-parameter model with a hybrid Mamba-Transformer architecture, designed for high-efficiency local AI deployment on RTX GPUs and Jetson platforms.
We are excited to introduce Nemotron 3 Nano 4B, the newest and most compact member of the Nemotron 3 family. Leveraging a hybrid Mamba-Transformer architecture, this model is designed for efficiency and accuracy in a targeted set of capabilities, setting a new standard for small language models. The model runs on any NVIDIA GPU-enabled platform and combines state-of-the-art instruction following and exceptional tool use with a minimal VRAM footprint.
With just 4 billion parameters, Nemotron 3 Nano 4B is compact enough to run at the edge on NVIDIA Jetson platforms (Jetson Thor/Jetson Orin Nano) as well as NVIDIA DGX Spark and NVIDIA RTX GPUs. This enables faster response times, enhanced data privacy, and flexible deployment while keeping inference costs low.
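To make the parameter count concrete, the following is a rough back-of-the-envelope sketch of weights-only memory at a few numeric precisions. The precisions listed are illustrative assumptions, not a statement of which formats the model ships in, and the nominal figure of 4 billion parameters is taken from the text above. The estimate ignores the KV cache, Mamba state, activations, and runtime overhead, so actual peak VRAM will be higher.

```python
# Weights-only memory estimate: parameters * bytes per parameter.
# The precisions below are illustrative assumptions, not confirmed release formats.
BYTES_PER_PARAM = {
    "16-bit (fp16/bf16)": 2.0,
    "8-bit (fp8/int8)": 1.0,
    "4-bit (int4)": 0.5,
}


def weight_memory_gib(num_params: float, bytes_per_param: float) -> float:
    """Return the memory needed to hold the weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30


if __name__ == "__main__":
    nominal_params = 4e9  # "4 billion parameters" as stated in the article
    for name, width in BYTES_PER_PARAM.items():
        gib = weight_memory_gib(nominal_params, width)
        print(f"{name}: ~{gib:.2f} GiB for weights only")
```

Under these assumptions, 16-bit weights alone come to roughly 7.5 GiB, which suggests why lower-precision formats matter on memory-constrained edge devices.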
Nemotron 3 Nano 4B is our first model specifically optimized for on-device deployment and purpose-built to power local conversational agents and personas across GeForce RTX, Jetson, and Spark customer use cases. This model achieves state-of-the-art accuracy and efficiency in several dimensions key to production use at the edge:
- Instruction following (IFBench, IFEval): state-of-the-art in its size class
- Gaming agency/intelligence (Orak): state-of-the-art in its size class
- VRAM efficiency (peak memory use): lowest VRAM footprint in its size class under both low and high input/output sequence length (ISL/OSL) settings
- Latency: lowest time to first token (TTFT) in its size class under high ISL settings
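
For readers who want to check a latency figure like this on their own hardware, here is a minimal sketch of measuring time to first token (TTFT) over any token stream. It is not the benchmark harness behind the claims above. `measure_ttft` and `fake_stream` are hypothetical helpers introduced for illustration; in practice the stream would be whatever iterator your inference runtime returns when streaming tokens.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def measure_ttft(
    stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float, int]:
    """Consume a token stream and return (ttft_s, total_s, token_count).

    Timing starts just before the first token is requested, so any prompt
    processing the stream performs lazily is included in the TTFT.
    """
    start = clock()
    ttft = None
    count = 0
    for _ in stream:
        if ttft is None:
            ttft = clock() - start  # elapsed time until the first token arrived
        count += 1
    total = clock() - start
    if ttft is None:
        raise ValueError("stream produced no tokens")
    return ttft, total, count


def fake_stream(prefill_s: float, per_token_s: float, n_tokens: int) -> Iterator[str]:
    """Hypothetical stand-in for a model: sleeps to mimic prefill and decode."""
    time.sleep(prefill_s)  # simulated prompt processing; grows with input length
    for i in range(n_tokens):
        if i > 0:
            time.sleep(per_token_s)  # simulated per-token decode time
        yield f"tok{i}"


if __name__ == "__main__":
    ttft, total, count = measure_ttft(fake_stream(0.05, 0.01, 5))
    print(f"TTFT: {ttft * 1000:.1f} ms, total: {total * 1000:.1f} ms, tokens: {count}")
```

The clock is injectable so the helper can be tested deterministically. With a real runtime, TTFT at high ISL is dominated by prompt processing, so longer prompts are where differences between models tend to show up.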
Furthermore, Nemotron 3 Nano 4B delivers excellent tool-use performance and is highly competitive in hallucination avoidance. Together, these capabilities demonstrate the model’s strong suitability for edge use cases.
Nemotron 3 Nano 4B was pruned and distilled from Nemotron Nano 9B v2 using the Nemotron Elastic framework, allowing it to inherit that model's strong reasoning capabilities as a hybrid reasoning model. It was further post-trained with a new recipe derived from Nemotron 3 Post-training data, enabling the model to excel at task solving even without explicit thinking. Finally, as an open-source model, it empowers the ecosystem to customize, fine-tune, and optimize it for domain-specific use cases.
Source: Hugging Face Blog