Nemotron 3 delivers agentic AI with a hybrid Mamba-Transformer MoE core, offering Nano for low-cost inference and Super/Ultra for scalable reasoning.
AI News Team

Nemotron 3: NVIDIA's Hybrid MoE Open Models for Efficient Agentic AI
NVIDIA has rolled out Nemotron 3, a family of open models built for agentic AI workloads. Nano is already available, with Super and Ultra due in the coming months. Published on December 15, 2025, the Nemotron 3 lineup aims to be among the most efficient open models for reasoning, dialogue, and long-form interactions, while promising solid per-inference cost profiles. If you’re building autonomous agents or AI copilots that need deep reasoning without blowing up compute costs, this is one to watch.
The technical core is as important as the name. Nemotron 3 is built around a hybrid Mamba-Transformer MoE architecture, a variant of the Mixture-of-Experts idea that seeks to keep throughput high while preserving or beating the accuracy of traditional transformers. The Nano model emphasizes cost-efficient inference, with accuracy that outperforms comparable models at smaller scales. The roadmap makes it clear the larger siblings aren’t just bigger; they bring architectural refinements like LatentMoE, a hardware-aware expert design aimed at improving accuracy for Super and Ultra. In addition, the Super and Ultra architectures include Multi-Token Prediction layers to boost long-form generation efficiency and quality, and NVFP4, NVIDIA's 4-bit floating-point format, appears in the feature list as part of the performance toolkit at these scales.
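To make the Mixture-of-Experts idea concrete, here is a minimal sketch of sparsely gated top-k expert routing in PyTorch. It illustrates the general mechanism only; the expert count, hidden sizes, and routing details are illustrative assumptions, not the Nemotron 3 or LatentMoE design.

```python
# Minimal sketch of sparsely gated top-k MoE routing (generic illustration,
# NOT the Nemotron 3 implementation; sizes and routing details are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMoE(nn.Module):
    def __init__(self, d_model: int = 512, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)  # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model),
                          nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model) -> flatten tokens for per-token routing
        tokens = x.reshape(-1, x.shape[-1])
        gate_logits = self.router(tokens)                    # (tokens, n_experts)
        weights, expert_idx = gate_logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)

        out = torch.zeros_like(tokens)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = expert_idx[:, k] == e                 # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * expert(tokens[mask])
        return out.reshape_as(x)


if __name__ == "__main__":
    layer = TopKMoE()
    y = layer(torch.randn(2, 16, 512))
    print(y.shape)  # torch.Size([2, 16, 512])
```

The key property is that each token activates only a small subset of experts, which is how MoE models grow parameter counts without a proportional jump in per-token compute.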
This isn’t noise. The emphasis on open models and practical agentic capabilities means developers should consider what Nemotron 3 enables that standard transformers struggle with at scale. The Nano variant is pitched as strong accuracy at low inference cost, which matters for edge or consumer-grade deployments where latency and power budgets are tight. The Super configuration targets collaborative agents and high-volume workflows such as IT ticket automation, where throughput and consistent responses matter as much as correctness. Ultra, by contrast, claims top-tier accuracy and reasoning performance, aimed at the most demanding tasks.
From a developer perspective, the story here is open access paired with architectural specialization. MoE models have historically offered a path to scale parameters without a proportional jump in compute, but they bring routing, sparsity, and hardware considerations. Nemotron 3 uses a Mamba-Transformer MoE core for throughput plus LatentMoE for the higher tiers to squeeze more accuracy from the same or similar compute budgets. The addition of Multi-Token Prediction layers is designed to improve efficiency and quality for long-form content, which matters for agents that must maintain coherence across multi-turn conversations or extended reasoning traces. If you’re evaluating these for production, you’ll want to study the accompanying white paper and technical reports to understand the routing, sparsity, and hardware optimizations in detail.
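For the Multi-Token Prediction layers mentioned above, the general idea is to attach extra heads that predict several future tokens from the same hidden state, which can improve long-form efficiency and quality. The sketch below shows a generic version of that idea; the head count, loss weighting, and how it plugs into the backbone are assumptions, not Nemotron 3's actual design.

```python
# Generic multi-token-prediction heads: predict tokens t+1..t+n from the hidden
# state at position t. Illustrative only; not the Nemotron 3 architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiTokenPredictionHead(nn.Module):
    def __init__(self, d_model: int = 512, vocab_size: int = 32000, n_future: int = 4):
        super().__init__()
        self.n_future = n_future
        self.heads = nn.ModuleList(nn.Linear(d_model, vocab_size) for _ in range(n_future))

    def forward(self, hidden: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        # hidden:  (batch, seq, d_model) final hidden states from the backbone
        # targets: (batch, seq) token ids aligned with the hidden states
        loss = hidden.new_zeros(())
        for i, head in enumerate(self.heads, start=1):
            # predict the token i steps ahead; drop positions with no such target
            logits = head(hidden[:, :-i, :])                 # (batch, seq-i, vocab)
            labels = targets[:, i:]                          # (batch, seq-i)
            loss = loss + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), labels.reshape(-1)
            )
        return loss / self.n_future


if __name__ == "__main__":
    mtp = MultiTokenPredictionHead()
    h = torch.randn(2, 32, 512)
    t = torch.randint(0, 32000, (2, 32))
    print(mtp(h, t).item())
```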
The wider context matters too. Nemotron 3 continues a familiar thread in open-model and MoE research, aligning with the industry move toward sparsely gated expert networks as a path to scaling intelligence without a straight-line compute bill. For teams weighing options, it’s worth contrasting with the original mixture-of-experts concept and later MoE iterations that showed how sparsity can unlock large parameter counts with tractable training and inference costs. For developers already invested in NVIDIA tooling, there are related docs to explore, including NVIDIA's research pages and the NeMo tooling for building and deploying AI models.
What this means for you as a developer or AI engineer is practical: evaluate Nano for cost-conscious inference, especially in high-throughput but tighter-budget scenarios; plan for Super if you need reliable agent collaboration and automation at scale; reserve Ultra for ambitious reasoning tasks where the highest accuracy matters and you have the compute headroom. The timing matters too. Nano is out now, with Super and Ultra following in the coming months. These releases will interact with existing NVIDIA tooling and docs, so stay tuned for official guidance on deployment, benchmarking, and integration with your existing MLOps stack. For anyone building agentic assistants, chatbots, or automated ticketing systems, Nemotron 3 is a solid example that shows open models can be both more capable and more cost-efficient at scale.
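As a starting point for evaluating Nano, a minimal inference script along these lines is often enough to get first latency and quality numbers. The model identifier below is a placeholder assumption; check the Nemotron 3 page for the published checkpoint name and license before running it.

```python
# Minimal sketch for trying an open Nemotron 3 Nano checkpoint with Hugging Face
# transformers. MODEL_ID is a placeholder assumption, not a confirmed repo name.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "nvidia/nemotron-3-nano"  # placeholder: verify the actual model id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,   # lower memory for cost-sensitive deployments
    device_map="auto",
)

prompt = "Summarize the open IT tickets assigned to the networking team."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Swapping the dtype, batch size, and max_new_tokens against your real prompts gives a rough per-request cost profile to compare the tiers against.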
Looking ahead, this reminds us that real AI progress often blends open access with practical architecture. If you want to dig deeper, start with the primary source and related materials from NVIDIA. You can read more on the Nemotron 3 page, check NVIDIA Research for context, study the original MoE concepts, and connect this to NVIDIA's tooling. The pace of release and the tiered approach to capabilities suggests a path where teams tailor their model choice to their workload, rather than chasing a single giant model. For anyone building production AI systems, Nemotron 3 deserves a careful read and a concrete test plan.
NVIDIA Nemotron 3 page | NVIDIA Research | Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer (original MoE paper) | NVIDIA NeMo developer resources | Switch Transformers and sparse MoE context (Google AI Blog)