NVIDIA Nemotron 3 Ultra: The Fastest Open Model for Agent AI
Main chat
A chat for vibe coders: news, guides, live cases, marketplace, and finding executors.
On June 4, 2026, NVIDIA released the Nemotron 3 Ultra, the flagship model of the Nemotron 3 family and by far one of the most productive open language models in the world. This is not just another big LLM: the architecture, the principles of learning and the focus of application are fundamentally different.
Why do you even need to
Single chatbots are a thing of the past. Modern AI is a long chain of agents: the scheduler calls tools, tools return data, data is transmitted further, subtasks are delegated to sub-agents. With each step, the context window grows, the cost of inferencing increases, and the risk of loss of purpose accumulates.
Nemotron 3 Ultra is designed for this scenario: complex orchestration, long context, high speed – with accuracy at the level of the best open models.
What's Inside: Architecture
550 billion parameters, 55 active parameters
The Nemotron 3 Ultra is a Mixture-of-Experts model. The total parameters are 550B, but only 55B is activated for each token. This means: the power of a large model with significantly lower computational costs for inference.
Mamba + Transformer hybrid
Instead of a pure Transformer architecture, NVIDIA uses a hybrid: most layers are Mamba (long sequence processing with sub-square complexity), several layers of Attention are left to accurately extract information from context. This provides a context window of up to 1 million tokens at a reasonable cost.
LatentMoE
Own routing mechanism of experts from NVIDIA. Compared to the standard MoE, it improves the accuracy of the model due to improved distribution of tokens among specialized experts.
MTP (Multi-Token Prediction)
The model predicts multiple tokens simultaneously, which works as built-in speculative decoding. The adoption of the first two predicted tokens is around 97% – this significantly speeds up generation, especially in long sessions.
NVFP4 quantization
The model is pretrained in NVFP4 format - 4-bit floating point accuracy from NVIDIA. This provides up to 5x higher throughput compared to similar models on the GB200/GB300 GPU architecture with minimal quality losses (deviation from BF16 is less than 0.4%).
How to train
Pre-education
The model is trained on 16+ trillion tokens, including:
- 173 billion GitHub code tokens (data through September 2025)
- synthetic datasets for legal purposes
- datasets to improve actual accuracy and complex reasoning scenarios
Post-training: SFT → RL → MOPD
After pre-training, the model underwent a three-stage pipeline training:
- SFT - supervised fine tuning for basic alignment
- RL – Multi-environmental reinforcement learning (15 new RL environments in this release, 55 in total)
- MOPD (Multi-teacher On-Policy Distillation) - knowledge distillation from 10+ specialized models-teachers. Each round: the student rollouts, the teachers give tight feedback, the knowledge merges back. NVIDIA conducted 2 iterations of the MOPD for Nemotron 3 Ultra.
Performance: Accuracy and Speed
Accuracy on key benchmarks
| Бенчмарк | Nemotron 3 Ultra | GLM 5.1 (744B) | Kimi K2.6 (1T) | Qwen 3.5 (397B) |
|---|---|---|---|---|
| Agent Productivity (PinchBench) | 91% | 84% | 91% | 89% |
| Instruction Following (IFBench) | 82% | 77% | 74% | 78% |
| Long Context (Ruler @1M) | 95% | N/A (макс. 256K) | N/A (макс. 256K) | 90% |
| Professional Work (ProfBench) | 56% | 46% | 56% | 53% |
| Coding (Terminal-Bench 2.0) | 54% | 64% | 67% | 53% |
| SWE-Bench Verified | 71.9 | — | — | — |
| IOI 2025 (конкурсное программирование) | 570.0 | — | — | — |
According to NVIDIA, the result of 570.0 at IOI 2025 corresponds to the top 3 level among people in competitive programming.
Inferencing speed
This is the main advantage of the model. On the 8K entry/64K exit token setting, Nemotron 3 Ultra outperforms competitors:
- in 5.9x faster than GLM-5.1-754B
- in 4.8x faster than Kimi-K2.6-1T
- in 1.6x faster than Qwen-3.5-397B
It’s not just statistics – for agent systems, where a single session can generate hundreds of thousands of tokens, speed directly affects cost and applicability.
Three modes of reasoning
The model supports reasoning budget management in runtime:
- Reasoning off - direct answer without a chain of reasoning, for simple tasks
- Regular - Standard reasoning
- Medium - Extended reasoning for complex problems
This allows you to balance between speed and depth of analysis during the operation of the agent.
Openness: What is available
NVIDIA releases Nemotron 3 Ultra completely openly under the OpenMDW-1.1 license:
- model weight (BF16 and NVFP4)
- 10+ trillion training data tokens
- instructional
- 50 million SFT samples (10M new in this release)
- 2 million RL tasks (1M new)
- 55 RL environments (15 new)
It's rare for a model of this scale to open either weights or a piece of data.
What tasks are suitable for
The Nemotron 3 Ultra was designed for scenarios where other models are either too expensive or too slow
- *Coding ** - Supporting architectural solutions through long sessions, debugging complex systems
- **Research ** - synthesis of information from hundreds of sources with full context
- Autonomous agents – orchestration of multi-way workflows with instrument call
- *Verification – Testing complex systems (such as chip design) against thousands of constraints
Where to start
The model is available through:
- Hugging Face - weights for self-deployment
- NVIDIA NIM - ready-made inference via API
- build.nvidia.com - NVIDIA cloud platform
For maximum performance, the GB200 GPU architecture with TRT-LLM is recommended.
Outcome
Nemotron 3 Ultra is NVIDIA’s answer to the practical problem of Agent AI: how to get the quality of a frontier model at cost and production speed. Hybrid Mamba architecture, NVFP4-quantization and MTP-acceleration give up to 5-6x advantages throughput at comparable accuracy with models that weigh 2-3 times more.
Complete openness – weights, data, recipes – makes Nemotron 3 Ultra a rare case where the research community gets not just a model, but the entire stack for playback and learning.