NVIDIA Nemotron 3 Ultra: The Fastest Open Model for Agent AI

◷ 6 min read 6/5/2026 by: Alexey, VibeCode

Main chat

A chat for vibe coders: news, guides, live cases, marketplace, and finding executors.

NVIDIA Nemotron 3 Ultra: The Fastest Open Model for Agent AI - обложка

On June 4, 2026, NVIDIA released the Nemotron 3 Ultra, the flagship model of the Nemotron 3 family and by far one of the most productive open language models in the world. This is not just another big LLM: the architecture, the principles of learning and the focus of application are fundamentally different.

Why do you even need to

Single chatbots are a thing of the past. Modern AI is a long chain of agents: the scheduler calls tools, tools return data, data is transmitted further, subtasks are delegated to sub-agents. With each step, the context window grows, the cost of inferencing increases, and the risk of loss of purpose accumulates.

Nemotron 3 Ultra is designed for this scenario: complex orchestration, long context, high speed – with accuracy at the level of the best open models.

What's Inside: Architecture

550 billion parameters, 55 active parameters

The Nemotron 3 Ultra is a Mixture-of-Experts model. The total parameters are 550B, but only 55B is activated for each token. This means: the power of a large model with significantly lower computational costs for inference.

Mamba + Transformer hybrid

Instead of a pure Transformer architecture, NVIDIA uses a hybrid: most layers are Mamba (long sequence processing with sub-square complexity), several layers of Attention are left to accurately extract information from context. This provides a context window of up to 1 million tokens at a reasonable cost.

LatentMoE

Own routing mechanism of experts from NVIDIA. Compared to the standard MoE, it improves the accuracy of the model due to improved distribution of tokens among specialized experts.

MTP (Multi-Token Prediction)

The model predicts multiple tokens simultaneously, which works as built-in speculative decoding. The adoption of the first two predicted tokens is around 97% – this significantly speeds up generation, especially in long sessions.

NVFP4 quantization

The model is pretrained in NVFP4 format - 4-bit floating point accuracy from NVIDIA. This provides up to 5x higher throughput compared to similar models on the GB200/GB300 GPU architecture with minimal quality losses (deviation from BF16 is less than 0.4%).

How to train

Pre-education

The model is trained on 16+ trillion tokens, including:

173 billion GitHub code tokens (data through September 2025)
synthetic datasets for legal purposes
datasets to improve actual accuracy and complex reasoning scenarios

Post-training: SFT → RL → MOPD

After pre-training, the model underwent a three-stage pipeline training:

SFT - supervised fine tuning for basic alignment
RL – Multi-environmental reinforcement learning (15 new RL environments in this release, 55 in total)
MOPD (Multi-teacher On-Policy Distillation) - knowledge distillation from 10+ specialized models-teachers. Each round: the student rollouts, the teachers give tight feedback, the knowledge merges back. NVIDIA conducted 2 iterations of the MOPD for Nemotron 3 Ultra.

Performance: Accuracy and Speed

Accuracy on key benchmarks

Бенчмарк	Nemotron 3 Ultra	GLM 5.1 (744B)	Kimi K2.6 (1T)	Qwen 3.5 (397B)
Agent Productivity (PinchBench)	91%	84%	91%	89%
Instruction Following (IFBench)	82%	77%	74%	78%
Long Context (Ruler @1M)	95%	N/A (макс. 256K)	N/A (макс. 256K)	90%
Professional Work (ProfBench)	56%	46%	56%	53%
Coding (Terminal-Bench 2.0)	54%	64%	67%	53%
SWE-Bench Verified	71.9	—	—	—
IOI 2025 (конкурсное программирование)	570.0	—	—	—

According to NVIDIA, the result of 570.0 at IOI 2025 corresponds to the top 3 level among people in competitive programming.

Inferencing speed

This is the main advantage of the model. On the 8K entry/64K exit token setting, Nemotron 3 Ultra outperforms competitors:

in 5.9x faster than GLM-5.1-754B
in 4.8x faster than Kimi-K2.6-1T
in 1.6x faster than Qwen-3.5-397B

It’s not just statistics – for agent systems, where a single session can generate hundreds of thousands of tokens, speed directly affects cost and applicability.

Three modes of reasoning

The model supports reasoning budget management in runtime:

Reasoning off - direct answer without a chain of reasoning, for simple tasks
Regular - Standard reasoning
Medium - Extended reasoning for complex problems

This allows you to balance between speed and depth of analysis during the operation of the agent.

Openness: What is available

NVIDIA releases Nemotron 3 Ultra completely openly under the OpenMDW-1.1 license:

model weight (BF16 and NVFP4)
10+ trillion training data tokens
instructional
50 million SFT samples (10M new in this release)
2 million RL tasks (1M new)
55 RL environments (15 new)

It's rare for a model of this scale to open either weights or a piece of data.

What tasks are suitable for

The Nemotron 3 Ultra was designed for scenarios where other models are either too expensive or too slow

*Coding ** - Supporting architectural solutions through long sessions, debugging complex systems
**Research ** - synthesis of information from hundreds of sources with full context
Autonomous agents – orchestration of multi-way workflows with instrument call
*Verification – Testing complex systems (such as chip design) against thousands of constraints

Where to start

The model is available through:

Hugging Face - weights for self-deployment
NVIDIA NIM - ready-made inference via API
build.nvidia.com - NVIDIA cloud platform

For maximum performance, the GB200 GPU architecture with TRT-LLM is recommended.

Outcome

Nemotron 3 Ultra is NVIDIA’s answer to the practical problem of Agent AI: how to get the quality of a frontier model at cost and production speed. Hybrid Mamba architecture, NVFP4-quantization and MTP-acceleration give up to 5-6x advantages throughput at comparable accuracy with models that weigh 2-3 times more.

Complete openness – weights, data, recipes – makes Nemotron 3 Ultra a rare case where the research community gets not just a model, but the entire stack for playback and learning.

NVIDIA Nemotron 3 Ultra: The Fastest Open Model for Agent AI

## Why do you even need to

## What's Inside: Architecture

### 550 billion parameters, 55 active parameters

### Mamba + Transformer hybrid

### LatentMoE

### MTP (Multi-Token Prediction)

### NVFP4 quantization

## How to train

### Pre-education

### Post-training: SFT → RL → MOPD

## Performance: Accuracy and Speed

### Accuracy on key benchmarks

### Inferencing speed

## Three modes of reasoning

## Openness: What is available

## What tasks are suitable for

## Where to start

## Outcome

Reve 2.0 – a new revolution in 2026 image generation: layouts, 4K and “touchable images”

Wikivibe MCP: how to connect a site to AI agents and why it is necessary

Why do you even need to

What's Inside: Architecture

550 billion parameters, 55 active parameters

Mamba + Transformer hybrid

LatentMoE

MTP (Multi-Token Prediction)

NVFP4 quantization

How to train

Pre-education

Post-training: SFT → RL → MOPD

Performance: Accuracy and Speed

Accuracy on key benchmarks

Inferencing speed

Three modes of reasoning

Openness: What is available

What tasks are suitable for

Where to start

Outcome