NVIDIA AI Releases Star Elastic: One Checkpoint that Contains 30B, 23B, and 12B Reasoning Models with Zero-Shot Slicing


Training a family of large language models (LLMs) has always come with a painful multiplier: every model variant in the family—whether 8B, 30B, or 70B—typically requires its own full training run, its own storage, and its own deployment stack. For a dev team running inference at scale, this means multiplying compute costs by the number of model sizes they want to support. NVIDIA researchers are now proposing a different approach called Star Elastic.

Star Elastic is a post-training method that embeds multiple nested submodels—at different parameter budgets—inside a single parent reasoning model, using a single training run. Applied to Nemotron Nano v3 (a hybrid Mamba–Transformer–MoE model with 30B total parameters and 3.6B active parameters), Star Elastic produces 23B (2.8B active) and 12B (2.0B active) nested variants trained with approximately 160B tokens. All three variants live in one checkpoint and can be extracted without any additional fine-tuning.

What Does “Nested” Actually Mean Here

If you haven’t encountered elastic or nested architectures before, the idea is this: instead of training three separate 30B, 23B, and 12B models, you train one model that contains the smaller ones as subsets of itself. The smaller submodels reuse the most important weights from the parent, identified through a process called importance estimation.

Star Elastic scores each model component (embedding channels, attention heads, Mamba SSM heads, MoE experts, and FFN channels) by how much it contributes to model accuracy. Components are then ranked and sorted, so smaller-budget submodels always use the highest-ranked contiguous subset of components from the larger model. This property is called nested weight-sharing.
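
As a minimal sketch of how nested weight-sharing falls out of importance ranking (the component granularity and scoring below are illustrative assumptions, not the released implementation):

python

import torch

def rank_and_sort(weight: torch.Tensor, importance: torch.Tensor) -> torch.Tensor:
    """Reorder rows of a weight matrix so the most important come first.

    After sorting, a smaller submodel simply keeps the first k rows, so every
    budget's selection is a contiguous prefix of the larger model's weights
    (the nested weight-sharing property).
    """
    order = torch.argsort(importance, descending=True)
    return weight[order]

# Toy example: an FFN projection with 8 intermediate channels shared by 3 budgets.
w = torch.randn(8, 16)          # 8 channels x 16 hidden dims
scores = torch.rand(8)          # placeholder importance estimates
w_sorted = rank_and_sort(w, scores)

full   = w_sorted[:8]           # largest budget keeps all channels
medium = w_sorted[:6]           # mid budget keeps the top 6
small  = w_sorted[:4]           # smallest budget keeps the top 4 (a prefix of both)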

The method supports nesting along multiple axes: the SSM (State Space Model) dimension, embedding channels, attention heads, Mamba heads and head channels, MoE expert count, and FFN intermediate dimension. For MoE layers specifically, Star Elastic uses Router-Weighted Expert Activation Pruning (REAP), which ranks experts by both routing gate values and expert output magnitudes—a more principled signal than naive frequency-based pruning, which ignores how much each expert actually contributes to the layer output.
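
As a rough illustration of router-weighted expert scoring (a sketch based on the description above, not NVIDIA's code; the per-token averaging is an assumption):

python

import torch

def reap_scores(gate_probs: torch.Tensor, expert_out_norms: torch.Tensor) -> torch.Tensor:
    """Score each MoE expert by routing gate value weighted by output magnitude.

    gate_probs:       [num_tokens, num_experts] routing probabilities
    expert_out_norms: [num_tokens, num_experts] norm of each expert's output per
                      token (zero where the expert was not activated)

    Frequency-based pruning would only count activations; weighting by output
    magnitude also captures how much each expert contributes to the layer output.
    """
    return (gate_probs * expert_out_norms).mean(dim=0)   # [num_experts]

# Toy example: 4 tokens routed over 3 experts; keep the top-2 experts.
gates = torch.softmax(torch.randn(4, 3), dim=-1)
norms = torch.rand(4, 3)
keep = torch.argsort(reap_scores(gates, norms), descending=True)[:2]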

A Learnable Router, Not a Fixed Compression Recipe

A key distinction from prior compression methods like Minitron is that Star Elastic uses an end-to-end trainable router to determine the nested submodel architectures. The router takes a target budget (e.g., “give me a 2.8B active parameter model”) as a one-hot input and outputs differentiable masks that select which components are active at that budget level. These masks are trained jointly with the model through Gumbel-Softmax, which allows gradient flow through discrete architectural decisions.
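
A hedged sketch of such a router (layer sizes, mask granularity, and the straight-through setup below are illustrative assumptions): a one-hot budget ID goes in, and Gumbel-Softmax relaxes the discrete keep/drop decision per component so gradients can flow through it.

python

import torch
import torch.nn as nn
import torch.nn.functional as F

class BudgetRouter(nn.Module):
    """Maps a one-hot budget ID to keep/drop masks over model components."""

    def __init__(self, num_budgets: int = 3, num_components: int = 64):
        super().__init__()
        # Two logits (keep, drop) per component, predicted from the budget ID.
        self.proj = nn.Linear(num_budgets, num_components * 2)
        self.num_components = num_components

    def forward(self, budget_onehot: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
        logits = self.proj(budget_onehot).view(-1, self.num_components, 2)
        # Gumbel-Softmax with hard=True: discrete 0/1 masks in the forward pass,
        # differentiable in the backward pass via the straight-through estimator.
        mask = F.gumbel_softmax(logits, tau=tau, hard=True)[..., 0]
        return mask                     # [batch, num_components], entries in {0, 1}

router = BudgetRouter()
budget = F.one_hot(torch.tensor([1]), num_classes=3).float()   # e.g., the mid budget
mask = router(budget)   # multiplied into attention heads / FFN channels / experts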

The loss function combines knowledge distillation (KD), where the non-elastified parent model acts as the teacher, with a router loss that penalizes deviation from the target resource budget (parameter count, memory, or latency). This means the router learns to make architecture choices that actually improve accuracy under KD, rather than just minimizing a proxy metric.
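
In sketch form (the weighting, temperature, and the way the budget is measured are assumptions for illustration):

python

import torch.nn.functional as F

def elastic_loss(student_logits, teacher_logits, mask, target_budget, lam=1.0, T=1.0):
    """Knowledge distillation from the frozen parent plus a budget penalty.

    student_logits / teacher_logits: [batch, seq, vocab]
    mask:          soft component mask emitted by the router
    target_budget: desired fraction of components kept at this budget level
    """
    # KD term: match the parent's softened next-token distribution.
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

    # Router term: penalize deviation of the selected capacity from the target.
    budget_penalty = (mask.float().mean() - target_budget) ** 2

    return kd + lam * budget_penalty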

Training uses a two-stage curriculum: a short-context phase (sequence length 8,192 tokens) with uniform budget sampling, followed by an extended-context phase (sequence length 49,152 tokens) with non-uniform sampling that prioritizes the full 30B model (p(30B)=0.5, p(23B)=0.3, p(12B)=0.2). The extended-context phase is critical for reasoning performance. The research team's ablations on Nano v2, cited as the empirical basis for the same curriculum choice on Nano v3, show gains of up to 19.8% on AIME-2025 for the 6B variant and 4.0 percentage points for the 12B variant from Stage 2 alone, motivating its use here.
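
Purely to make the two-stage sampling concrete, a per-step budget sampler with the stated probabilities might look like this (the function itself is illustrative, not from the paper's code):

python

import random

def sample_budget(stage: int) -> str:
    """Per-step budget sampling for the two-stage curriculum."""
    if stage == 1:
        # Stage 1: short context (8K), uniform over the three nested budgets.
        return random.choice(["30B", "23B", "12B"])
    # Stage 2: extended context (48K), weighted toward the full parent model.
    return random.choices(["30B", "23B", "12B"], weights=[0.5, 0.3, 0.2], k=1)[0]

budget = sample_budget(stage=2)   # "30B" with probability 0.5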

Elastic Budget Control: Different Models for Different Reasoning Phases

Existing budget control in reasoning models, including Nemotron Nano v3’s own default behavior, works by capping the number of tokens generated during the thinking phase before forcing a final answer. This approach uses the same model throughout. Star Elastic unlocks a different strategy: using different nested submodels for the thinking phase versus the answering phase.

The researchers evaluated four configurations. The optimal one, called ℳS → ℳL (small model for thinking, large model for answering), allocates a cheaper model to generate extended reasoning traces and reserves the full-capacity model for synthesizing the final answer. The 23B → 30B configuration in particular advances the accuracy–latency Pareto frontier, achieving up to 16% higher accuracy and 1.9× lower latency compared to default Nemotron Nano v3 budget control. The intuition: reasoning tokens are high-volume but tolerant of some capacity reduction; the final answer requires higher precision.
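
A hedged sketch of what ℳS → ℳL could look like at inference time, assuming you have already sliced a small and a large variant out of the checkpoint and loaded them as separate Transformers models (the slicing step, the two-model setup, and the prompt stitching below are assumptions, not a documented API):

python

def generate_text(model, tokenizer, prompt: str, max_new_tokens: int) -> str:
    """Single-turn chat generation with the Transformers API."""
    messages = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, tokenize=True,
        return_dict=True, return_tensors="pt",
    ).to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens,
                         temperature=0.6, top_p=0.95, do_sample=True)
    return tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:],
                            skip_special_tokens=True)

def think_then_answer(model_small, model_large, tokenizer, question: str) -> str:
    # Phase 1: the cheaper sliced submodel (e.g., 23B) writes the long reasoning trace.
    thinking = generate_text(model_small, tokenizer, question, max_new_tokens=8192)
    # Phase 2: the full 30B parent reads the trace and produces only the short,
    # precision-critical final answer.
    follow_up = f"{question}\n\nDraft reasoning:\n{thinking}\n\nGive the final answer."
    return generate_text(model_large, tokenizer, follow_up, max_new_tokens=512)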

Quantization Without Breaking the Nested Structure

A naive approach to deploying a quantized elastic model would be to quantize each variant separately after slicing. That breaks the nested weight-sharing property and requires a separate quantization pass per size. Instead, Star Elastic applies Quantization-Aware Distillation (QAD) directly on the elastic checkpoint, preserving the nested mask hierarchy throughout.

For FP8 (E4M3 format), post-training quantization (PTQ) is sufficient, recovering 98.69% of BF16 accuracy on the 30B variant. For NVFP4 (NVIDIA’s 4-bit floating-point format), PTQ alone causes a 4.12% average accuracy drop, so a short nested QAD phase (~5B tokens at 48K context) brings recovery back to 97.79% for the 30B variant. In both cases, zero-shot slicing of the 23B and 12B variants from the single quantized checkpoint is preserved.
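
Conceptually, QAD keeps the standard distillation loop but fake-quantizes the elastic student in the forward pass while gradients flow in higher precision. The sketch below uses a generic symmetric integer grid as a stand-in for NVFP4/FP8 numerics (the real formats use floating-point grids with per-block scales), so treat it as an assumption-laden illustration:

python

import torch

def fake_quantize(x: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Straight-through fake quantization (stand-in for NVFP4/FP8 numerics)."""
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().amax() / qmax + 1e-8
    q = torch.clamp(torch.round(x / scale), -qmax, qmax) * scale
    # STE: quantized values in the forward pass, full-precision gradients backward.
    return x + (q - x).detach()

# During QAD, the elastic student runs with fake-quantized weights under the same
# nested masks, while the BF16 parent supplies the distillation targets.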

The memory implications are significant. Storing separate 12B, 23B, and 30B BF16 checkpoints requires 126.1 GB; the single elastic checkpoint requires 58.9 GB. The 30B NVFP4 elastic checkpoint fits in 18.7 GB, enabling the 12B NVFP4 variant to run on an RTX 5080 where every BF16 configuration runs out of memory. On an RTX Pro 6000, the 12B NVFP4 variant reaches 7,426 tokens/s, a 3.4× throughput improvement over the 30B BF16 baseline.

Depth vs. Width: Why Star Elastic Compresses Width

One design choice worth calling out explicitly: the research team compared two compression strategies—removing layers entirely (depth compression) versus reducing internal dimensions like hidden size, expert count, and head count (width compression). With a 15% parameter reduction and 25B tokens of knowledge distillation, width compression recovered 98.1% of baseline performance while depth compression recovered only 95.2%, with noticeable degradation on HumanEval and MMLU-Pro. As a result, Star Elastic prioritizes width-based elasticity for its main results, though depth compression (layer skipping) remains available as a mechanism for extreme latency-constrained scenarios.

On the evaluation suite—AIME-2025, GPQA, LiveCodeBench v5, MMLU-Pro, IFBench, and Tau Bench—the Elastic-30B variant matches its parent Nemotron Nano v3 30B on most benchmarks, while the Elastic-23B and Elastic-12B variants remain competitive against independently trained models of similar sizes. The Elastic-23B notably scores 85.63 on AIME-2025 versus Qwen3-30B-A3B’s 80.00, despite having fewer active parameters.

On training cost, the research team reports a 360× token reduction compared to pretraining each variant from scratch, and a 7× reduction over prior state-of-the-art compression methods that require sequential distillation runs per model size. The 12B variant runs at 2.4× the throughput of the 30B parent on an H100 GPU at bfloat16 with the same input/output sequence lengths.

How to Use NVIDIA Star Elastic

Step-by-Step Guide

Nemotron Nano v3 Elastic — 30B / 23B / 12B in one checkpoint  ·  BF16 / FP8 / NVFP4

Star Elastic models are distributed via Hugging Face and support both
Transformers (for experimentation) and vLLM
(recommended for production inference). Pick the option that fits your use case.

bash

# Option A — vLLM (recommended for production serving)
pip install vllm

# Option B — Transformers (for local experimentation)
pip install transformers torch accelerate

# Optional: log in to Hugging Face if needed
pip install huggingface_hub
huggingface-cli login



Hardware note: The 30B BF16 checkpoint requires ~60 GB VRAM for the full nested family.
Use FP8 (~31 GB) or NVFP4 (~19 GB) for H100/A100 or RTX-class deployment.

A single checkpoint contains all three nested variants — 30B (3.6A),
23B (2.8A), and 12B (2.0A). Load once; extract any variant
without retraining. The model requires trust_remote_code=True for the hybrid
Mamba–Transformer–MoE architecture.

python

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# The 30B BF16 elastic checkpoint — contains all 3 nested variants
model_id = "nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-BF16"

tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto"     # distributes across available GPUs
)

print(f"Model loaded: {model_id}")



Active vs. total parameters: “30B total / 3.6B active” means the model stores
30B weights but only routes each token through 3.6B parameters per forward pass — this is how
Mixture-of-Experts (MoE) works.

The model generates a reasoning chain (wrapped in thinking tags) before producing its final answer. Control the total token budget via max_new_tokens — higher values allow longer reasoning traces on hard problems.

python

messages = [
    {
        "role": "user",
        "content": "What is the time complexity of QuickSort, and why?"
    }
]

# Apply chat template and tokenize
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device)

# Generate — the model emits its reasoning trace, then the final answer
outputs = model.generate(
    **inputs,
    max_new_tokens=4096,    # thinking + answer budget
    temperature=0.6,
    top_p=0.95,
    do_sample=True
)

response = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[-1]:],
    skip_special_tokens=True
)
print(response)



Thinking budget tip: For math/coding problems, set max_new_tokens
to 8192–32768. For simpler queries, 2048–4096 is sufficient and reduces latency.

For production deployments, use vLLM to serve the model via an
OpenAI-compatible REST API. This enables batched inference, continuous batching,
and higher throughput — the 12B variant achieves 2.4× the throughput
of the 30B parent on an H100 GPU.

bash

# Start the vLLM server (OpenAI-compatible)
vllm serve "nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-BF16"

# --- In a separate terminal ---

# Query the server via curl
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-BF16",
    "messages": [
      {
        "role": "user",
        "content": "Explain gradient descent in 3 steps."
      }
    ],
    "max_tokens": 4096,
    "temperature": 0.6
  }'

# Or run via Docker
docker model run hf.co/nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-BF16



SGLang alternative: SGLang is also supported —
run python3 -m sglang.launch_server --model-path "nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-BF16" --port 30000
for a drop-in alternative to vLLM.

Three quantized checkpoints are available. All preserve the nested structure
— the 23B and 12B submodels can be extracted zero-shot from whichever precision checkpoint
you load. NVFP4 uses Quantization-Aware Distillation (QAD) to recover accuracy lost from PTQ.

bash

# BF16 — full precision, all nested variants in 58.9 GB
vllm serve "nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-BF16"

# FP8 (E4M3) — ~2× smaller, 30B fits in 31.4 GB
# Post-training quantization, 98.69% accuracy recovery on 30B
vllm serve "nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-FP8"

# NVFP4 — smallest footprint, 30B fits in 18.7 GB
# 12B NVFP4 variant runs on RTX 5080 (BF16 OOMs)
# 12B NVFP4 on RTX Pro 6000: 7,426 tokens/s (3.4× vs 30B BF16)
vllm serve "nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-NVFP4"

Variant     | 30B memory | 23B memory | 12B memory | Best for
BF16 Full   | 58.9 GB    | 44.0 GB    | 23.2 GB    | A100 / H100
FP8 PTQ     | 31.4 GB    | 23.7 GB    | 13.0 GB    | H100 / A100 / RTX 5090
NVFP4 QAD   | 18.7 GB    | 14.1 GB    | 8.0 GB     | RTX 5080 / 5090 / Pro 6000



Key Takeaways

  • Star Elastic trains 30B, 23B, and 12B nested reasoning models from a single 160B-token post-training run, achieving a 360× token reduction over pretraining from scratch.
  • Elastic budget control (23B for thinking, 30B for answering) improves the accuracy–latency Pareto frontier, delivering up to 16% higher accuracy and 1.9× lower latency.
  • A learnable router with Gumbel-Softmax enables end-to-end trainable architecture selection, eliminating the need for separate compression runs per model size.
  • Nested QAD preserves zero-shot slicing across FP8 and NVFP4 quantized checkpoints, reducing the 30B elastic checkpoint to 18.7 GB in NVFP4.
  • All three precision variants (BF16, FP8, NVFP4) are publicly available on Hugging Face under nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B.

Check out the Paper and the Elastic models on Hugging Face (BF16, FP8, and NVFP4).
