Training a family of large language models (LLMs) has always come with a painful multiplier: every model variant in the family—whether 8B, 30B, or 70B—typically requires its own full training run, its own storage, and its own deployment stack. For a dev team running inference at scale, this means multiplying compute costs by the number of model sizes they want to support. NVIDIA researchers are now proposing a different approach called Star Elastic.
Star Elastic is a post-training method that embeds multiple nested submodels—at different parameter budgets—inside a single parent reasoning model, using a single training run. Applied to Nemotron Nano v3 (a hybrid Mamba–Transformer–MoE model with 30B total parameters and 3.6B active parameters), Star Elastic produces 23B (2.8B active) and 12B (2.0B active) nested variants trained with approximately 160B tokens. All three variants live in one checkpoint and can be extracted without any additional fine-tuning.
What Does “Nested” Actually Mean Here?
If you haven’t encountered elastic or nested architectures before, the idea is this: instead of training three separate 30B, 23B, and 12B models, you train one model that contains the smaller ones as subsets of itself. The smaller submodels reuse the most important weights from the parent, identified through a process called importance estimation.
Star Elastic scores each model component (embedding channels, attention heads, Mamba SSM heads, MoE experts, and FFN channels) by how much it contributes to model accuracy. Components are then ranked, so smaller-budget submodels always use the highest-ranked contiguous subset of components from the larger model. This property is called nested weight-sharing.
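The nested weight-sharing property is easiest to see in code. The sketch below is illustrative, not the paper's implementation: a toy importance score stands in for the paper's importance estimation, and the point is that once components are sorted by importance, every smaller budget is a contiguous prefix of the larger one.

```python
import numpy as np

# Hypothetical importance-based reordering for nested weight-sharing.
rng = np.random.default_rng(0)
n_channels = 16
weights = rng.normal(size=(n_channels, 8))   # toy stand-in for a parent FFN weight slice
importance = np.abs(weights).sum(axis=1)     # toy importance score per channel

order = np.argsort(-importance)              # rank channels, most important first
sorted_weights = weights[order]              # reorder once, up front

# Nested slicing: the 8-channel submodel is a prefix of the 12-channel one,
# which is a prefix of the full 16-channel parent.
sub_12 = sorted_weights[:12]
sub_8 = sorted_weights[:8]

assert np.array_equal(sub_8, sub_12[:8])     # nested weight-sharing holds
```

Because every budget is a prefix of the next, one checkpoint stores all variants, and "extracting" a submodel is just a slice.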
The method supports nesting along multiple axes: the SSM (State Space Model) dimension, embedding channels, attention heads, Mamba heads and head channels, MoE expert count, and FFN intermediate dimension. For MoE layers specifically, Star Elastic uses Router-Weighted Expert Activation Pruning (REAP), which ranks experts by both routing gate values and expert output magnitudes—a more principled signal than naive frequency-based pruning, which ignores how much each expert actually contributes to the layer output.
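A minimal sketch of REAP-style expert scoring, under the assumptions above (all names and tensors are illustrative, not the paper's API): each expert is scored by routing gate value times the magnitude of its output, averaged over tokens, rather than by activation frequency alone.

```python
import numpy as np

rng = np.random.default_rng(1)
n_tokens, n_experts, d = 256, 8, 32
gates = rng.dirichlet(np.ones(n_experts), size=n_tokens)  # toy router probabilities
expert_out = rng.normal(size=(n_tokens, n_experts, d))    # toy per-expert outputs

out_norms = np.linalg.norm(expert_out, axis=-1)           # output magnitude per token
reap_score = (gates * out_norms).mean(axis=0)             # gate-weighted contribution

# Naive frequency-based score, for contrast: how often each expert wins routing,
# ignoring how much its output actually contributes.
freq_score = (gates.argmax(axis=1)[:, None] == np.arange(n_experts)).mean(axis=0)

# Keep the top-k experts by REAP score for a smaller-budget submodel.
k = 4
kept = np.argsort(-reap_score)[:k]
```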
A Learnable Router, Not a Fixed Compression Recipe
A key distinction from prior compression methods like Minitron is that Star Elastic uses an end-to-end trainable router to determine the nested submodel architectures. The router takes a target budget (e.g., “give me a 2.8B active parameter model”) as a one-hot input and outputs differentiable masks that select which components are active at that budget level. These masks are trained jointly with the model through Gumbel-Softmax, which allows gradient flow through discrete architectural decisions.
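To make the mechanism concrete, here is a forward-pass-only sketch of Gumbel-Softmax selection; the router weights and the four candidate widths are hypothetical, and a real implementation would use an autograd framework so gradients flow through the soft sample.

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Differentiable approximation of a one-hot sample (forward pass only)."""
    rng = rng or np.random.default_rng()
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel(0, 1) noise
    y = (logits + g) / tau
    y = np.exp(y - y.max())
    return y / y.sum()

# Hypothetical router head: a one-hot budget input selects soft masks over,
# say, 4 candidate FFN widths. During training the soft sample carries
# gradients; at inference, argmax picks a discrete architecture.
rng = np.random.default_rng(2)
budget_onehot = np.array([0.0, 1.0, 0.0])   # e.g. the "2.8B active" slot
router_w = rng.normal(size=(3, 4))          # toy router weights
logits = budget_onehot @ router_w
mask = gumbel_softmax(logits, tau=0.5, rng=rng)
chosen_width = int(mask.argmax())
```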
The loss function combines knowledge distillation (KD), where the non-elastified parent model acts as the teacher, with a router loss that penalizes deviation from the target resource budget (parameter count, memory, or latency). This means the router learns to make architecture choices that actually improve accuracy under KD, rather than just minimizing a proxy metric.
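A toy version of that combined objective, under stated assumptions: a KD term (KL divergence between teacher and student token distributions) plus a quadratic penalty on budget deviation. The tensors, the parameter-count budget, and the 0.1 weighting are all illustrative.

```python
import numpy as np

def kd_loss(teacher_logits, student_logits):
    """KL(teacher || student) over the vocabulary, averaged over positions."""
    t = np.exp(teacher_logits - teacher_logits.max(-1, keepdims=True))
    t /= t.sum(-1, keepdims=True)
    s = np.exp(student_logits - student_logits.max(-1, keepdims=True))
    s /= s.sum(-1, keepdims=True)
    return float((t * (np.log(t) - np.log(s))).sum(-1).mean())

def router_loss(selected_params, target_params):
    """Penalize relative deviation from the target budget."""
    return float(((selected_params - target_params) / target_params) ** 2)

rng = np.random.default_rng(3)
teacher = rng.normal(size=(4, 10))                 # toy teacher logits
student = teacher + 0.1 * rng.normal(size=(4, 10)) # toy student logits

# e.g. router selected ~2.9B active params against a 2.8B target
total = kd_loss(teacher, student) + 0.1 * router_loss(2.9e9, 2.8e9)
```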
Training uses a two-stage curriculum: a short-context phase (sequence length 8,192 tokens) with uniform budget sampling, followed by an extended-context phase (sequence length 49,152 tokens) with non-uniform sampling that prioritizes the full 30B model (p(30B)=0.5, p(23B)=0.3, p(12B)=0.2). The extended-context phase is critical for reasoning performance: the research team’s ablations on Nano v2, which serve as the empirical basis for the same curriculum choice on Nano v3, show gains of up to 19.8% on AIME-2025 for the 6B variant and 4.0 percentage points for the 12B variant from Stage 2 alone.
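Stage 2's non-uniform budget sampling can be sketched in a few lines; the helper name is illustrative, but the probabilities are the ones from the article.

```python
import random

def sample_budget(rng):
    """Draw a training budget with Stage 2's non-uniform probabilities."""
    return rng.choices(["30B", "23B", "12B"], weights=[0.5, 0.3, 0.2])[0]

rng = random.Random(0)
draws = [sample_budget(rng) for _ in range(10_000)]
# Empirical frequencies converge to roughly 0.5 / 0.3 / 0.2.
```

Each training step would then apply the router masks for the sampled budget, so the full 30B model sees half of the extended-context updates.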
Elastic Budget Control: Different Models for Different Reasoning Phases
Existing budget control in reasoning models, including Nemotron Nano v3’s own default behavior, works by capping the number of tokens generated during the thinking phase before forcing a final answer. This approach uses the same model throughout. Star Elastic unlocks a different strategy: using different nested submodels for the thinking phase versus the answering phase.
The researchers evaluated four configurations. The optimal one, called ℳS → ℳL (small model for thinking, large model for answering), allocates a cheaper model to generate extended reasoning traces and reserves the full-capacity model for synthesizing the final answer. The 23B → 30B configuration in particular advances the accuracy–latency Pareto frontier, achieving up to 16% higher accuracy and 1.9× lower latency compared to default Nemotron Nano v3 budget control. The intuition: reasoning tokens are high-volume but tolerant of some capacity reduction; the final answer requires higher precision.
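The ℳS → ℳL control flow is simple to sketch. Everything here is a stand-in for a real inference stack (the `generate`-style callables, the `</think>` delimiter, the token budgets); the point is the phase split, not the API.

```python
# Sketch of elastic budget control (M_S -> M_L): the sliced 23B submodel
# generates the thinking trace, then the full 30B model writes the answer.

def elastic_budget_control(small_model, large_model, prompt, think_budget):
    # Phase 1: cheap, high-volume reasoning tokens from the small submodel.
    trace = small_model(prompt, max_new_tokens=think_budget, stop="</think>")
    # Phase 2: full-capacity model synthesizes the final answer.
    return large_model(prompt + trace + "</think>", max_new_tokens=512)

# Toy stand-in models so the control flow is runnable.
small = lambda text, max_new_tokens, stop=None: " step1 step2"
large = lambda text, max_new_tokens: "42"

answer = elastic_budget_control(small, large, "<think>", think_budget=1024)
```

Because both models live in one checkpoint and share weights, switching between them mid-request does not require loading a second model into memory.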
Quantization Without Breaking the Nested Structure
A naive approach to deploying a quantized elastic model would be to quantize each variant separately after slicing. That breaks the nested weight-sharing property and requires a separate quantization pass per size. Instead, Star Elastic applies Quantization-Aware Distillation (QAD) directly on the elastic checkpoint, preserving the nested mask hierarchy throughout.
For FP8 (E4M3 format), post-training quantization (PTQ) is sufficient, recovering 98.69% of BF16 accuracy on the 30B variant. For NVFP4 (NVIDIA’s 4-bit floating-point format), PTQ alone causes a 4.12% average accuracy drop, so a short nested QAD phase (~5B tokens at 48K context) brings recovery back to 97.79% for the 30B variant. In both cases, zero-shot slicing of the 23B and 12B variants from the single quantized checkpoint is preserved.
The memory implications are significant. Storing separate 12B, 23B, and 30B BF16 checkpoints requires 126.1 GB; the single elastic checkpoint requires 58.9 GB. The 30B NVFP4 elastic checkpoint fits in 18.7 GB, enabling the 12B NVFP4 variant to run on an RTX 5080 where every BF16 configuration runs out of memory. On an RTX Pro 6000, the 12B NVFP4 variant reaches 7,426 tokens/s, a 3.4× throughput improvement over the 30B BF16 baseline.
Depth vs. Width: Why Star Elastic Compresses Width
One design choice worth calling out explicitly: the research team compared two compression strategies—removing layers entirely (depth compression) versus reducing internal dimensions like hidden size, expert count, and head count (width compression). With a 15% parameter reduction and 25B tokens of knowledge distillation, width compression recovered 98.1% of baseline performance while depth compression recovered only 95.2%, with noticeable degradation on HumanEval and MMLU-Pro. As a result, Star Elastic prioritizes width-based elasticity for its main results, though depth compression (layer skipping) remains available as a mechanism for extreme latency-constrained scenarios.
On the evaluation suite—AIME-2025, GPQA, LiveCodeBench v5, MMLU-Pro, IFBench, and Tau Bench—the Elastic-30B variant matches its parent Nemotron Nano v3 30B on most benchmarks, while the Elastic-23B and Elastic-12B variants remain competitive against independently trained models of similar sizes. The Elastic-23B notably scores 85.63 on AIME-2025 versus Qwen3-30B-A3B’s 80.00, despite having fewer active parameters.
On training cost, the research team reports a 360× token reduction compared to pretraining each variant from scratch, and a 7× reduction over prior state-of-the-art compression methods that require sequential distillation runs per model size. The 12B variant runs at 2.4× the throughput of the 30B parent on an H100 GPU at bfloat16 with the same input/output sequence lengths.
Key Takeaways
- Star Elastic trains 30B, 23B, and 12B nested reasoning models from a single 160B-token post-training run, achieving a 360× token reduction over pretraining from scratch.
- Elastic budget control (23B for thinking, 30B for answering) improves the accuracy–latency Pareto frontier by up to 16% accuracy and 1.9× latency gains.
- A learnable router with Gumbel-Softmax enables end-to-end trainable architecture selection, eliminating the need for separate compression runs per model size.
- Nested QAD preserves zero-shot slicing across FP8 and NVFP4 quantized checkpoints, reducing the 30B elastic checkpoint to 18.7 GB in NVFP4.
- All three precision variants (BF16, FP8, NVFP4) are publicly available on Hugging Face under nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B.
Check out the Paper and the Elastic Models on Hugging Face (BF16, FP8, and NVFP4).
