Prime Intellect has released prime-rl version 0.6.0. The framework targets reinforcement learning on trillion-parameter Mixture-of-Experts (MoE) models. It focuses on heavy agentic workloads, like long-horizon software-engineering tasks.
The research team trained GLM-5 on SWE tasks at up to 131k sequence length. Step times stayed under five minutes. The batch size was 256 rollouts. The run used only 28 H200 nodes.
TL;DR
- prime-rl 0.6.0 trains trillion-parameter MoE models on agentic RL workloads.
- GLM-5 trained on SWE at 131k sequence length, sub-5-minute steps, 28 H200 nodes.
- Asynchronous RL disaggregates trainer and inference for independent optimization.
- Inference uses FP8, Wide EP, P/D disaggregation, KV offloading, and router replay.
- Training uses 3-D parallelism (FSDP, EP, CP) plus block-scaled FP8.
What is prime-rl 0.6.0?
prime-rl is an open framework for asynchronous reinforcement learning. It post-trains large open-source models on agentic tasks. Version 0.6.0 extends this to trillion-parameter MoE scale.
The example model in the announcement is zai-org/GLM-5.1. The optimizations also apply to other large MoE models. Examples include moonshotai/Kimi-K2.7-Code and nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16.
A full GLM-5.1 run starts with one command on a Slurm cluster.
uv run rl @ examples/glm5_llmd/rl.toml --output-dir /shared/outputs/glm5-llmd
Role of asynchronous RL
Agentic tasks have long-tail outliers. Some coding rollouts run for hours. Waiting for them before each policy update would idle GPUs.
Asynchronous RL avoids this. The trainer and inference systems are disaggregated. They run and scale independently. The inference policy updates as soon as the optimizer step finishes.
There is one synchronization point: the policy update. prime-rl pushes new weights as soon as they exist. Already-dispatched rollouts keep their active prefix cache. So a single rollout may mix tokens from several policy versions.
New rollouts behave differently. They repopulate their own KV cache, even when prefixes match. A KV-cache salt forces this. Requests from too old a policy are dropped. The max_off_policy_steps value controls that threshold.
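The staleness rule and the cache salt can be sketched in a few lines of Python. This is an illustration, not prime-rl's code: only max_off_policy_steps comes from the announcement. The Rollout class, is_fresh, cache_salt, and prefix_cache_key are hypothetical names, and both the inclusive threshold and the per-policy-version salt are assumptions.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Rollout:
    rollout_id: str
    start_policy_step: int  # policy version active when the rollout was dispatched


def is_fresh(rollout: Rollout, trainer_step: int, max_off_policy_steps: int) -> bool:
    """Keep a rollout only if it lags the trainer by at most the allowed steps.

    Assumption: the threshold is inclusive; the real comparison may differ.
    """
    return trainer_step - rollout.start_policy_step <= max_off_policy_steps


def cache_salt(policy_step: int) -> str:
    """Hypothetical salt: one value per policy version."""
    return f"policy-{policy_step}"


def prefix_cache_key(token_ids: list[int], salt: str) -> str:
    """Hash the salt together with the prefix tokens.

    Identical prefixes under different salts map to different cache entries,
    so a rollout started on a new policy cannot reuse KV blocks computed
    by an older one.
    """
    h = hashlib.sha256()
    h.update(salt.encode())
    h.update(b"|")
    h.update(",".join(map(str, token_ids)).encode())
    return h.hexdigest()


rollouts = [Rollout("a", 10), Rollout("b", 9), Rollout("c", 12)]
kept = [r.rollout_id for r in rollouts if is_fresh(r, trainer_step=12, max_off_policy_steps=2)]
```

With the trainer at step 12 and a threshold of 2, rollout "b" (started at step 9) is dropped while "a" and "c" are kept.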
Inference optimizations
Inference is usually the throughput bottleneck in an RL system. prime-rl optimizes for throughput, while keeping latency bounded.
FP8 inference: Lower precision speeds up prefill and decode. prime-rl uses FP8 with DeepEP and DeepGEMM kernels.
Wide Expert Parallelism: Wide EP spreads experts across 32 or more GPUs. It pairs with a large data-parallel size, for example 32. Each GPU holds separate experts and serves as an endpoint. Synchronization happens per-layer, through dispatch and combine operations.
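The dispatch and combine pattern can be shown with a toy, single-process sketch. Everything here is illustrative: the expert count, the EP size, the contiguous expert-to-rank mapping, and the toy expert function are assumptions, and real systems run these steps as all-to-all collectives rather than Python loops.

```python
from collections import defaultdict

NUM_EXPERTS = 8   # assumed toy value
EP_SIZE = 4       # assumed toy value: each rank owns NUM_EXPERTS // EP_SIZE experts


def owner_rank(expert_id: int) -> int:
    """Map an expert to the rank that holds it (contiguous blocks, an assumption)."""
    return expert_id // (NUM_EXPERTS // EP_SIZE)


def dispatch(token_routes):
    """Group (token, expert, weight) triples by the rank that owns the expert.

    token_routes: list of (token_id, [(expert_id, gate_weight), ...]).
    """
    per_rank = defaultdict(list)
    for token_id, routes in token_routes:
        for expert_id, weight in routes:
            per_rank[owner_rank(expert_id)].append((token_id, expert_id, weight))
    return per_rank


def run_expert(expert_id: int, x: float) -> float:
    """Toy stand-in for an expert MLP."""
    return x * (expert_id + 1)


def combine(per_rank, inputs):
    """Run each rank's local experts, then sum the weighted outputs per token."""
    out = defaultdict(float)
    for items in per_rank.values():
        for token_id, expert_id, weight in items:
            out[token_id] += weight * run_expert(expert_id, inputs[token_id])
    return dict(out)


inputs = {0: 1.0, 1: 2.0}
routes = [(0, [(0, 0.5), (7, 0.5)]), (1, [(2, 1.0)])]
per_rank = dispatch(routes)
outputs = combine(per_rank, inputs)
```

Token 0 is sent to two different ranks (experts 0 and 7), which is why every MoE layer needs one dispatch and one combine across the EP group.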
Prefill and Decode Disaggregation: Some model and environment pairs hit a 4:1 prefill-to-decode token ratio. Sharing workers between the two phases would inflate end-to-end latency, which reduces the benefits of PipelineRL. P/D disaggregation separates prefill and decode workers, so long tool outputs stop throttling decode workers.
KV cache management: High concurrency needs large KV cache space. prime-rl supports tiered offloading to CPU and disk. vLLM native offloading creates one pool per worker. Mooncake Store instead pools RAM and disk across all nodes centrally.
Request routing: prime-rl ships a fork of vllm-router by default. It also supports the NVIDIA Dynamo router as a drop-in. Routers score workers using KV cache reuse, queue depth, and live load.
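A router that weighs those three signals might look like the sketch below. It is not the vllm-router or Dynamo scoring function: the Worker fields, the linear score, and the weights are all hypothetical and only illustrate the trade-off between cache reuse and load.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    cached_prefix: tuple   # tokens whose KV cache the worker already holds
    queue_depth: int       # requests waiting on this worker
    load: float            # live utilization in [0, 1]


def shared_prefix_len(a, b) -> int:
    """Count the leading tokens that two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def score(worker: Worker, prompt, w_reuse=1.0, w_queue=0.1, w_load=0.5) -> float:
    """Higher is better: reward KV reuse, penalize queueing and load.

    The weights are arbitrary illustration values.
    """
    reuse = shared_prefix_len(worker.cached_prefix, prompt) / max(len(prompt), 1)
    return w_reuse * reuse - w_queue * worker.queue_depth - w_load * worker.load


def pick_worker(workers, prompt) -> Worker:
    return max(workers, key=lambda w: score(w, prompt))


prompt = (1, 2, 3, 4)
workers = [
    Worker("a", (1, 2, 3, 9), queue_depth=2, load=0.5),    # good reuse, moderate load
    Worker("b", (), queue_depth=0, load=0.1),              # idle, nothing cached
    Worker("c", (1, 2, 3, 4), queue_depth=10, load=0.9),   # full reuse, overloaded
]
chosen = pick_worker(workers, prompt)
```

In this toy setup worker "a" wins: worker "c" holds the whole prefix but is too backed up, and worker "b" is idle but would have to recompute everything.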
Router replay (R3): Trainer↔inference mismatch silently kills training. Router replay captures inference routing decisions. It replays them directly on the trainer. This cuts KL mismatch by roughly an order of magnitude. Routed experts have shape [num_layers, top_k, seq_len]. This payload can grow to hundreds of GB. At scale, the data rate reaches tens of Gbps. So prime-rl treats it as an opaque payload. Optimized PyTorch operations handle the processing.
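The idea behind router replay, and the size of its payload, can be sketched as follows. This is a simplified illustration: the function names are hypothetical, the softmax gating over the replayed experts is an assumption about how the trainer reuses the indices, and top_k=8, 256 rollouts, and 8-byte indices are assumed values. Only the 78 layers and the 131k sequence length come from the article.

```python
import math


def top_k_indices(logits, k):
    """Experts the router would pick on its own from these logits."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]


def gate_weights(logits, experts):
    """Softmax over the chosen experts only (a common gating form, assumed here)."""
    m = max(logits[i] for i in experts)
    exps = [math.exp(logits[i] - m) for i in experts]
    z = sum(exps)
    return [e / z for e in exps]


# Small numeric differences between inference and trainer can flip the top-k set.
inference_logits = [2.0, 0.50, 0.49, -1.0]
trainer_logits = [2.0, 0.49, 0.50, -1.0]

recorded = top_k_indices(inference_logits, k=2)     # captured at inference time
recomputed = top_k_indices(trainer_logits, k=2)     # what the trainer would choose alone

# Replay: the trainer uses the recorded experts, with gate weights from its own logits.
replayed_gates = gate_weights(trainer_logits, recorded)


def payload_bytes(num_layers, top_k, seq_len, num_rollouts, bytes_per_index):
    """Size of routed-expert indices of shape [num_layers, top_k, seq_len] per rollout."""
    return num_layers * top_k * seq_len * num_rollouts * bytes_per_index


size_gb = payload_bytes(78, 8, 131072, 256, 8) / 1e9
```

Without replay, the trainer would send this token through experts 0 and 2 while inference used experts 0 and 1. Under the assumed values, one full batch of routing indices is on the order of the "hundreds of GB" quoted above.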
Training optimizations
The trainer builds on torchtitan, a PyTorch-native training codebase. It relies on 3-D parallelism: FSDP, CP, and EP. The GLM-5 case study uses all three.
| Strategy | What it shards | Primary use | Key detail |
|---|---|---|---|
| FSDP (FSDP2) | Parameters, gradients, optimizer states | Baseline memory amortization | Gathers weights on demand per layer via fully_shard |
| Expert Parallelism (EP) | Experts within a layer | Shrinks active layer memory | all2all dispatch/combine; torch-native or DeepEP |
| Context Parallelism (CP) | The sequence dimension | Long-context activation memory | Ulysses (default) or Ring Attention |
EP exists because layers stay huge after FSDP. With 78 layers and 800B params in float32, one layer’s all-gather needs roughly 40GB. Prefetching the next layer to overlap communication with compute roughly doubles that, to near 80GB. Setting EP=8 dispatches tokens instead of gathering full experts. torch-native all2all is slightly faster within one node. DeepEP wins when EP spans multiple nodes.
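The memory figures above follow from simple arithmetic, reproduced here as a back-of-the-envelope check. The parameter count, layer count, and float32 width come from the text. Treating all layers as equal in size and all parameters as expert weights are simplifying assumptions, so the EP figure is only a rough estimate.

```python
TOTAL_PARAMS = 800e9      # from the text
NUM_LAYERS = 78           # from the text
BYTES_PER_PARAM = 4       # float32
EP = 8                    # expert-parallel degree from the text

# One layer fully gathered on a GPU (assumes equally sized layers).
layer_gb = TOTAL_PARAMS * BYTES_PER_PARAM / NUM_LAYERS / 1e9

# Current layer plus one prefetched layer held at the same time.
overlapped_gb = 2 * layer_gb

# With EP, a rank only materializes its own experts.
# Assumption: nearly all layer parameters are expert weights.
ep_layer_gb = layer_gb / EP

print(f"per-layer all-gather: {layer_gb:.1f} GB")
print(f"with one-layer overlap: {overlapped_gb:.1f} GB")
print(f"per-layer with EP={EP}: {ep_layer_gb:.1f} GB")
```

The first two lines reproduce the roughly 40GB and 80GB figures; the third shows why sharding experts across 8 ranks brings the per-layer footprint back to single-digit gigabytes under these assumptions.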
CP matters at 131k+ sequence length. There, activations dominate memory, not parameters. GLM-5 uses DSA, which neither Ulysses nor Ring Attention parallelizes directly. So prime-rl ships a custom context-parallel implementation for it.
FP8 training: prime-rl uses DeepGEMM block-scaled FP8, as proposed by DeepSeek V3. This rarely raises throughput, due to quantization overhead. Its real value is matching trainer and inference precision. That reduces KL mismatch and stabilizes training.
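Block scaling can be illustrated with a pure-Python simulation. This is a sketch of the general technique, not DeepGEMM's kernels: it assumes the E4M3 format (maximum finite value 448, 3 mantissa bits) and 1-D blocks of 128 values, and it emulates FP8 rounding in software instead of using hardware types. All function names are hypothetical.

```python
import math

FP8_E4M3_MAX = 448.0   # largest finite E4M3 value
BLOCK = 128            # assumed block width


def round_to_e4m3(x: float) -> float:
    """Round to the nearest E4M3-representable value (software emulation)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), FP8_E4M3_MAX)      # saturate instead of overflowing
    _, e = math.frexp(mag)               # mag = m * 2**e with m in [0.5, 1)
    exp = max(e - 1, -6)                 # -6 is the smallest normal exponent
    step = 2.0 ** (exp - 3)              # 3 mantissa bits: 8 steps per power of two
    return sign * min(round(mag / step) * step, FP8_E4M3_MAX)


def quantize_blockwise(values, block=BLOCK):
    """Return FP8-rounded codes plus one scale per block."""
    codes, scales = [], []
    for start in range(0, len(values), block):
        chunk = values[start:start + block]
        amax = max(abs(v) for v in chunk)
        # Map the block's largest magnitude onto the top of the FP8 range.
        scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
        scales.append(scale)
        codes.extend(round_to_e4m3(v / scale) for v in chunk)
    return codes, scales


def dequantize_blockwise(codes, scales, block=BLOCK):
    """Multiply each code by the scale of the block it belongs to."""
    return [c * scales[i // block] for i, c in enumerate(codes)]


# Two blocks with very different magnitudes each get their own scale.
values = [math.sin(i) * (1 if i < BLOCK else 1000) for i in range(2 * BLOCK)]
codes, scales = quantize_blockwise(values)
restored = dequantize_blockwise(codes, scales)
```

Because each block carries its own scale, the small-magnitude block keeps its resolution even though the neighboring block is a thousand times larger. A single tensor-wide scale would push the small block into the coarse low end of the FP8 range.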
Use cases with examples
- Long-horizon SWE agents: Train a model on real repository issues. Rollouts can span hundreds of turns and tool calls. P/D disaggregation keeps decode latency predictable here.
- 1T-scale post-training on fewer nodes: The GLM-5 run fit on 28 H200 nodes. Wide EP and KV offloading raise concurrency and throughput.
- Stable agentic RL at scale: Router replay and FP8 training both reduce trainer↔inference KL mismatch. Lower mismatch means steadier training.
