Scaling large language models (LLMs) is expensive. Every token processed during inference and every gradient computed during training flows through feedforward layers, which account for over two-thirds of model parameters and more than 80% of total FLOPs in larger models. A team of researchers from Sakana AI and NVIDIA has published new research that directly targets this bottleneck — not by changing the architecture, but by making the computation inside feedforward layers significantly cheaper through unstructured sparsity.
Sparsity Exists, But GPUs Ignore It
Inside a transformer’s feedforward block, for any given input token, only a small fraction of hidden neurons actually fire — the rest produce zero after passing through the activation function. This is called activation sparsity, and prior work has documented this phenomenon in models with ReLU activations.
The frustrating reality is that these theoretical savings rarely translate into actual speedups. NVIDIA GPUs are heavily optimized for dense matrix multiplications using Tensor Cores, which operate on large contiguous tiles of data. Traditional sparse formats like ELLPACK (ELL) require a separate kernel pass to convert activations from dense to sparse representation, and that conversion overhead often cancels out what’s saved by skipping the zeros.
Critically, prior work on sparse LLM kernels (including TurboSparse, ProSparse, and Q-Sparse) has focused on memory-bound GEMV operations — the single- or few-token inference regime. The research team instead targets compute-bound GEMM operations in the batched setting with thousands of input tokens, where dense baselines on modern devices can execute orders of magnitude higher FLOP/s with large tiles and Tensor Cores. That is a fundamentally harder problem, and the reason prior approaches didn’t generalize to batched training or high-throughput inference.
01 — The Problem
Feedforward layers dominate LLM cost — and most of that work is wasted.
⅔+
of all model parameters live in feedforward layers
80%+
of total FLOPs consumed by feedforward layers
99%+
of hidden activations can be zero with no accuracy drop
For any given token, only a tiny fraction of hidden neurons actually fire. The rest output zero after the activation function. This is called activation sparsity — and it has historically been impossible to exploit on modern GPUs because sparse operations ran slower than dense ones.
Prior sparse LLM kernels (TurboSparse, ProSparse, Q-Sparse) only targeted single-token GEMV operations. Sakana AI and NVIDIA tackle the harder problem: batched GEMM with thousands of tokens — the regime that covers both training and high-throughput inference.
02 — The Innovation
TwELL: a sparse format built around how GPU kernels actually work.
Old Way — ELL
Row-wide packing, costly to build
Standard ELLPACK packs non-zeros row-by-row across the entire matrix. To construct it from a tiled matmul output you need a separate kernel launch, a full global memory read, and synchronization across all CTAs. Those overheads cancel out the savings from skipping zeros.
New Way — TwELL
Tile-wise packing, built in the epilogue
TwELL partitions columns into horizontal tiles matching the matmul kernel’s tile size T_n. Non-zeros are packed locally within each tile. By matching dimensions, TwELL is constructed inside the existing gate projection kernel epilogue — no extra kernel, no extra memory read, no synchronization overhead.
The inference pipeline uses one fused kernel that reads gate activations in TwELL format and performs up + down projections together. The intermediate hidden state is never written to global memory, cutting DRAM traffic at every forward pass.
For training, a hybrid sparse format dynamically routes rows into a compact ELL matrix (sparse rows) or a dense backup (overflow rows). Sparsity during training is highly non-uniform — max non-zeros per row can be orders of magnitude above the average — so the hybrid design handles this without becoming brittle.
03 — Training Recipe
Two changes to your training config. Nothing else.
01
Replace SiLU with ReLU as the gate activation function. ReLU produces exact zeros for negative inputs — this is what enables unstructured sparsity. No other architectural change is needed. (Unregularized ReLU sits slightly below SiLU on task accuracy: 46.4% vs 47.1% on the 1.5B model, offset by the efficiency gains.)
02
Add an L1 loss term on the hidden feedforward activations, averaged over all tokens and hidden dimensions across all layers. Recommended coefficient: L1 = 2×10⁻⁵. Add it to your standard cross-entropy loss. No changes to learning rate, weight decay, batch size, or optimizer.
03
Sparsity stabilizes fast. The non-zero count settles within ~1,000 training steps (~1B tokens). The training kernels deliver memory and throughput benefits for almost the entire training run, not just toward the end.
Watch Out
At L1 = 2×10⁻⁵, over 30% of neurons become permanently inactive (dead neurons) on average across layers. Downstream accuracy is not visibly affected at this level. The paper explores targeted gate weight reinitialization as a mitigation — yielding +19.1% speedup vs +17.9% baseline with no accuracy cost.
04 — Benchmark Results
Accuracy preserved. Efficiency scales up with model size.
| Model | Accuracy | Inference | Energy / tok | Training | Peak Mem |
|---|---|---|---|---|---|
| 0.5B | 40.4% → 40.4% | +17.0% | −11.8% | −1.5% | −19.2% |
| 1B | 44.6% → 44.7% | +18.1% | −14.6% | +7.1% | −25.5% |
| 1.5B | 46.4% → 46.2% | +18.8% | −15.0% | +11.6% | −28.1% |
| 2B | 49.1% → 48.8% | +20.5% | −17.0% | +21.9% | +22.3% * |
All results at L1 = 2×10⁻⁵ on a single node of eight H100 PCIe GPUs, sequence length 2048. Efficiency gains grow with scale — average non-zero activations drop from 39 (0.5B) to 24 (2B), giving the sparse kernels proportionally more computation to skip. * The 2B sparse model uses a larger micro-batch enabled by reduced activation memory, raising peak usage while improving throughput.
05 — Key Findings
What the paper reveals about where sparsity actually lives.
◆
Early layers are least active. In a 28-layer 1.5B model, the first two layers have the fewest non-zero activations. Activity peaks in the early-to-middle layers — consistent with prior work showing LLM reasoning and knowledge retrieval concentrate there.
◆
First tokens in a sequence fire far more neurons. The model allocates exponentially more computation to early sequence positions where contextual cues from prior tokens are absent. This non-uniformity is exactly what the sparse kernels exploit for speedups.
◆
Strong inverse correlation between sparsity and speedup. The paper measures a Pearson correlation of −0.996 between each layer’s average non-zero count and its inference speedup contribution. Sparser layers deliver proportionally larger gains.
◆
Larger gains on less specialized hardware. On NVIDIA RTX PRO 6000 (188 SMs vs 114 on H100), training speedups are significantly higher. Dense GEMM is slower on the RTX 6000, while sparse ops run faster — widening the relative advantage of sparsity on accessible hardware.
06 — Get Started
Open-source. All kernels and training code released.
■
Architecture: Works with gated feedforward LLMs — Llama, Qwen, and any Transformer++ design. Non-gated (original transformer) variant also supported: 11.2% inference speedup vs 17.9% for gated at the same L1.
■
Hardware: CUDA kernels written for H100 GPUs using TMA-based pipelining and persistent cooperative design. Gains verified on RTX PRO 6000 with even larger speedups.
■
Existing models: Fine-tuning via sparsification approaches is flagged as a future direction for bringing these kernels to pretrained dense models — not yet demonstrated in this paper.
What Exactly Is Proposed
The research team addresses this mismatch with two primary contributions: a new sparse data format called TwELL (Tile-wise ELLPACK), and a set of custom CUDA kernels for inference and training built around it.
TwELL is designed around one key insight: modern matmul kernels already divide computation across small 2D tiles (of size T_m × T_n) assigned to individual cooperative thread arrays (CTAs). Standard ELL packs non-zeros row-by-row across the entire matrix, which requires global synchronization to construct from tiled matmul outputs. TwELL instead partitions the columns of the gate activation matrix into horizontal tiles of size T, and within each tile stores non-zero values and their indices in a local ELL-style layout. By matching the tile dimension T to the column tile size T_n of the matmul kernel, TwELL can be produced directly in the epilogue of the gate projection kernel — no extra kernel launch, no additional global memory read, no synchronization across CTAs. The format uses a compression factor C such that T/C exceeds the maximum non-zeros per tile, and packages values, indices, and non-zero counts into a single 32-bit matrix for locality.
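The tile-wise packing idea can be sketched in NumPy. This is an illustrative simplification, not the paper's CUDA epilogue: the function name is hypothetical, and the three per-tile arrays are returned separately rather than packaged into the single 32-bit matrix the paper describes.

```python
import numpy as np

def twell_pack_row(gate_row, T):
    """Pack one row of gate activations tile-by-tile (illustrative TwELL sketch).

    For each column tile of width T, return the non-zero values, their
    local (within-tile) indices, and the non-zero count -- the three
    quantities the paper packs together into a single 32-bit matrix.
    """
    tiles = []
    for start in range(0, len(gate_row), T):
        tile = gate_row[start:start + T]
        idx = np.flatnonzero(tile)          # local indices within the tile
        tiles.append((tile[idx], idx, len(idx)))
    return tiles

# Example: a mostly-zero gate activation row (post-ReLU), tile width T = 4
row = np.array([0.0, 1.5, 0.0, 0.0, 0.0, 0.0, 0.7, 0.2])
packed = twell_pack_row(row, T=4)
print([(v.tolist(), i.tolist(), n) for v, i, n in packed])
# → [([1.5], [1], 1), ([0.7, 0.2], [2, 3], 2)]
```

Because indices are local to a tile, each CTA can emit its own tile's packed data in the matmul epilogue without coordinating with any other CTA — that locality is the whole point of matching T to the kernel's column tile size T_n.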


For inference, a single fused kernel takes the gate activations in TwELL format and performs the up and down projections together. Each CTA handles one row of inputs, iterating first statically over column tiles and then dynamically over each tile’s non-zero count. For each active neuron at index n, the CTA loads the n-th column of the up projection weight matrix W_u and the n-th row of the down projection weight matrix W_d, computes the dot product, and accumulates into the output. The intermediate hidden state h_u is never materialized in global memory, cutting DRAM traffic significantly.
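The per-row arithmetic the fused kernel performs can be written as a NumPy reference (a sketch of the math only, not the CUDA kernel; the function name and array shapes are illustrative):

```python
import numpy as np

def fused_sparse_ffn_row(x, gate_row, W_u, W_d):
    """Reference math for the fused sparse up + down projection.

    For each active gate neuron n, compute only that neuron's up
    projection (x . W_u[:, n]) and accumulate its gated contribution
    through the matching down-projection row W_d[n, :]. The full
    intermediate hidden state h_u is never materialized.
    """
    y = np.zeros(W_d.shape[1])
    for n in np.flatnonzero(gate_row):       # dynamic loop over non-zeros
        h_n = x @ W_u[:, n]                  # scalar: one up-projection output
        y += gate_row[n] * h_n * W_d[n, :]   # gated accumulate into the output
    return y
```

The result matches the dense computation `(gate_row * (x @ W_u)) @ W_d` exactly, but the work scales with the non-zero count rather than the full hidden dimension.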
For training, the situation is more complex because sparsity patterns are highly non-uniform across tokens and layers — the maximum non-zeros per row can be orders of magnitude above the average, making a pure ELL layout brittle. The research team introduces a hybrid sparse format that dynamically routes rows either into a compact ELL matrix (for rows below a non-zero threshold) or into a dense backup matrix (for overflow rows). This allows efficient sparse gradient computation in the backward pass without requiring dense-to-dense matmuls for most rows. The team also releases kernels for the original non-gated transformer feedforward block; at the recommended sparsity level, the non-gated variant achieves an 11.2% inference speedup compared to 17.9% for the gated design.
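The row-routing logic of the hybrid format can be sketched as follows (a minimal illustration under assumed data layouts; the function name and the exact ELL padding scheme are not from the paper):

```python
import numpy as np

def hybrid_route(H, max_nnz):
    """Route rows of a sparse activation matrix H (illustrative sketch).

    Rows with at most `max_nnz` non-zeros go into a compact ELL block
    (values + column indices, zero-padded to max_nnz); heavier outlier
    rows fall back to a dense overflow path, so a few extreme rows
    cannot inflate the ELL width for everyone else.
    """
    ell_vals, ell_idx, dense_rows = [], [], []
    for r, row in enumerate(H):
        nz = np.flatnonzero(row)
        if len(nz) <= max_nnz:
            vals = np.zeros(max_nnz)
            idx = np.zeros(max_nnz, dtype=int)
            vals[:len(nz)] = row[nz]
            idx[:len(nz)] = nz
            ell_vals.append(vals)
            ell_idx.append(idx)
        else:
            dense_rows.append(r)             # overflow: handled densely
    return np.array(ell_vals), np.array(ell_idx), dense_rows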
Just ReLU and L1 Regularization
The sparsity induction strategy is deliberately minimal. The research team uses ReLU as the gate activation function and adds a simple L1 loss term on the hidden feedforward activations, controlled by a coefficient L1. No other architectural changes are required, and the team reports that adding L1 regularization did not require changes to any other hyperparameters (learning rate, weight decay, optimizer settings).
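The full training objective is then just cross-entropy plus the penalty. A minimal NumPy sketch, assuming equal-shaped per-layer activation arrays (the function name and list-of-arrays interface are illustrative, not the paper's code):

```python
import numpy as np

L1_COEFF = 2e-5  # the paper's recommended coefficient

def total_loss(hidden_acts, ce_loss, coeff=L1_COEFF):
    """Cross-entropy plus an L1 penalty on hidden feedforward activations.

    `hidden_acts` is a list of per-layer arrays of shape (tokens, hidden_dim).
    Since ReLU has already been applied, values are non-negative and
    |h| == h. The penalty is averaged over tokens and hidden dimensions
    across all layers, per the recipe (exact under equal layer shapes).
    """
    l1 = np.mean([np.mean(np.abs(h)) for h in hidden_acts])
    return ce_loss + coeff * l1
```

Because the coefficient is tiny, the penalty barely perturbs the loss value itself; its effect is the gradient pressure that drives gate activations toward exact zeros.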
Models were trained on the FineWeb dataset (a deduplicated FineWeb-Edu split) at Chinchilla-optimal token counts — approximately 10B tokens for a 0.5B model up to 40B tokens for a 2B model — with a context length of 2048 and a batch size of 1M tokens.
Testing eight L1 coefficient values on a 1.5B parameter model, they find that up to L1 = 3×10⁻⁵, there is essentially no drop in mean task accuracy across seven downstream benchmarks (ARC Easy/Challenge, HellaSwag, OpenBookQA, PIQA, WinoGrande, CommonsenseQA), with final cross-entropy increasing by less than 2% relative to the unregularized baseline. The recommended setting L1 = 2×10⁻⁵ reduces average non-zero activations from 911 per layer (in the unregularized 1.5B model with a feedforward hidden dimension of 5632) down to just 29 — roughly 99.5% sparsity — with no measurable downstream performance loss.
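The reported sparsity percentage follows directly from those non-zero counts, as a quick arithmetic check shows:

```python
hidden_dim = 5632           # FFN hidden dimension of the 1.5B model
nnz_sparse = 29             # average non-zeros at L1 = 2e-5

sparsity = 1 - nnz_sparse / hidden_dim
print(f"{sparsity:.1%}")    # → 99.5%
```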
One important caveat: at L1 = 2×10⁻⁵, over 30% of neurons become permanently inactive (dead neurons) on average across layers. The research team explores two mitigation strategies — scheduling the L1 warmup and applying targeted reinitialization to dead gate projection columns — and finds that the reinitialization approach maintains similar sparsity levels while slightly improving both downstream accuracy and efficiency (+19.1% inference speedup vs. +17.9% baseline). This is listed as a direction for future work.
Measured Efficiency Gains
The efficiency results are reported on a single node of eight H100 PCIe GPUs, with a fixed sequence length of 2048 tokens. For the cross-scale comparison, the L1 coefficient is fixed at 2×10⁻⁵.
At smaller scales, sparsity delivers clear peak memory reductions during training:
| Model | Dense Peak Memory | Sparse Peak Memory | Change |
|---|---|---|---|
| 0.5B | 26.2 GB | 21.2 GB | −19.2% |
| 1B | 44.5 GB | 33.1 GB | −25.5% |
| 1.5B | 62.8 GB | 45.1 GB | −28.1% |
At 2B parameters, the sparse model uses a larger micro-batch (enabled by reduced activation memory at that scale), which results in higher peak GPU memory (46.7 → 57.1 GB) but faster training throughput (+21.9%). The efficiency gains on all metrics for the 2B model:
- Forward execution throughput: 87.8 → 106 input tokens/ms (+20.5%)
- Energy per token: 7.85 → 6.51 mJ (−17.0%)
- Training step throughput: 22.4 → 27.3 input tokens/ms (+21.9%)
Across the full 0.5B–2B range, mean task accuracy of sparse and non-sparse models remains statistically indistinguishable. Efficiency benefits grow with model scale: larger models naturally develop lower average non-zero counts (dropping from 39 at 0.5B to 24 at 2B), which means the sparse kernels skip a proportionally greater share of computation.
Training speedups are also observed on NVIDIA’s RTX PRO 6000 GPU, where the larger Streaming Multiprocessor count (188 vs. 114 on H100) allows sparse operations to run faster — suggesting these gains extend to less specialized hardware.
What the Sparsity Patterns Reveal
Sparsity is not uniform: the first two layers of a 28-layer 1.5B model are the least active, followed by a pronounced peak in non-zero activations across early-middle layers — consistent with prior work suggesting this is where much of LLM reasoning and knowledge retrieval occurs. Separately, the first tokens in an input sequence activate far more neurons than later tokens, with an exponential decrease thereafter. The research team observed an inverse Pearson correlation of −0.996 between each layer’s average non-zero count and its inference speedup contribution, confirming that the sparsest layers provide the greatest per-layer gains.
Check out the Paper, Repo and Technical details.
