Zlab Princeton researchers have released LLM-Pruning Collection, a JAX-based repository that consolidates leading pruning algorithms for large language models into a single, reproducible framework. It targets one concrete goal: make it easy to compare block-level, layer-level and weight-level pruning methods under a consistent training and evaluation stack on both GPUs and TPUs.
What Does LLM-Pruning Collection Contain?
It is described as a JAX-based repo for LLM pruning, organized into three main directories:
- pruning holds implementations of several pruning methods: Minitron, ShortGPT, Wanda, SparseGPT, Magnitude, Sheared LLaMA and LLM-Pruner.
- training provides integration with FMS-FSDP for GPU training and MaxText for TPU training.
- eval exposes JAX-compatible evaluation scripts built around lm-eval-harness, with accelerate-based support for MaxText that gives about a 2 to 4 times speedup (a rough sketch of an eval run follows below).
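The repository's own eval entry points are not reproduced here; as a rough illustration of what an lm-eval-harness run looks like, the library's standard Python API can be invoked as below. The model path and task list are placeholders, not the repo's actual configuration.

```python
# Minimal sketch of an lm-eval-harness run; model path and tasks are placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                    # standard Hugging Face backend
    model_args="pretrained=path/to/pruned-model",  # hypothetical pruned checkpoint
    tasks=["boolq", "hellaswag", "winogrande"],    # subset of tasks from the README table
    batch_size=8,
)
print(results["results"])
```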
Pruning Methods Covered
LLM-Pruning Collection spans several families of pruning algorithms at different granularity levels:
Minitron
Minitron is a practical pruning and distillation recipe developed by NVIDIA that compresses Llama 3.1 8B and Mistral NeMo 12B to 4B and 8B while preserving performance. It explores depth pruning and joint width pruning of hidden sizes, attention and MLP, followed by distillation.
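As a minimal sketch of the width-pruning idea, the snippet below ranks MLP hidden channels by mean activation magnitude on calibration data and keeps the top-k. This is an illustrative simplification, not the repository's or NVIDIA's actual implementation.

```python
# Activation-based width pruning sketch in the spirit of Minitron (illustrative only).
import jax.numpy as jnp

def prune_mlp_width(w_in, w_out, calib_acts, keep: int):
    """w_in: (d_model, d_ff), w_out: (d_ff, d_model), calib_acts: (tokens, d_ff)."""
    importance = jnp.mean(jnp.abs(calib_acts), axis=0)   # score each hidden channel
    keep_idx = jnp.argsort(importance)[::-1][:keep]      # top-k channels by importance
    keep_idx = jnp.sort(keep_idx)                        # preserve original channel order
    return w_in[:, keep_idx], w_out[keep_idx, :]         # slice both MLP projections

# A distillation stage from the original model would follow to recover quality.
```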
In LLM-Pruning Collection, the pruning/minitron folder provides scripts such as prune_llama3.1-8b.sh, which run Minitron-style pruning on Llama 3.1 8B.
ShortGPT
ShortGPT is based on the observation that many Transformer layers are redundant. The method defines Block Influence, a metric that measures the contribution of each layer, and then removes low-influence layers by direct layer deletion. Experiments show that ShortGPT outperforms earlier pruning methods on several multiple-choice and generative tasks.
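The sketch below illustrates the Block Influence idea, under the assumption that the score is one minus the cosine similarity between a layer's input and output hidden states, averaged over calibration tokens; it is not taken from the repository's code.

```python
# Block Influence sketch for ShortGPT-style layer pruning (illustrative assumption).
import jax.numpy as jnp

def block_influence(h_in, h_out, eps=1e-8):
    """h_in, h_out: (tokens, d_model) hidden states before/after one layer."""
    cos = jnp.sum(h_in * h_out, axis=-1) / (
        jnp.linalg.norm(h_in, axis=-1) * jnp.linalg.norm(h_out, axis=-1) + eps
    )
    return 1.0 - jnp.mean(cos)  # low score -> the layer barely changes the representation

# Layers with the lowest Block Influence are candidates for direct deletion.
```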
In the collection, ShortGPT is implemented through the Minitron folder with a dedicated script, prune_llama2-7b.sh.
Wanda, SparseGPT, Magnitude
Wanda is a post-training pruning method that scores weights by the product of weight magnitude and the corresponding input activation norm, on a per-output basis. It prunes the lowest-scoring weights, requires no retraining, and induces sparsity that works well even at billion-parameter scale.
SparseGPT is another post-training method that uses a second-order-inspired reconstruction step to prune large GPT-style models at high sparsity ratios. Magnitude pruning is the classical baseline that removes weights with small absolute values.
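The sketch below contrasts Wanda-style and magnitude scoring for unstructured pruning; shapes and the 50% sparsity target are illustrative assumptions, not the repository's defaults.

```python
# Wanda vs. magnitude scoring sketch for unstructured pruning (illustrative only).
import jax.numpy as jnp

def wanda_scores(w, calib_x):
    """w: (out, in) weight matrix, calib_x: (tokens, in) layer inputs."""
    act_norm = jnp.linalg.norm(calib_x, axis=0)   # per-input-channel activation norm
    return jnp.abs(w) * act_norm[None, :]         # |W_ij| * ||X_j||_2

def magnitude_scores(w):
    return jnp.abs(w)                             # classical baseline: |W_ij|

def prune_per_output(w, scores, sparsity=0.5):
    """Zero out roughly the lowest-scoring fraction of weights within each output row."""
    k = int(w.shape[1] * sparsity)
    threshold = jnp.sort(scores, axis=1)[:, k][:, None]   # per-row cutoff
    return jnp.where(scores >= threshold, w, 0.0)
```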
In LLM-Pruning Collection, all three live under pruning/wanda with a shared installation path. The README includes a dense table of Llama 2 7B results that compares Wanda, SparseGPT and Magnitude across BoolQ, RTE, HellaSwag, Winogrande, ARC-E, ARC-C and OBQA, under unstructured and structured sparsity patterns such as 4:8 and 2:4.
Sheared Llama
Sheared LLaMA is a structured pruning method that learns masks over layers, attention heads and hidden dimensions, and then retrains the pruned architecture. The original release provides models at several scales, including 2.7B and 1.3B.
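The snippet below is a heavily simplified sketch of the structured-masking idea, assuming relaxed per-head masks; the actual method uses hard-concrete masks with constrained optimization before retraining, and the code is not drawn from the repository.

```python
# Simplified structured-masking sketch in the spirit of Sheared LLaMA (illustrative only).
import jax
import jax.numpy as jnp

def masked_head_output(per_head_out, mask_logits):
    """per_head_out: (heads, tokens, d_head); mask_logits: (heads,) learned scores."""
    z = jax.nn.sigmoid(mask_logits)              # relaxed 0/1 mask per attention head
    masked = per_head_out * z[:, None, None]     # scale each head's contribution
    sparsity_penalty = jnp.sum(z)                # drives mask mass toward a target budget
    return masked, sparsity_penalty

# Heads with near-zero mask values are removed, and the smaller model is then retrained.
```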
The pruning/llmshearing directory in LLM-Pruning Collection integrates this recipe. It uses a RedPajama subset for calibration, accessed through Hugging Face, plus helper scripts to convert between Hugging Face and MosaicML Composer formats.
LLM-Pruner
LLM-Pruner is a framework for structural pruning of large language models. It removes non-critical coupled structures, such as attention heads or MLP channels, using gradient-based importance scores, and then recovers performance with a short LoRA tuning stage that uses about 50K samples. The collection includes LLM-Pruner under pruning/LLM-Pruner with scripts for LLaMA, LLaMA 2 and Llama 3.1 8B.
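As a minimal sketch of gradient-based importance for a coupled structure (for example, one attention head), the snippet below scores each structure by a first-order term, |w · dL/dw| summed over its weights; the grouping helpers are assumptions for illustration, not the repository's API.

```python
# First-order importance sketch for coupled structures, in the spirit of LLM-Pruner.
import jax
import jax.numpy as jnp

def structure_importance(loss_fn, params, batch, group_slices):
    """group_slices: list of functions that extract one structure's weights from params."""
    grads = jax.grad(loss_fn)(params, batch)
    scores = []
    for pick in group_slices:
        w, g = pick(params), pick(grads)
        scores.append(jnp.sum(jnp.abs(w * g)))   # |w * dL/dw| summed over the structure
    return jnp.stack(scores)                     # low score -> pruning candidate

# Low-importance structures are removed, then a short LoRA stage recovers performance.
```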
Key Takeaways
- LLM-Pruning Collection is a JAX-based, Apache-2.0 repo from zlab-princeton that unifies popular LLM pruning methods with shared pruning, training and evaluation pipelines for GPUs and TPUs.
- The codebase implements block-, layer- and weight-level pruning approaches, including Minitron, ShortGPT, Wanda, SparseGPT, Sheared LLaMA, Magnitude pruning and LLM-Pruner, with method-specific scripts for Llama-family models.
- Training integrates FMS-FSDP on GPU and MaxText on TPU, with JAX-compatible evaluation scripts built on lm-eval-harness, giving roughly 2 to 4 times faster eval for MaxText checkpoints via accelerate.
- The repository reproduces key results from prior pruning work, publishing side-by-side "paper vs reproduced" tables for methods like Wanda, SparseGPT, Sheared LLaMA and LLM-Pruner, so engineers can verify their runs against known baselines.
Check out the GitHub Repo.
Shobha is a data analyst with a proven track record of developing innovative machine-learning solutions that drive business value.

