Perplexity AI Open-Sources Unigram Tokenizer That Achieves 5x Lower p50 Latency Than Hugging Face tokenizers Crate


Perplexity AI’s research team reimplemented their Unigram tokenizer from scratch in Rust and open-sourced the code in pplx-garden, their inference technology repository.

At production input lengths, the new encoder cuts p50 latency by roughly 5x versus the Hugging Face tokenizers crate, ~2x versus SentencePiece (C++), and ~1.5x versus IREE’s tokenizer (C), with zero steady-state heap allocations. In production, it reduced CPU utilization in Perplexity’s inference stack by 5-6x and shaved double-digit milliseconds off reranker latency.

Why Tokenization Became a Bottleneck

LLM inference cost is typically framed around GPU work: KV caches, attention kernels, expert routing. But smaller models, such as embedding models, classifiers, and rerankers, tell a different story. These models are two to three orders of magnitude smaller than frontier transformers.

A reranker scoring hundreds of candidate documents per request is a clear example. With a small model, GPU compute often finishes in single-digit milliseconds. Every input still passes through CPU-side tokenization first. When batch sizes are large, tokenization becomes a meaningful fraction of total request latency.

Perplexity’s work targets XLM-RoBERTa, a model with a 250K-token Unigram vocabulary trained with SentencePiece. Fine-tuned RoBERTa-family encoders are a common production choice for ranking, retrieval, and similarity tasks.

What is Unigram Tokenization?

Unigram tokenization was introduced by Kudo in 2018 and is implemented in SentencePiece. It frames segmentation as a most-probable-path problem. Each vocabulary token has a learned log-probability. The tokenizer picks the segmentation whose token scores sum to the highest value.

The algorithm used to find that best path is the Viterbi algorithm, a dynamic programming technique from 1967. Byte positions form graph layers and vocabulary tokens are edges spanning a contiguous byte range. The DP recurrence iterates over byte positions and updates the best-scoring path at each position.

The outer loop runs in linear time relative to input length. The inner loop walks a vocabulary trie (a prefix tree structure) at each byte position. On a 16K-token input, this inner walk executes hundreds of thousands of trie transitions. It is the hot path.

What was Slow in the Hugging Face Implementation

The Hugging Face tokenizers crate is the default Rust tokenizer most teams reach for. Perplexity used it as the benchmark reference. At 514 tokens (512 + BOS/EOS injection), the reference implementation had three costly patterns:

BottleneckMechanismMeasured impact
Allocation per matchString::from_utf8 + AHashMap lookup per trie match7,295 allocations at 514 tokens; 299,171 at 16K
Pointer chase per byteAHashMap at every trie node; 4 dependent loads per byte stepDependent-load latency dominates the hot path
L2 thrashing on long inputsDP table and output buffers freshly allocated each callL2 miss rate climbs from 8% at 128 tokens to 50% at 16K

Per-token allocation is constant: roughly 2 KB and ~18 allocations per token, regardless of input size. The latency problem becomes severe at longer inputs when cumulative allocations overflow the per-core L2 cache.

Establishing a Baseline Before Changing the Trie

Before switching the trie structure, Perplexity first isolated how much cost came from unnecessary work alone. They made a zero-allocation port of the reference: same HashMap trie, but with a caller-owned scratch struct reused across calls and token IDs stored directly in trie nodes (removing the per-match string allocation and secondary hash-map lookup).

This baseline already cut p50 latency to 155 µs at 514 tokens, down from 326 µs in the reference. Instructions retired dropped 2.4x. The remaining cost was the HashMap pointer chase itself, which the next step addressed.

The Three Optimizations

Optimization 1: Double-Array Trie

The Hugging Face trie stores children in a HashMap at every node. Each byte step requires a hash computation, two pointer dereferences, and a heap access. Perplexity replaced this with a double-array trie, the same structure used by SentencePiece and IREE, originally introduced by Aoe in 1989.

A double-array trie encodes the entire trie in two flat integer arrays, base and check. A child lookup is: next = base[node] + byte, then verify check[next] == node. That is two array reads, one integer add, and one comparison, with no hashing and no pointer chasing. For XLM-RoBERTa’s 250K vocab, the whole trie fits in ~9 MB of contiguous memory. The hot working set per encode is on the order of 100 KB, which fits in L2 cache.

Unlike SentencePiece and IREE, which are general-purpose libraries with lattice bookkeeping and multi-stage pipelines, Perplexity inlined the trie directly in the Viterbi loop and dropped that overhead entirely.

Result at 514 tokens: p50 dropped from 155 µs (zero-allocation baseline) to 68 µs. Wall-clock fell 4.8x from the original reference.

Optimization 2: Bitmap and Inline Packing

The double-array trie still requires two dependent array loads per byte step: first the parent’s base offset, then the check array to confirm the transition is valid. Perplexity replaced the check array with a per-node bitmap (four 64-bit words, 32 bytes) that records which of the 256 possible bytes have valid child transitions.

A bitmap lookup compiles to a single bit test against one 64-bit word. The check array is used only during trie construction and dropped from the runtime layout entirely.

They also packed all four per-node fields (bitmap, base, token ID, and score) into a single 64-byte cache line, matching CPU cache line width exactly. One trie step now loads a single cache line covering the bitmap for the next-byte check, the base offset for the child slot, and the token ID and score at terminal nodes.

Trade-off: trie size grows from ~9 MB to ~50 MB (780K nodes x 64 bytes). The hot working set per encode remains ~100 KB.

Result at 514 tokens: Additional 4.5% wall-clock reduction. L2 accesses dropped from 4.6K to 1.8K per encode.

Optimization 3: Huge Pages for the Trie

At 50 MB, the trie spans roughly 12,000 virtual pages on a default Linux system using 4 KB pages. The first-level data TLB on Intel Sapphire Rapids holds 96 entries. Each Viterbi step touches a different trie node, so TLB misses accumulate. Over a 512-token encode, Perplexity estimated roughly 9,000 cycles spent in page-table walks, about 3% of per-encode budget.

Perplexity backed the trie with 2 MB huge pages via mmap with the MAP_HUGETLB flag. The same 50 MB now spans 25 pages, well within the TLB. This requires vm.nr_hugepages configured at boot. In production, 10,561 huge pages are reserved; the trie uses 24.

Result: 3-12% wall-clock reduction depending on input length. The largest gain is at 4,098 tokens (-12.0%), where page-table traffic was actively competing with trie data for L2 bandwidth. Beyond 4K tokens the gain shrinks because L3 misses dominate.

Final Benchmark Results

All measurements are single-threaded, pinned to one core on an Intel Xeon Platinum 8488C, with 10,000 iterations after 1,000 warmup rounds. At 514 tokens:

Enginep50 LatencyInstructionsAllocations
Hugging Face (tokenizers crate)349 µs3.60M7,295
SentencePiece (C++)128 µs1.83M1,559
IREE tokenizer (C)112 µs2.28M1
Perplexity (final, all 3 optimizations)~63 µs1.04M0

Across the full optimization sequence, instructions per encode fell from 3.66M to 1.04M, a 3.5x reduction. Wall-clock matches that ratio at short inputs and widens at long inputs where the reference’s per-token allocations overflow L2 and L3.

One additional finding: off-the-shelf Rust wrapper crates around SentencePiece and IREE add 1.6-1.9x latency overhead compared to the native C/C++ binaries. The sentencepiece crate allocates a fresh list of token pieces on each call. The overhead is measurable but amortizes at long inputs.

The final Perplexity encoder produces token-exact output against the reference. In production, it uses rayon to parallelize across cores.

Marktechpost’s Visual Explainer

Open Source Release

Perplexity AI Rewrites Its Unigram Tokenizer, Cuts CPU Utilization 5-6x

Perplexity reimplemented their Unigram tokenizer from scratch in Rust and open-sourced it in pplx-garden. Three targeted optimizations removed wasted work from the hot path.

5xLower p50 vs HuggingFace tokenizers crate

5-6xCPU utilization reduction in production

0Heap allocations on the hot path

Source: research.perplexity.ai

The Problem

Why CPU Tokenization Became a Bottleneck

LLM inference cost is usually framed around GPU work: KV caches, attention kernels, expert routing. But small models tell a different story.

1

Rerankers and embedders are small

Two to three orders of magnitude smaller than frontier transformers. GPU compute finishes in single-digit milliseconds.

2

Tokenization runs on CPU before each call

Every input passes through CPU-side tokenization first, turning text into vocabulary IDs.

3

Batch size amplifies the cost

A reranker scoring hundreds of documents per request means tokenization runs hundreds of times per query.

Background

What Is Unigram Tokenization?

Introduced by Kudo (2018), implemented in SentencePiece. Perplexity targets XLM-RoBERTa with a 250K-token Unigram vocabulary.

Most-probable-path problem

Each vocabulary token carries a learned log-probability. The tokenizer picks the segmentation whose token scores sum highest.

Viterbi algorithm (1967)

A dynamic programming method that finds the best path. Byte positions are graph layers; vocabulary tokens are edges.

The hot path is the inner trie walk at each byte position. On a 16K-token input, this executes hundreds of thousands of trie transitions and retires tens of millions of instructions per encode.

Root Cause

Three Bottlenecks in the Hugging Face Reference

Measured at 514 tokens (512 + BOS/EOS) on Intel Xeon Platinum 8488C:

BottleneckMechanismImpact
Allocation per matchString::from_utf8 + AHashMap lookup per trie match7,295 allocs at 514 tokens; 299,171 at 16K
Pointer chase per byteAHashMap at every trie node; 4 dependent loads per stepDependent-load latency dominates
L2 thrashingDP table and output buffers freshly allocated each callL2 miss rate: 8% at 128 tokens, 50% at 16K

Per-token allocation is constant: ~2 KB and ~18 allocations per token regardless of input size.

Step 0: Baseline

Zero-Allocation Port Before Changing the Trie

Before touching the trie structure, Perplexity isolated how much cost came from unnecessary allocations alone. They kept the same HashMap trie but made two changes:

  • Caller-owned scratch struct reused across calls, removing per-encode DP table allocation
  • Token IDs stored directly in trie nodes, removing per-match String allocation and secondary hash-map lookup

Baseline p50

155 µs (-2.1x)

Allocations alone were the dominant cost. Instructions retired dropped 2.4x. The HashMap pointer chase was now the remaining bottleneck.

Optimization 1

Double-Array Trie

The HashMap trie costs 4 dependent loads per byte step. The double-array trie (Aoe, 1989) replaces it with flat integer arrays base and check.

HashMap trie (reference)

Hash byte, load bucket, follow pointer to child, follow pointer to child’s HashMap. 4 dependent loads per step.

Double-array trie

next = base[node] + byte
Verify check[next] == node
2 array reads, 1 add, 1 compare. No hashing.

250K vocab fits in ~9 MB contiguous memory. Hot working set per encode is ~100 KB, fitting in L2 cache. Result: p50 drops from 155 µs to 68 µs, wall-clock 4.8x faster than original reference.

Optimization 2

Bitmap + 64-Byte Cache-Line Packing

The double-array trie still needs two dependent array loads per step. Perplexity replaced the check array with a per-node bitmap.

  • Per-node bitmap: four 64-bit words (32 bytes), one bit per possible byte value. A single bit test replaces the second array load.
  • All four per-node fields (bitmap, base, token ID, score) packed into one 64-byte cache line.
  • One trie step now loads a single cache line covering validity, child offset, and terminal data.

L2 accesses at 514 tokens

4,600 (Darts) vs 1,800 (Bitmap)

Trie size trade-off

~9 MB (Darts) grows to ~50 MB (780K nodes x 64 bytes)

Optimization 3

2 MB Huge Pages for the Trie

At 50 MB with 4 KB pages, the trie spans ~12,000 virtual pages. Intel Sapphire Rapids holds only 96 entries in the first-level data TLB. TLB misses trigger page-table walks.

~9,000 cycles spent in page-table walks per 512-token encode, about 3% of the per-encode budget.

Fix: back the trie with 2 MB huge pages via mmap with MAP_HUGETLB. The same 50 MB spans 25 pages, well within TLB capacity. In production, 10,561 huge pages are reserved; the trie uses 24.

At 514 tokens

65.4 µs without huge pages vs 63.1 µs with (-3.4%)

At 4,098 tokens

773 µs without huge pages vs 679 µs with (-12.0%)

Results

Final Benchmark at 514 Tokens

Single-threaded, pinned core, Intel Xeon Platinum 8488C. 10,000 iterations after 1,000 warmup rounds.

Enginep50 LatencyInstructionsAllocations
Hugging Face (Rust)349 µs3.60M7,295
SentencePiece (C++)128 µs1.83M1,559
IREE tokenizer (C)112 µs2.28M1
Perplexity (final)~63 µs1.04M0

Instructions per encode fell from 3.66M to 1.04M, a 3.5x reduction. Note: off-the-shelf Rust wrapper crates around SentencePiece and IREE add 1.6-1.9x overhead vs native binaries due to per-call allocations.

Key Takeaways

What Engineers Should Know

  • CPU tokenization is invisible in GPU profiling traces but real in end-to-end latency for small models.
  • Removing per-encode heap allocations (zero-allocation baseline) cut p50 from 326 µs to 155 µs before any trie change.
  • Double-array trie brought p50 to 68 µs. Bitmap packing and huge pages brought it to ~63 µs.
  • The Rust wrapper crates around SentencePiece and IREE add 1.6-1.9x latency overhead vs native binaries.
  • Source code is available at github.com/perplexityai/pplx-garden under MIT license.

Production impact

5-6x CPU utilization reduction + double-digit ms off reranker latency

Target model

XLM-RoBERTa, 250K-token SentencePiece Unigram vocabulary


Key Takeaways

  • Perplexity rebuilt their Unigram tokenizer targeting XLM-RoBERTa’s 250K-token SentencePiece vocabulary
  • The new encoder achieves zero steady-state heap allocations and ~63 µs p50 at 514 tokens
  • Three optimizations: double-array trie, bitmap + 64-byte cache-line packing, and 2 MB huge pages for the trie
  • Intermediate result: a zero-allocation HashMap port alone cut p50 from 326 µs to 155 µs before the trie was changed
  • Production impact: 5-6x CPU utilization reduction and double-digit ms reduction in reranker latency

Check out the Repo and Technical detailsAlso, feel free to follow us on Twitter and don’t forget to join our 150k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.

Need to partner with us for promoting your GitHub Repo OR Hugging Face Page OR Product Release OR Webinar etc.? Connect with us




Source link

Leave a Reply

Your email address will not be published. Required fields are marked *