Paged Attention in Large Language Models (LLMs)

When running LLMs at scale, the real limitation is GPU memory rather than compute, primarily because every request requires a KV cache to store token-level data. In traditional setups, a large fixed memory block is reserved per request based on the maximum sequence length, which leads to significant unused space and limits concurrency. Paged Attention improves this by breaking the KV cache into smaller, flexible chunks that are allocated only when needed, much like how virtual memory works. It also allows multiple requests with the same starting prompt to share memory and only duplicate it when their outputs begin to differ. This approach greatly improves memory efficiency, enabling significantly higher throughput with very little overhead.

In this article, we simulate the naive KV cache allocator, build a working Paged Attention implementation with a block table and Copy-on-Write prefix sharing, and measure the utilisation gap across batch sizes of 10 to 200 concurrent requests.

import math
import random
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from collections import defaultdict
 
random.seed(42)
np.random.seed(42)

Earlier than simulating something, we have to know the way a lot GPU reminiscence a single token really prices. This relies totally on the mannequin’s structure. We use a GPT-style configuration — 32 layers, 32 consideration heads, 128 dimensions per head, saved in fp16. The issue of two on the entrance accounts for each the Key and Worth projections (there is no such thing as a Q cache — queries are recomputed at every step). Multiplying these out provides us 524,288 bytes, or 512 KB, per token. That is the basic unit every part else is constructed on — pre-allocation sizes, web page counts, and wasted reminiscence all scale instantly from this quantity.

NUM_LAYERS  = 32
NUM_HEADS   = 32
HEAD_DIM    = 128
BYTES_FP16  = 2
PAGE_SIZE   = 16    # tokens per page (vLLM default)
MAX_SEQ_LEN = 2048
 
KV_BYTES_PER_TOKEN = 2 * NUM_LAYERS * NUM_HEADS * HEAD_DIM * BYTES_FP16
KV_MB_PER_TOKEN    = KV_BYTES_PER_TOKEN / 1024 / 1024

The naive approach is simple: when a request arrives, a contiguous block of GPU memory is allocated, sized to the maximum sequence length (2048 tokens in this case). This happens because the response length is unknown up front, so the worst case is reserved.

AVG_RESPONSE is set to 500, a realistic average for a production chatbot. Multiplying by KV_MB_PER_TOKEN gives what is actually written versus what was locked. The gap is the waste.

The numbers make the problem concrete. Each request pre-allocates 1024 MB but uses only 250 MB, a utilisation of 24.4%. The remaining 774 MB sits reserved for the entire duration of the request, unavailable to any other request. Across 100 concurrent users, that is 75 GB of GPU memory doing nothing. This is not an edge case: it is the default behavior of every system that does not implement paged allocation, and it is exactly why naive serving systems hit an OOM wall long before the GPU is computationally saturated.

print("=" * 60)
print("SECTION 1 -- Naive KV Cache: The Waste Downside")
print("=" * 60)
 
AVG_RESPONSE = 500   # practical common tokens generated
 
pre_allocated_mb = MAX_SEQ_LEN  * KV_MB_PER_TOKEN
actually_used_mb = AVG_RESPONSE * KV_MB_PER_TOKEN
 
print(f"nKV cache per token    : {KV_BYTES_PER_TOKEN:,} bytes")
print(f"Pre-allocated/request : {pre_allocated_mb:.2f} MB  ({MAX_SEQ_LEN} tokens)")
print(f"Really used/request : {actually_used_mb:.2f} MB  ({AVG_RESPONSE} tokens)")
print(f"Utilisation           : {actually_used_mb / pre_allocated_mb * 100:.1f}%")
print(f"Wasted per request    : {pre_allocated_mb - actually_used_mb:.2f} MB")
 
NUM_USERS = 100
wasted_gb = (pre_allocated_mb - actually_used_mb) * NUM_USERS / 1024
print(f"nAcross {NUM_USERS} concurrent customers → {wasted_gb:.2f} GB wasted")
print("n→ Naive programs utilise solely 20-38% of allotted KV cache reminiscence")
print("  (supply: unique Paged Consideration / vLLM paper)")

Two classes are introduced here to simulate how Paged Attention actually works at the memory management level.

PagePool represents the physical GPU memory pool: a flat array of equal-size pages, each holding 16 tokens. It maintains a free list and a ref count per page. When a page's ref count drops to zero, it is immediately returned to the free list and becomes available to any new request. This is the key difference from naive allocation: there are no reserved holes, no fragmentation, and no memory tied to a finished request.

PagedRequest represents a single inference request. It holds a block_table, a list that maps logical page indices to physical page ids in the pool. Whenever generate_token() is called and the token count crosses a page boundary, a new physical page is claimed from the pool. No memory is touched before it is needed.

Five requests are run with token counts of 320, 48, 160, 96, and 272. The output shows pages allocated proportionally to actual usage: req-1 with 48 tokens gets 3 pages, req-0 with 320 tokens gets 20. When req-1 is freed, its 3 pages go straight back to the pool and are immediately reusable. The pool utilisation of 10.9% looks low only because 512 pages were provisioned for five small requests; in a fully loaded production pool it would sit near the 98% range seen in Section 4. The "0 tokens wasted" in the last-page column is a seed artifact: all five token counts happen to be exact multiples of 16. In practice, the average last-page waste is PAGE_SIZE / 2 = 8 tokens per request.

print("n" + "=" * 60)
print("SECTION 2 -- Paged Consideration: Pages + Block Desk")
print("=" * 60)
 
"""
As a substitute of 1 massive contiguous block per request:
  - KV cache is break up into fixed-size pages (PAGE_SIZE tokens every)
  - Pages are allotted on demand, can reside anyplace in GPU reminiscence
  - Every request retains a block_table: logical index → bodily web page id
"""
 
class PagePool:
    def __init__(self, total_pages):
        self.free      = record(vary(total_pages))
        self.complete     = total_pages
        self.ref_count = defaultdict(int)
 
    def allocate(self):
        if not self.free:
            increase MemoryError("OOM -- no free pages")
        pid = self.free.pop(0)
        self.ref_count[pid] = 1
        return pid
 
    def launch(self, pid):
        self.ref_count[pid] -= 1
        if self.ref_count[pid] <= 0:
            self.free.append(pid)
            del self.ref_count[pid]
 
    def share(self, pid):
        """Increment ref depend -- one other request is sharing this web page."""
        self.ref_count[pid] += 1
 
    def cow_copy(self, pid):
        """CoW: allocate a brand new web page, decrement ref on the previous one."""
        new_pid = self.allocate()
        self.launch(pid)
        return new_pid
 
    @property
    def utilisation(self):
        return (self.complete - len(self.free)) / self.complete * 100
 
 
class PagedRequest:
    def __init__(self, req_id, pool: PagePool):
        self.id          = req_id
        self.pool        = pool
        self.block_table = []   # logical index → bodily web page id
        self.tokens      = 0
 
    def generate_token(self):
        if self.tokens % PAGE_SIZE == 0:   # web page boundary → allocate new web page
            self.block_table.append(self.pool.allocate())
        self.tokens += 1
 
    def free(self):
        for pid in self.block_table:
            self.pool.launch(pid)
        self.block_table.clear()
 
 
pool = PagePool(total_pages=512)
requests = [PagedRequest(f"req-{i}", pool) for i in range(5)]
token_counts = [320, 48, 160, 96, 272]
 
for req, n in zip(requests, token_counts):
    for _ in vary(n):
        req.generate_token()
 
print("nRequest state after era:")
print(f"  {'ID':<10} {'Tokens':>8} {'Pages':>7} {'Final-page waste':>16}")
for req in requests:
    waste = req.tokens % PAGE_SIZE
    waste = PAGE_SIZE - waste if waste else 0
    print(f"  {req.id:<10} {req.tokens:>8} {len(req.block_table):>7} {waste:>16} tokens")
 
print(f"nPool utilisation : {pool.utilisation:.1f}%")
requests[1].free()
print(f"After releasing req-1 → utilisation: {pool.utilisation:.1f}%  (pages instantly reusable)")

In production, nearly every request to a deployed LLM carries the same system prompt, the instructions that define the model's behavior. Under naive allocation, each of those requests stores its own full copy of the system prompt's KV cache. With 10 concurrent requests and a 200-token system prompt, that is 10 identical copies of the same data occupying separate memory regions.

The same PagePool from Section 2 is reused here, extended with two methods: share() increments a page's ref count without allocating anything new, and cow_copy() allocates a fresh page and decrements the ref count on the original. A new pool is instantiated and the system prompt is encoded into 13 pages (math.ceil(200 / 16)). Each of the 10 user requests then calls share() on all 13 pages, pointing their block tables at the same physical memory. No new pages are allocated. The ref count on each shared page simply rises to 11.

The savings are immediate: naive allocation would require 130 pages across 10 requests. With CoW, only 13 physical pages exist. That is 936 MB saved from a single shared prefix.

When req-3 generates its first unique token, cow_copy() is called on its last shared page, page 12. A new page 13 is allocated as req-3's private copy, and the ref count on page 12 drops by one. The other 9 requests continue pointing at page 12, completely unaffected. This is the CoW contract: shared until divergence, private only when necessary.

print("n" + "=" * 60)
print("SECTION 3 -- Copy-on-Write: Shared System Prompts")
print("=" * 60)
 
"""
If N requests share a system immediate, naive allocation shops N copies.
With CoW, all requests level to the SAME bodily pages.
A personal copy is made solely when a request writes a diverging token.
"""
 
cow_pool    = PagePool(total_pages=512)
SYSTEM_TOKENS = 200
system_pages  = math.ceil(SYSTEM_TOKENS / PAGE_SIZE)
shared_pids   = [cow_pool.allocate() for _ in range(system_pages)]
print(f"nSystem immediate → {system_pages} shared pages: {shared_pids}")
 
N = 10
user_tables = []
for i in vary(N):
    desk = record(shared_pids)
    for pid in shared_pids:
        cow_pool.share(pid)     # ref depend up -- no bodily copy
    user_tables.append(desk)
 
saved_mb = (system_pages * N - system_pages) * PAGE_SIZE * KV_MB_PER_TOKEN
print(f"nStoring system immediate for {N} requests:")
print(f"  Naive : {system_pages * N} pages  ({system_pages * N * PAGE_SIZE * KV_MB_PER_TOKEN:.1f} MB)")
print(f"  CoW   : {system_pages} pages   ({system_pages * PAGE_SIZE * KV_MB_PER_TOKEN:.1f} MB)")
print(f"  Saved : {saved_mb:.1f} MB")
 
old_pid                 = user_tables[3][-1]
new_pid                 = cow_pool.cow_copy(old_pid)
user_tables[3][-1]      = new_pid
print(f"nReq-3 diverges → CoW: previous web page {old_pid} → new web page {new_pid}")
print(f"All different {N-1} requests nonetheless share web page {old_pid} unaffected")

Two functions are defined to measure utilisation under each approach across different batch sizes.

naive_utilisation draws token counts from a normal distribution with avg=500 and std=200, clipped to [200, 2048]. This reflects a realistic production distribution: most responses fall between 200 and 800 tokens, with occasional long ones. For each request, the full 2048-slot block is pre-allocated regardless. Utilisation is then actual_tokens_sum / (2048 × n), the ratio of what was written to what was reserved.

paged_utilisation takes the same actual token counts but computes how many pages each request would need: ceil(tokens / 16). The only waste is the unfilled tail of each request's last page, which averages 8 tokens. Utilisation is actual_tokens_sum / (pages_allocated × 16).

The results are run across batch sizes of 10, 25, 50, 100, and 200. Naive utilisation hovers around 24% across all batch sizes, with some variance at smaller batches due to sampling noise, which is exactly avg / max_seq = 500 / 2048. It does not improve with scale because the waste is structural, not statistical.

Paged utilisation sits flat at ~98.5% regardless of batch size, because the waste per request is bounded by a single partial page and does not scale with max_seq_len at all. The gap between the two numbers, roughly 74 percentage points, is directly what allows vLLM to fit 2–4× more concurrent requests into the same GPU memory.

print("n" + "=" * 60)
print("SECTION 4 -- Utilisation: Naive vs Paged")
print("=" * 60)
 
def naive_utilisation(n, max_seq=2048, avg=500, std=200):
    precise = np.clip(np.random.regular(avg, std, n).astype(int), 200, max_seq)
    return precise.sum() / (max_seq * n) * 100, precise
 
def paged_utilisation(actual_tokens, page_size=PAGE_SIZE):
    pages = np.ceil(actual_tokens / page_size).astype(int)
    return actual_tokens.sum() / (pages * page_size).sum() * 100
 
batch_sizes = [10, 25, 50, 100, 200]
naive_u, paged_u = [], []
 
print(f"n  {'Batch':>6}   {'Naive':>8}   {'Paged':>8}")
for bs in batch_sizes:
    nu, precise = naive_utilisation(bs)
    pu = paged_utilisation(precise)
    naive_u.append(nu)
    paged_u.append(pu)
    print(f"  {bs:>6}   {nu:>7.1f}%   {pu:>7.1f}%")



I'm a Civil Engineering graduate (2022) from Jamia Millia Islamia, New Delhi, with a keen interest in Data Science, especially neural networks and their application in various areas.


