
    From Prompt to Prediction: Understanding Prefill, Decode, and the KV Cache in LLMs

    By Oliver Chambers, March 31, 2026


    In the previous article, we saw how a language model converts logits into probabilities and samples the next token. But where do these logits come from?

    In this tutorial, we take a hands-on approach to understand the generation pipeline:

    • How the prefill phase processes your entire prompt in a single parallel pass
    • How the decode phase generates tokens one at a time using previously computed context
    • How the KV cache eliminates redundant computation to make decoding efficient

    By the end, you’ll understand the two-phase mechanics behind LLM inference and why the KV cache is essential for generating long responses at scale.

    Let’s get started.

    Photo by Neda Astani. Some rights reserved.

    Overview

    This article is divided into three parts; they are:

    • How Attention Works During Prefill
    • The Decode Phase of LLM Inference
    • KV Cache: How to Make Decode More Efficient

    How Attention Works During Prefill

    Consider the prompt:

    Today’s weather is so …

    As humans, we can infer the next token should be an adjective, because the last word “so” is a setup. We also know it probably describes weather, so words like “nice” or “warm” are more likely than something unrelated like “delicious”.

    Transformers arrive at the same conclusion through attention. During prefill, the model processes the entire prompt in a single forward pass. Every token attends to itself and all tokens before it, building up a contextual representation that captures relationships across the full sequence.

    The mechanism behind this is the scaled dot-product attention formula:

    $$
    \text{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V
    $$

    We will walk through this concretely below.
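Before the walk-through, the formula can be sketched in plain Python. This is a minimal illustration rather than how any library implements attention: the helper names `softmax` and `attention` and the toy inputs are assumptions made for this example, and no causal mask is applied yet.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for small lists of vectors.

    Q: list of query vectors, K: list of key vectors (each of length d_k),
    V: list of value vectors. Returns one output vector per query.
    """
    d_k = len(K[0])
    outputs = []
    for q in Q:
        # One score per key: dot(q, k) / sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors
        out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        outputs.append(out)
    return outputs

# Hypothetical toy inputs: one query that matches the first key far better
Q = [[1.0, 0.0]]
K = [[4.0, 0.0], [0.0, 4.0]]
V = [[10.0], [20.0]]
result = attention(Q, K, V)
print(result)
```

The single output lands close to 10, the value paired with the better-matching key, but not exactly on it, because softmax still gives the other key a small share of the weight.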

    To make the attention computation traceable, we assign each token a scalar value representing the information it carries:

    Position  Token    Value
    1         Today    10
    2         weather  20
    3         is       1
    4         so       5

    Words like “is” and “so” carry less semantic weight than “Today” or “weather”, and as we’ll see, attention naturally reflects this.

    Attention Heads

    In real transformers, attention weights are continuous values learned during training through the $Q$ and $K$ dot product. The behavior of attention heads is learned and usually impossible to describe in simple terms. No head is hardwired to “attend to even positions”. The four rules below are a simplified illustration to make the attention mechanism more intuitive, while the weighted aggregation over $V$ is the same.

    Here are the rules in our toy example:

    1. Attend to tokens at even-numbered positions
    2. Attend to the last token
    3. Attend to the first token
    4. Attend to every token

    For simplicity in this example, when a head attends to more than one token, the values of those tokens are averaged.

    Let’s walk through the prefill process:

    Today

    1. Even tokens → none
    2. Last token → Today → 10
    3. First token → Today → 10
    4. All tokens → Today → 10

    weather

    1. Even tokens → weather → 20
    2. Last token → weather → 20
    3. First token → Today → 10
    4. All tokens → average(Today, weather) → 15

    is

    1. Even tokens → weather → 20
    2. Last token → is → 1
    3. First token → Today → 10
    4. All tokens → average(Today, weather, is) → 10.33

    so

    1. Even tokens → average(weather, so) → 12.5
    2. Last token → so → 5
    3. First token → Today → 10
    4. All tokens → average(Today, weather, is, so) → 9
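The hand calculation above can be reproduced with a few lines of plain Python. This is only a sketch of the four toy rules as stated, not of learned attention, and the helper names `mean` and `toy_heads` are made up for illustration.

```python
# Scalar value carried by each token, as in the table above
values = {"Today": 10.0, "weather": 20.0, "is": 1.0, "so": 5.0}
tokens = list(values)

def mean(xs):
    """Average of a list, or None when the list is empty."""
    return sum(xs) / len(xs) if xs else None

def toy_heads(visible):
    """Apply the four toy rules to the tokens visible so far."""
    vals = [values[t] for t in visible]
    even = [v for pos, v in enumerate(vals, start=1) if pos % 2 == 0]
    return [
        mean(even),   # rule 1: tokens at even positions
        vals[-1],     # rule 2: the last visible token
        vals[0],      # rule 3: the first token
        mean(vals),   # rule 4: every visible token
    ]

table = {}
for i, tok in enumerate(tokens, start=1):
    table[tok] = toy_heads(tokens[:i])   # a token sees itself and earlier tokens
    print(tok, table[tok])
```

The last row, for “so”, gives the four numbers that the rest of the article carries forward as the context for the final prompt position.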

    Parallelizing Attention

    If the prompt contained 100,000 tokens, computing attention step by step would be extremely slow. Fortunately, attention can be expressed as tensor operations, allowing all positions to be computed in parallel.

    This is the key idea of the prefill phase in LLM inference: when you provide a prompt, it contains multiple tokens, and they can be processed in parallel. Such parallel processing reduces the time it takes to produce the first generated token.

    To prevent tokens from seeing future tokens, we apply a causal mask, so each token can attend only to itself and earlier tokens.

    import torch

    tokens = ["Today", "weather", "is", "so"]
    n = len(tokens)
    d_k = 64

    V = torch.tensor([[10.], [20.], [1.], [5.]], dtype=torch.float32)
    positions = torch.arange(1, n + 1).float()  # 1-based: [1, 2, 3, 4]
    idx = torch.arange(n)

    # Row i is the query position, column j the key position: True where j <= i
    causal_mask = idx.unsqueeze(1) >= idx.unsqueeze(0)
    print(causal_mask)

    Output:

    tensor([[ True, False, False, False],
            [ True,  True, False, False],
            [ True,  True,  True, False],
            [ True,  True,  True,  True]])
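To see what the mask does to the attention weights, here is a small standard-library sketch. It assumes every key receives the same raw score (an assumption chosen so that only the mask matters), and `masked_softmax` is a hypothetical helper, not a PyTorch function.

```python
import math

n = 4
# Same lower-triangular pattern as the mask printed above
causal_mask = [[key <= query for key in range(n)] for query in range(n)]

def masked_softmax(scores, mask_row):
    """Softmax over the visible positions; hidden positions get weight 0."""
    masked = [s if keep else float("-inf") for s, keep in zip(scores, mask_row)]
    m = max(masked)                            # finite: position 0 is always visible
    exps = [math.exp(s - m) for s in masked]   # exp(-inf) == 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Assumed raw scores: every key scores the same, so only the mask matters
uniform_scores = [0.0] * n
weights = [masked_softmax(uniform_scores, causal_mask[q]) for q in range(n)]
for row in weights:
    print([round(w, 3) for w in row])
```

Each row sums to 1 and puts zero weight on later positions, so the first token attends only to itself while the fourth spreads its weight over all four tokens.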

    Now, we can start writing the “rules” for the four attention heads.

    Rather than computing scores from learned $Q$ and $K$ vectors, we handcraft them directly to match our four attention rules. Each head produces a score matrix of shape (n, n), with one score per query-key pair, which gets masked and passed through softmax to produce attention weights:


    def selector(condition, size):
        """Return a (size, d_k) tensor of +1/-1 depending on condition."""
        val = torch.where(condition, torch.ones(size), -torch.ones(size))  # (size,)
        # Repeat each +1/-1 across the key dimension: (size, d_k)
        return val.unsqueeze(1).expand(size, d_k).contiguous()

    # Shared query: every row asks for a property, and K encodes which tokens match it.
    Q = torch.ones(n, d_k)

    # Head 1: select even positions
    # K says whether each token is at an even position.
    K1 = selector(positions % 2 == 0, n)
    scores1 = (Q @ K1.T) / (d_k ** 0.5)

    # Head 2: select the last token
    # K says whether each token is the last one.
    K2 = selector(positions == n, n)
    scores2 = (Q @ K2.T) / (d_k ** 0.5)

    # Head 3: select the first token
    # K says whether each token is the first one.
    K3 = selector(positions == 1, n)
    scores3 = (Q @ K3.T) / (d_k ** 0.5)

    # Head 4: select all visible tokens uniformly
    # K is +1 for every token.
    K4 = selector(positions == positions, n)
    scores4 = (Q @ K4.T) / (d_k ** 0.5)

    # Stack all head score matrices: shape (4, n, n)
    scores = torch.stack([scores1, scores2, scores3, scores4], dim=0)

    # Apply causal mask so position i can only attend to positions <= i
    scores = scores.masked_fill(~causal_mask.unsqueeze(0), -1e9)

    # Convert logits to attention weights
    weights = torch.softmax(scores, dim=-1)

    # Optional safeguard for fully masked rows
    all_masked = (scores <= -1e4).all(dim=-1, keepdim=True)
    weights = torch.where(all_masked, torch.zeros_like(weights), weights)

    # Compute contexts: (heads, n, n) @ (n, 1) -> (heads, n, 1)
    contexts = (weights @ V).squeeze(-1)

    print("Contexts by attention head (rows) x token position (columns):\n", contexts)

    context4 = contexts[:, -1]
    print("\nContext for final prompt position:\n", context4)

    Output:

    Contexts by attention head (rows) x token position (columns):
     tensor([[10.0000, 20.0000, 20.0000, 12.5000],
            [10.0000, 15.0000, 10.3333,  5.0000],
            [10.0000, 10.0000, 10.0000, 10.0000],
            [10.0000, 15.0000, 10.3333,  9.0000]])

    Context for final prompt position:
     tensor([12.5000,  5.0000, 10.0000,  9.0000])

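    The result of this step is called a context vector, which represents a weighted summary of all previous tokens. Because softmax always spreads its weight over the visible tokens, a head whose rule matches no visible token falls back to a uniform average of them. That is why head 1 reports 10 at position 1, and why head 2, whose “last token” is defined in the code as position n = 4, reports plain averages at positions 1 to 3. The final column agrees with the hand walk-through.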
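It may help to see why keys of +1 and -1 behave almost like hard selection. The arithmetic below is a sketch for head 2 at position 4 only, using the same d_k = 64 and values as the code above; the variable names `match_score` and `miss_score` are chosen here for illustration.

```python
import math

d_k = 64
# With Q = all ones and K = all +1 or all -1, the dot product is +d_k or -d_k
match_score = d_k / math.sqrt(d_k)     # +8.0
miss_score = -d_k / math.sqrt(d_k)     # -8.0

# Head 2 ("last token") at position 4: three misses, then one match
scores = [miss_score, miss_score, miss_score, match_score]
exps = [math.exp(s - max(scores)) for s in scores]
weights = [e / sum(exps) for e in exps]

V = [10.0, 20.0, 1.0, 5.0]
context = sum(w * v for w, v in zip(weights, V))
print(weights)
print(context)
```

A gap of 16 between matching and non-matching scores leaves the non-matching tokens with a combined weight well below one in a million, which is why the printed context rounds to 5.0000.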

    From contexts to logits

    Each attention head has learned to pick up on different patterns in the input. Together, the four context values [12.5, 5.0, 10.0, 9.0] form a summary of what “Today’s weather is so…” represents. This vector is then projected through a matrix in which each column encodes how strongly a given vocabulary word is associated with each attention head’s signal, producing one logit score per word.

    ...

    logits = context @ W_vocab

    For our example, let’s say we have “nice”, “warm”, and “delicious” in the vocab:


    ...

    vocab = ["nice", "warm", "delicious"]

    # Each column corresponds to a vocab word
    # Each row corresponds to one attention head feature
    W_vocab = torch.tensor([
        [0.8, 0.6, 0.1],  # head 1 weights → nice, warm, delicious
        [0.5, 0.4, 0.2],  # head 2 weights
        [0.1, 0.2, 0.5],  # head 3 weights
        [0.2, 0.3, 0.1],  # head 4 weights
    ])  # shape: (4, 3)

    logits = context4 @ W_vocab  # (4,) @ (4, 3) → (3,)

    for word, logit in zip(vocab, logits):
        print(f"{word:10s} {logit.item():.3f}")


    So the logits for “nice” and “warm” are much higher than the logit for “delicious”:

    nice       15.300
    warm       14.200
    delicious  8.150
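To connect these logits back to sampling, the sketch below applies a plain softmax to the three printed values. It assumes no temperature or top-k filtering and a vocabulary of only these three words, which is far smaller than any real model’s.

```python
import math

vocab = ["nice", "warm", "delicious"]
logits = [15.3, 14.2, 8.15]   # the values printed above

m = max(logits)                            # subtract the max for numerical stability
exps = [math.exp(x - m) for x in logits]
probs = [e / sum(exps) for e in exps]

for word, p in zip(vocab, probs):
    print(f"{word:10s} {p:.4f}")
```

Under these assumptions “nice” receives roughly three quarters of the probability mass, “warm” most of the remainder, and “delicious” well under one percent.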

    The Decode Phase of LLM Inference

    Now suppose the model generates the next token: “nice”. The task is now to generate the next token with the extended prompt:

    Today’s weather is so nice …

    The first four words in the extended prompt are the same as in the original prompt, and a fifth word has been added.

    During decode, we don’t recompute attention for all previous tokens because the result would be the same. Instead, we compute attention only for the new token to save time and compute resources. This produces a single new attention row.
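The claim that earlier rows do not change can be checked on the uniform “attend to every token” rule, where the context at each position is simply the running mean of the values seen so far. This is a sketch in plain Python; `causal_uniform_contexts` is a hypothetical helper, and 7.0 is the value the article assigns to “nice”.

```python
def causal_uniform_contexts(values):
    """Context at each position = mean of all values up to that position."""
    contexts = []
    running = 0.0
    for i, v in enumerate(values, start=1):
        running += v
        contexts.append(running / i)
    return contexts

prompt_values = [10.0, 20.0, 1.0, 5.0]
before = causal_uniform_contexts(prompt_values)
after = causal_uniform_contexts(prompt_values + [7.0])   # append "nice"

print(before)
print(after)
```

The first four entries of `after` are identical to `before`, because under a causal mask no earlier position can see the appended token. Only the fifth row is new, and that row is all a decode step needs to compute.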

    new_token = "nice"
    tokens = tokens + [new_token]
    new_value = torch.tensor([[7.0]])  # value of "nice" is 7
    V = torch.cat([V, new_value], dim=0)
    n = len(tokens)
    idx = torch.arange(n)
    pos = torch.arange(1, n + 1).float()  # [1, 2, 3, 4, 5]

    print("New tokens: ", tokens)
    print("New Values: ", V)

    Output:

    New tokens:  ['Today', 'weather', 'is', 'so', 'nice']
    New Values:  tensor([[10.],
            [20.],
            [ 1.],
            [ 5.],
            [ 7.]])

    Now, we apply the four attention heads and compute the new context vector:


    # Rebuild all K matrices for the next token (n=5)
    # We will introduce the KV cache later
    K1_new = selector(pos % 2 == 0, n)   # even positions → +1
    K2_new = selector(pos == n,     n)   # last token     → +1
    K3_new = selector(pos == 1,     n)   # first token    → +1
    K4_new = selector(pos == pos,   n)   # all tokens     → +1

    # During decode, only compute Q for the NEW token (one row)
    Q_new = torch.ones(1, d_k)

    scores1_new = (Q_new @ K1_new.T) / (d_k ** 0.5)  # (1, 5)
    scores2_new = (Q_new @ K2_new.T) / (d_k ** 0.5)  # (1, 5)
    scores3_new = (Q_new @ K3_new.T) / (d_k ** 0.5)  # (1, 5)
    scores4_new = (Q_new @ K4_new.T) / (d_k ** 0.5)  # (1, 5)

    # Stack: shape (4, 1, 5)
    new_scores = torch.stack(
        [scores1_new, scores2_new, scores3_new, scores4_new], dim=0)

    # No causal mask needed: the new token can see all previous tokens by definition
    new_weights = torch.softmax(new_scores, dim=-1)  # (4, 1, 5)

    context5 = (new_weights @ V).squeeze()           # (4,)

    print("Visible tokens:", tokens)
    print("Context for new token position:\n", context5)

    Output:

    Visible tokens: ['Today', 'weather', 'is', 'so', 'nice']
    Context for new token position:
     tensor([12.5000,  7.0000, 10.0000,  8.6000])

    Unlike prefill, where the entire prompt is processed in parallel, decoding must generate tokens one at a time (autoregressively) because the future tokens have not yet been generated. Without caching, every decode step would recompute keys and values for all previous tokens from scratch, making the total key and value computation across all decode steps $O(n^2)$ in sequence length. The KV cache reduces this to $O(n)$ by computing each token’s $K$ and $V$ exactly once.
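The difference between the two growth rates can be made concrete by counting. The sketch below tallies per-token key/value computations only. It ignores the cost of the attention scores themselves, and `kv_computations` is a hypothetical helper written for this illustration.

```python
def kv_computations(prompt_len, new_tokens, use_cache):
    """Count how many per-token key/value computations are performed."""
    total = prompt_len        # prefill computes K and V for the prompt once
    seq_len = prompt_len
    for _ in range(new_tokens):
        seq_len += 1          # one new token joins the sequence
        # With a cache only the new token is processed; without it, every token is
        total += 1 if use_cache else seq_len
    return total

for steps in (10, 100, 1000):
    print(steps,
          kv_computations(4, steps, use_cache=False),
          kv_computations(4, steps, use_cache=True))
```

For a 4-token prompt and 10 generated tokens this counts 99 computations without a cache against 14 with one, and the gap widens quadratically as the output grows.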

    KV Cache: How to Make Decode More Efficient

    To make autoregressive decoding efficient, we can store the keys ($K$) and values ($V$) for every token, separately for each attention head. In this simplified example, we can use just one cache. Then, during decoding, when a new token is generated, the model doesn’t recompute keys and values for all previous tokens. It computes the query, key, and value for the new token only, appends the new key and value to the cache, and attends to the cached keys and values.

    If we look at the previous code again, we can see that there is no need to recompute $K$ for the entire tensor:

    K1_new = selector(pos % 2 == 0, n)  # even positions → +1

    Instead, we can simply compute $K$ for the new position and append it to the $K$ matrix we have already computed and stored in the cache:

    K1_new = selector(new_pos % 2 == 0, 1)  # is pos 5 even? → -1

    K1_cache = torch.cat([K1, K1_new], dim=0)  # (4→5, d_k)

    Here’s the full code for the decode phase using the KV cache:


    # In decode we only compute the query for the NEW token (position 5).
    new_pos = pos[-1:]  # tensor([5.])

    # Compute ONLY the new token's key for each head
    K1_new = selector(new_pos % 2 == 0, 1)     # is pos 5 even?  → -1
    K2_new = selector(new_pos == n,      1)    # is pos 5 last?  → +1
    K3_new = selector(new_pos == 1,      1)    # is pos 5 first? → -1
    K4_new = selector(new_pos == new_pos, 1)   # always          → +1

    # Append new key to the cached prefill keys
    K1_cache = torch.cat([K1, K1_new], dim=0)  # (4→5, d_k)
    # Position 4 is no longer last. This in-place edit is an artifact of the toy
    # "last token" rule; real keys do not depend on later tokens.
    K2[-1] = -torch.ones(d_k)
    K2_cache = torch.cat([K2, K2_new], dim=0)
    K3_cache = torch.cat([K3, K3_new], dim=0)
    K4_cache = torch.cat([K4, K4_new], dim=0)

    # Q is only for the new token
    Q_dec = torch.ones(1, d_k)

    scores1_dec = (Q_dec @ K1_cache.T) / (d_k ** 0.5)
    scores2_dec = (Q_dec @ K2_cache.T) / (d_k ** 0.5)
    scores3_dec = (Q_dec @ K3_cache.T) / (d_k ** 0.5)
    scores4_dec = (Q_dec @ K4_cache.T) / (d_k ** 0.5)

    # Stack → (4 heads × 1 query × n keys)
    scores_dec = torch.stack([scores1_dec, scores2_dec, scores3_dec, scores4_dec], dim=0)

    # Softmax over key dimension
    weights_dec = torch.softmax(scores_dec, dim=-1)

    # Edge case: all-masked rows → zero context (same guard as prefill)
    all_masked_dec = (scores_dec <= -1e4).all(dim=-1, keepdim=True)
    weights_dec = torch.where(all_masked_dec, torch.zeros_like(weights_dec), weights_dec)

    # Context vectors: (4 × 1 × n) @ (n × 1) → (4 × 1 × 1) → squeeze → (4,)
    contexts_dec = (weights_dec @ V).squeeze(-1).squeeze(-1)

    print("\nDecode context for 'nice' (one value per head):\n", contexts_dec)

    Output:

    Decode context for 'nice' (one value per head):
     tensor([12.5000,  7.0000, 10.0000,  8.6000])

    Notice this is identical to the result we computed without the cache. The KV cache doesn’t change what the model computes, but it eliminates redundant computations.

    The KV cache differs from caches in other applications in that the stored object is not replaced but extended. Every new token added to the prompt appends a new row to the stored tensor. Implementing a KV cache that can efficiently update the tensor is the key to making LLM inference faster.
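The append-only behavior can be sketched as a tiny class. This is an illustration in plain Python rather than how any inference library structures its cache. The `KVCache` class, the `attend` helper, the uniform keys (mirroring head 4), and the value 3.0 for a second generated token are all assumptions made for this example.

```python
import math

class KVCache:
    """Hypothetical append-only store for one head's keys and values."""
    def __init__(self):
        self.keys = []     # one key vector per token seen so far
        self.values = []   # one scalar value per token seen so far

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)

def attend(query, cache):
    """Attention for a single query against everything in the cache."""
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
              for key in cache.keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return sum(e / total * v for e, v in zip(exps, cache.values))

d_k = 4
uniform_key = [1.0] * d_k      # like head 4: every token gets the same key
query = [1.0] * d_k

cache = KVCache()
for v in [10.0, 20.0, 1.0, 5.0]:   # prefill: fill the cache from the prompt
    cache.append(uniform_key, v)

contexts = []
for v in [7.0, 3.0]:               # decode: one append and one query per step
    cache.append(uniform_key, v)
    contexts.append(attend(query, cache))

print(len(cache.keys), contexts)
```

Each decode step adds exactly one key and one value and issues one query, so the per-step work on keys and values stays constant while the cache grows by one row.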


    Summary

    In this article, we walked through the two phases of LLM inference. During prefill, the full prompt is processed in a single parallel forward pass and the keys and values for every token are computed and stored. During decode, the model generates one token at a time, using only the new token’s query against the cached keys and values to avoid redundant recomputation. Together, these two phases explain why LLMs can process long prompts quickly but generate output token by token, and why the KV cache is essential for making that generation practical at scale.
