The Journey of a Token: What Really Happens Inside a Transformer

By Yasmin Bhatti | December 1, 2025 | 7 Mins Read


In this article, you'll learn how a transformer converts input tokens into context-aware representations and, ultimately, next-token probabilities.

Topics we will cover include:

• How tokenization, embeddings, and positional information prepare the inputs
• What multi-headed attention and feed-forward networks contribute within each layer
• How the final projection and softmax produce next-token probabilities

    Let’s get our journey underway.

The Journey of a Token: What Really Happens Inside a Transformer (diagram). Image by Editor.

    The Journey Begins

Large language models (LLMs) are based on the transformer architecture, a complex deep neural network whose input is a sequence of token embeddings. After a deep process that looks like a parade of numerous stacked attention and feed-forward transformations, it outputs a probability distribution indicating which token should be generated next as part of the model's response. But how can this journey from inputs to outputs be explained for a single token in the input sequence?

In this article, you'll learn what happens inside a transformer model (the architecture behind LLMs) at the token level. In other words, we will see how input tokens, or pieces of an input text sequence, turn into generated text outputs, and the rationale behind the changes and transformations that take place inside the transformer.

The description of this journey through a transformer model will be guided by the diagram above, which shows a generic transformer architecture and how information flows and evolves through it.

Entering the Transformer: From Raw Input Text to Input Embedding

Before entering the depths of the transformer model, a few transformations already happen to the text input, mainly so that it is represented in a form that is fully understandable by the inner layers of the transformer.

    Tokenization

The tokenizer is an algorithmic component that typically works in symbiosis with the LLM's transformer model. It takes the raw text sequence, e.g. the user prompt, and splits it into discrete tokens (often subword units or bytes, sometimes whole words), with each token being mapped to an integer identifier.
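
To make the mapping concrete, here is a deliberately naive Python sketch using a tiny, made-up vocabulary; real LLM tokenizers learn subword vocabularies (e.g. BPE), so this is illustrative only:

```python
# Toy tokenizer sketch: split text on whitespace and map each piece
# to an integer identifier from a small, hypothetical vocabulary.
toy_vocab = {"<unk>": 0, "the": 1, "journey": 2, "of": 3, "a": 4, "token": 5}

def tokenize(text: str) -> list[int]:
    # Naive whitespace split; real tokenizers operate on subword units or bytes
    tokens = text.lower().split()
    return [toy_vocab.get(tok, toy_vocab["<unk>"]) for tok in tokens]

print(tokenize("The journey of a token"))  # [1, 2, 3, 4, 5]
```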

    Token Embeddings

There is a learned embedding table E with shape |V| × d (vocabulary size by embedding dimension). Looking up the identifiers for a sequence of length n yields an embedding matrix X with shape n × d. That is, each token identifier is mapped to a d-dimensional embedding vector that forms one row of X. Two embedding vectors will be similar to each other if they are associated with tokens that have similar meanings, e.g. king and emperor, and vice versa. Importantly, at this stage, each token embedding carries semantic and lexical information for that single token only, without incorporating information about the rest of the sequence (at least not yet).
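
A minimal NumPy sketch of the embedding lookup, using toy values for |V| and d that are assumptions for illustration rather than values from any real model:

```python
import numpy as np

vocab_size, d_model = 6, 8                   # toy |V| and d, chosen only for illustration
rng = np.random.default_rng(0)

E = rng.normal(size=(vocab_size, d_model))   # learned embedding table, shape |V| x d
token_ids = [1, 2, 3, 4, 5]                  # identifiers produced by the tokenizer

X = E[token_ids]                             # row lookup: embedding matrix, shape n x d
print(X.shape)                               # (5, 8)
```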

    Positional Encoding

Before fully entering the core parts of the transformer, it is necessary to inject into each token embedding vector (i.e. into each row of the embedding matrix X) information about the position of that token in the sequence. This is also known as injecting positional information, and it is typically achieved with trigonometric functions like sine and cosine, although there are approaches based on learned positional embeddings as well. A position-dependent component is summed with the original embedding vector e_t associated with a token, as follows:

\[
x_t^{(0)} = e_t + p_{\text{pos}}(t)
\]

with p_pos(t) typically being a trigonometric function of the token position t in the sequence. As a result, an embedding vector that previously encoded only "what a token is" now encodes "what the token is and where in the sequence it sits". This corresponds to the "input embedding" block in the diagram above.
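
Continuing the sketch above, here is one way to add the classic sinusoidal positional encoding to the embedding matrix (learned positional embeddings would replace the function below):

```python
import numpy as np

def sinusoidal_positions(n_tokens: int, d_model: int) -> np.ndarray:
    """Classic sine/cosine positional encoding, one row per position (even d_model assumed)."""
    positions = np.arange(n_tokens)[:, None]          # (n, 1)
    dims = np.arange(0, d_model, 2)[None, :]          # (1, d/2)
    angles = positions / (10000 ** (dims / d_model))
    P = np.zeros((n_tokens, d_model))
    P[:, 0::2] = np.sin(angles)                       # even dimensions get sine
    P[:, 1::2] = np.cos(angles)                       # odd dimensions get cosine
    return P

# x_t^(0) = e_t + p_pos(t), applied to every row of X at once
X0 = X + sinusoidal_positions(X.shape[0], X.shape[1])
```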

Now, time to enter the depths of the transformer and see what happens inside!

Deep Inside the Transformer: From Input Embedding to Output Probabilities

Let's explain what happens to each "enriched" single-token embedding vector as it goes through one transformer layer, and then zoom out to describe what happens across the whole stack of layers.

The notation

\[
h_t^{(0)} = x_t^{(0)}
\]

is used to denote a token's representation at layer 0 (the first layer), while more generically we will use h_t^{(l)} to denote the token's representation at layer l.

Multi-headed Attention

The first major component inside each replicated layer of the transformer is multi-headed attention. This is arguably the most influential component in the entire architecture when it comes to identifying, and incorporating into each token's representation, meaningful information about its role in the whole sequence and its relationships with other tokens in the text, be they syntactic, semantic, or any other kind of linguistic relationship. The multiple heads in this so-called attention mechanism each specialize in capturing different linguistic aspects and patterns in the token and the sequence it belongs to, simultaneously.

The result of a token representation h_t^{(l)} (with positional information injected beforehand, don't forget!) traveling through this multi-headed attention block within a layer is a context-enriched, or context-aware, token representation. By using residual connections and layer normalization within the transformer layer, the newly generated vectors become stabilized blends of their own previous representations and the multi-headed attention output. This helps maintain coherence throughout the whole process, which is applied repeatedly across layers.
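
To ground this, here is a single-head, scaled dot-product attention sketch with random placeholder weights; real layers use several heads, each with its own learned projections, plus layer normalization, which is omitted here:

```python
import numpy as np

def scaled_dot_product_attention(X, Wq, Wk, Wv):
    """One attention head: every token gathers information from all tokens in the sequence."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                    # project to queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])             # pairwise token-to-token scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over the sequence
    return weights @ V                                  # context-aware token representations

rng = np.random.default_rng(1)
d = X0.shape[1]
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
H = X0 + scaled_dot_product_attention(X0, Wq, Wk, Wv)   # residual connection around attention
```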

Feed-forward Neural Network

Next comes something comparatively less complex: a few feed-forward neural network (FFN) layers. For example, these can be per-token multilayer perceptrons (MLPs) whose goal is to further transform and refine the token features that are gradually being learned.

The main difference between the attention stage and this one is that attention mixes into each token representation contextual information from across all tokens, whereas the FFN step is applied independently to each token, refining the contextual patterns already integrated in order to extract useful "knowledge" from them. These layers are also supplemented with residual connections and layer normalization, and as a result of this process, at the end of a transformer layer we have an updated representation h_t^{(l+1)} that becomes the input to the next transformer layer, thereby entering another multi-headed attention block.
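
A matching sketch of the position-wise feed-forward step, applied to the attention output row by row (the 4x hidden width is a common convention, assumed here rather than taken from the article):

```python
import numpy as np

def feed_forward(H, W1, b1, W2, b2):
    """Position-wise FFN: the same two-layer MLP is applied to every token independently."""
    hidden = np.maximum(0.0, H @ W1 + b1)              # linear + ReLU
    return hidden @ W2 + b2                            # project back to the model dimension

rng = np.random.default_rng(2)
d, d_ff = H.shape[1], 4 * H.shape[1]                   # hidden width = 4 x d (common choice)
W1, b1 = rng.normal(size=(d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d)), np.zeros(d)

H_next = H + feed_forward(H, W1, b1, W2, b2)           # residual connection (layer norm omitted)
```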

The whole process is repeated as many times as the number of stacked layers defined in the architecture, thereby progressively enriching the token embedding with more and more high-level, abstract, and long-range linguistic information hidden behind those seemingly indecipherable numbers.
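
Reusing the helper functions sketched above, the full (simplified) stack is just a loop over those two sub-blocks; the layer count and the per-layer random parameters are placeholders:

```python
# Simplified stack: each layer applies attention, then the FFN, with residual connections.
n_layers = 4
H = X0
for layer in range(n_layers):
    layer_rng = np.random.default_rng(layer)
    d = H.shape[1]
    Wq, Wk, Wv = (layer_rng.normal(size=(d, d)) for _ in range(3))
    W1, b1 = layer_rng.normal(size=(d, 4 * d)), np.zeros(4 * d)
    W2, b2 = layer_rng.normal(size=(4 * d, d)), np.zeros(d)

    H = H + scaled_dot_product_attention(H, Wq, Wk, Wv)   # attention sub-block
    H = H + feed_forward(H, W1, b1, W2, b2)               # feed-forward sub-block
```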

Final Destination

So, what happens at the very end? At the top of the stack, after going through the last replicated transformer layer, we obtain a final token representation h_{t*}^{(L)} (where t* denotes the current prediction position) that is projected through a linear output layer followed by a softmax.

The linear layer produces unnormalized scores called logits, and the softmax converts these logits into next-token probabilities.

    Logits computation:

\[
\text{logits}_j = W_{\text{vocab}, j} \cdot h_{t^*}^{(L)} + b_j
\]

Applying softmax to calculate normalized probabilities:

\[
\text{softmax}(\text{logits})_j = \frac{\exp(\text{logits}_j)}{\sum_{k} \exp(\text{logits}_k)}
\]

Using softmax outputs as next-token probabilities:

\[
P(\text{token} = j) = \text{softmax}(\text{logits})_j
\]

These probabilities are computed for all possible tokens in the vocabulary. The next token to be generated by the LLM is then chosen, often the one with the highest probability, although sampling-based decoding strategies are also common.
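
Closing out the running sketch, the final projection, softmax, and a greedy pick of the next token look like this (the vocabulary size and weights are still the toy placeholders from above):

```python
rng = np.random.default_rng(3)
d = H.shape[1]
W_vocab = rng.normal(size=(vocab_size, d))             # output projection, shape |V| x d
b = np.zeros(vocab_size)

h_last = H[-1]                                         # representation at the prediction position t*
logits = W_vocab @ h_last + b                          # unnormalized scores over the vocabulary
probs = np.exp(logits - logits.max())
probs /= probs.sum()                                   # softmax -> next-token probabilities

next_token_id = int(np.argmax(probs))                  # greedy choice (sampling is also common)
```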

Journey's End

This article took a journey, with a gentle level of technical detail, through the transformer architecture to provide a general understanding of what happens to the text that is given to an LLM (the most prominent model based on the transformer architecture), and how this text is processed and transformed inside the model at the token level to finally turn into the model's output: the next word to generate.

We hope you have enjoyed our travels together, and we look forward to the opportunity to embark upon another in the near future.
