You Hit Enter. How Does AI Write Back?

A Software Engineer with 6+ years of experience in solving today's problems with technology

People often ask what happens after you hit Enter. Think of it as a high-speed relay race: your text is split into small blocks called tokens, those tokens are turned into numbers and passed through many layers of computation, and finally the model picks output tokens one at a time and sends them back as text. Below, I keep the relay-race metaphor while filling in the technical steps in plain language.

Diagram: text → tokens → embeddings → model layers → output tokens.


1. The Gateway: Tokenization

Computers don't read words the way humans do; they break text into smaller units called tokens. A token can be a whole word (like "play"), a common subword (like "ing"), or a fragment of a rarer word. For example, a tokenizer might split "unhappiness" into "un", "happi", and "ness". Different tokenization algorithms (BPE, WordPiece, unigram) and different trained vocabularies split text differently, so the same word can map to different token sequences in different models.
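To make subword splitting concrete, here is a minimal sketch of a greedy longest-match splitter, similar in spirit to how WordPiece-style tokenizers split a word at inference time. The vocabulary `TOY_VOCAB` and the function `tokenize_word` are made up for illustration; real tokenizers learn far larger vocabularies from data and handle whitespace, casing, and bytes in ways this toy ignores.

```python
# Toy vocabulary (an assumption for this example, not a real model's vocabulary).
TOY_VOCAB = {"un", "happi", "ness", "play", "ing"}


def tokenize_word(word, vocab):
    """Greedy longest-match split of one word into known subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it matches a vocabulary entry.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # Nothing in the vocabulary matches at this position.
            return ["<unk>"]
        pieces.append(word[start:end])
        start = end
    return pieces


print(tokenize_word("unhappiness", TOY_VOCAB))  # ['un', 'happi', 'ness']
print(tokenize_word("playing", TOY_VOCAB))      # ['play', 'ing']
```

Swapping in a different vocabulary changes the split, which is why token counts differ between models.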

What happens next:

  • Each token is mapped to an index in the model’s vocabulary.

  • That index looks up a numeric vector called an embedding, the bridge from language to numbers.
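The two steps above can be sketched as a dictionary lookup followed by a table lookup. The five-entry vocabulary, the dimension of 4, and the random values are all assumptions for illustration; a real model has tens of thousands of entries and learns its embedding values during training.

```python
import random

# Hypothetical vocabulary: token string -> integer index.
vocab = {"I": 0, " love": 1, " writ": 2, "ing": 3, ".": 4}
EMBED_DIM = 4

# One vector per vocabulary entry. Random here; learned in a real model.
rng = random.Random(0)
embedding_table = [
    [rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in vocab
]

tokens = ["I", " love", " writ", "ing", "."]
token_ids = [vocab[t] for t in tokens]             # token -> index
vectors = [embedding_table[i] for i in token_ids]  # index -> embedding

print(token_ids)                      # [0, 1, 2, 3, 4]
print(len(vectors), len(vectors[0]))  # 5 4
```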


Rules of Thumb for English Tokenization

| Rule | Why it matters |
| --- | --- |
| Common words often map to single tokens | Efficient encoding and fewer tokens for frequent concepts |
| Rare words are split into subwords | Let the model represent novel or misspelled words compositionally |
| Punctuation and whitespace are meaningful | They influence token boundaries and the token count |
| Different tokenizers = different token counts | The same sentence can produce different token sequences across models |

2. Embeddings and Positional Encoding

Embeddings are vectors (arrays of numbers) that represent tokens in a high-dimensional space. They let the model perform math on meaning: tokens used in similar contexts tend to have similar vectors. Because embeddings alone don’t encode order, models add positional encodings (either fixed patterns or learned vectors) so the model knows each token's position: who runs first, second, and so on in our relay.

Short summary:

  • Token index → embedding vector

  • Add positional encoding → sequence-aware vectors fed into the model
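As one example of the "fixed patterns" option, here is a minimal sketch of sinusoidal positional encoding added to token embeddings. The function names and the tiny dimension are assumptions for illustration; many current models use learned or relative position schemes instead.

```python
import math


def sinusoidal_position(pos, dim):
    """Fixed sin/cos positional vector for one position (dim must be even)."""
    vec = []
    for i in range(0, dim, 2):
        angle = pos / (10000 ** (i / dim))
        vec.append(math.sin(angle))
        vec.append(math.cos(angle))
    return vec


def add_positions(embeddings):
    """Add a positional vector to each token embedding, element by element."""
    dim = len(embeddings[0])
    return [
        [e + p for e, p in zip(emb, sinusoidal_position(pos, dim))]
        for pos, emb in enumerate(embeddings)
    ]


# The same token appearing twice has the same embedding at both positions.
same_token_twice = [[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, 0.5, 0.5]]
with_pos = add_positions(same_token_twice)
print(with_pos[0])                 # [0.5, 1.5, 0.5, 1.5]
print(with_pos[0] != with_pos[1])  # True
```

After the addition, the two copies of the token are no longer identical, so later layers can tell which one came first.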


3. The Relay: Transformer Layers and Attention

The transformer is the central mechanism doing the heavy lifting.

  • Self-attention: Each token looks at the other tokens and decides how much to “listen” to each one. In text-generation models, a token can only look at the tokens before it, never ahead. Imagine each runner glancing at teammates to decide how to adjust their stride.

  • Attention heads: Attention is split into parallel “heads” that each learn different ways of relating tokens (one head might focus on grammar, another on topic words).

  • Feed-forward layers: After attention mixes information, feed-forward networks transform the result for each position.

  • Stacking: Many layers are stacked so the model can build complex, hierarchical patterns of meaning.

Why this matters:

  • Attention lets the model capture long-range relationships (e.g., a pronoun referring to a noun many tokens earlier).

  • Multiple layers let low-level patterns combine into higher-level ideas.
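The self-attention step can be sketched in a few lines of plain Python. This is a simplified single head under stated assumptions: it skips the learned query, key, and value projections that real transformers apply, so the input vectors play all three roles, and it uses causal masking so each position only sees itself and earlier positions.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors, causal=True):
    """Single-head scaled dot-product attention without learned projections."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for i, query in enumerate(vectors):
        # With causal masking, position i only sees positions 0..i.
        visible = vectors[: i + 1] if causal else vectors
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in visible
        ]
        weights = softmax(scores)
        # The output is a weighted average of the visible vectors.
        mixed = [
            sum(w * value[d] for w, value in zip(weights, visible))
            for d in range(dim)
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(vectors)
for row in weights:
    print([round(w, 2) for w in row])
```

Each printed row shows how much one position "listens" to the positions it can see. The first row has a single weight of 1.0 because the first token has nothing earlier to attend to.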


4. Decoding: Picking the Next Token

Once the model processes the input, it produces a score (logit) for every token in the vocabulary for the next position. These logits are turned into probabilities using softmax. Choosing the actual token is the decoding step.
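Here is a minimal sketch of the logits-to-probabilities step. The four-token vocabulary and the logit values are invented for illustration; a real model produces one logit for each of its tens of thousands of vocabulary entries.

```python
import math


def softmax(logits):
    """Convert logits into probabilities that sum to 1."""
    peak = max(logits)  # subtracting the max avoids overflow, result unchanged
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for a four-token vocabulary.
toy_vocab = ["ing", "er", "e", "s"]
logits = [2.0, 1.0, 0.5, -1.0]
probs = softmax(logits)

for token, p in zip(toy_vocab, probs):
    print(f"{token!r}: {p:.3f}")
```

The largest logit gets the largest probability, but every token keeps a non-zero chance, which is what makes sampling possible.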

Common decoding strategies:

  • Greedy decoding: pick the highest-probability token every time. Fast and deterministic, but the output can be repetitive or get stuck in loops.

  • Sampling: pick randomly according to the probability distribution. More varied, can be noisy.

  • Temperature: divides the logits by a temperature value before softmax. Temperature above 1 → flatter distribution → more randomness. Temperature below 1 → sharper distribution → more deterministic.

  • Top-k: restrict sampling to the top k highest-probability tokens, then sample among them.

  • Top-p (nucleus) sampling: choose the smallest set of top tokens whose cumulative probability ≥ p, then sample from that set.

  • Beam search: keep multiple best partial sequences (beams) and expand them, often used in tasks like translation to optimize a sequence-level score.
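The sampling-based strategies above can be sketched as small functions over a list of logits. The logit values, function names, and parameter choices here are assumptions for illustration, not any particular library's API.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(probs, k):
    """Keep the k most likely token ids, then renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def top_p_filter(probs, p):
    """Keep the smallest set of top token ids with cumulative probability >= p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in ranked:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def sample(filtered, rng):
    """Draw one token id from a {token_id: probability} mapping."""
    ids = list(filtered)
    return rng.choices(ids, weights=[filtered[i] for i in ids], k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for token ids 0..3
rng = random.Random(0)

greedy_id = max(range(len(logits)), key=lambda i: logits[i])
sharp = softmax(logits, temperature=0.5)
flat = softmax(logits, temperature=2.0)

print(greedy_id)                                     # 0
print(sharp[0] > flat[0])                            # True
print(sorted(top_k_filter(softmax(logits), k=2)))    # [0, 1]
print(sorted(top_p_filter(softmax(logits), p=0.8)))  # [0, 1]
print(sample(top_p_filter(softmax(logits), p=0.8), rng) in (0, 1))  # True
```

In practice these are often combined: apply temperature, filter with top-k or top-p, then sample from what remains.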

Decoding is like choosing the next runner in the relay based on a weighted vote of who’s most likely to perform well next. Temperature and top-k/top-p change how confident or exploratory that vote is.
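Beam search is easier to see on a toy example where the greedy choice turns out worse overall. The `NEXT` table below is a made-up stand-in for a model (it only looks at the previous token, which a real model does not), and `beam_search` is a minimal sketch rather than a production implementation.

```python
import math

# Hypothetical next-token probabilities keyed by the previous token.
NEXT = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"z": 0.9, "w": 0.1},
}


def beam_search(start, steps, beam_width):
    """Keep the beam_width best partial sequences by total log-probability."""
    beams = [([start], 0.0)]
    for _ in range(steps):
        candidates = []
        for tokens, score in beams:
            for tok, p in NEXT.get(tokens[-1], {}).items():
                candidates.append((tokens + [tok], score + math.log(p)))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams


greedy = beam_search("<s>", steps=2, beam_width=1)[0]
wide = beam_search("<s>", steps=2, beam_width=2)[0]

print(greedy[0], round(math.exp(greedy[1]), 2))  # ['<s>', 'a', 'x'] 0.3
print(wide[0], round(math.exp(wide[1]), 2))      # ['<s>', 'b', 'z'] 0.36
```

With a beam width of 1 the search is greedy and commits to "a" because it looks best at the first step. With a width of 2 it keeps "b" alive long enough to find the sequence with the higher overall probability.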


5. Detokenization and Output

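Decoding picks one token at a time: each chosen token is appended to the sequence and fed back into the model, and the loop repeats until the model emits an end-of-sequence token or reaches a length limit. The output tokens are then mapped back to text and concatenated into the final string. This step must handle spacing, subword joiners, and special tokens correctly to produce readable text.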

Example end-to-end (illustrative):

  • Model probabilities → sample token "ing"

  • Output tokens: ["I", " love", " writ", "ing", "."]

  • Detokenized: "I love writing."
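The example above can be sketched as a generation loop followed by detokenization. Everything here is a stand-in: `toy_next_token` replays a fixed script instead of running a model, and the `<bos>`/`<eos>` marker names are assumptions, since special tokens differ between models.

```python
SPECIAL_TOKENS = {"<bos>", "<eos>"}


def toy_next_token(tokens):
    """Stand-in for model plus decoder: returns the next token of a fixed script."""
    script = ["<bos>", "I", " love", " writ", "ing", ".", "<eos>"]
    return script[len(tokens)]


def generate(max_new_tokens=10):
    """Append one token at a time until <eos> or the length limit."""
    tokens = ["<bos>"]
    for _ in range(max_new_tokens):
        next_token = toy_next_token(tokens)
        tokens.append(next_token)
        if next_token == "<eos>":
            break
    return tokens


def detokenize(tokens):
    """Drop special tokens and join the rest; spaces live inside the tokens."""
    return "".join(t for t in tokens if t not in SPECIAL_TOKENS)


output_tokens = generate()
print(output_tokens)
print(detokenize(output_tokens))  # I love writing.
```

Because the leading space is stored inside tokens like " love", plain concatenation restores the spacing. Tokenizers that mark word boundaries differently need a different join rule.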


6. Where This Matters (Practical Consequences)

  • Token limits: models have maximum context lengths measured in tokens, not characters. When a prompt plus its output exceeds that limit, the input has to be truncated or the output gets cut short.

  • Cost & latency: APIs often charge per token, so fewer tokens generally mean lower cost and lower latency.

  • Prompt engineering: small wording changes can change tokenization and model behavior.

  • Rare words: uncommon names and technical terms may split into many subwords, increasing token count and sometimes harming fluency.

  • Hallucination & coherence: decoding choices and model training influence when the model invents facts or loses track.
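The token-limit and cost points above come down to simple arithmetic. The sketch below is illustrative only: dropping the oldest tokens is one possible truncation policy among several, and the prices are invented numbers, not any provider's real rates.

```python
def fit_to_context(token_ids, context_limit, reserve_for_output):
    """Drop the oldest tokens so the prompt plus planned output fits the limit."""
    budget = context_limit - reserve_for_output
    if budget <= 0:
        raise ValueError("reserve_for_output must be smaller than context_limit")
    return token_ids[-budget:]


def estimate_cost(input_tokens, output_tokens, price_in_per_1k, price_out_per_1k):
    """Per-token pricing with separate (hypothetical) input and output rates."""
    return (
        input_tokens / 1000 * price_in_per_1k
        + output_tokens / 1000 * price_out_per_1k
    )


prompt = list(range(10))  # ten placeholder token ids
kept = fit_to_context(prompt, context_limit=8, reserve_for_output=3)
print(kept)  # [5, 6, 7, 8, 9]

# Made-up prices per 1,000 tokens, for illustration only.
print(estimate_cost(1500, 500, price_in_per_1k=0.5, price_out_per_1k=1.5))  # 1.5
```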
