Subject 02 · Builds on Foundations
Language Model Architecture
How a language model is built: the transformer, attention, embeddings, context windows, and how text is generated one token at a time.
26 pages across 6 topics
The transformer
The architecture behind modern LLMs.
- Before transformers The transformer is the neural network architecture behind essentially every modern large language model. Before it, recurrent neural networks read text one word at a time, which made training slow and let information from early words fade over long sequences; the transformer drops recurrence so every word can look at every other word at once.
- Self-attention Self-attention is the transformer's key move: for each word, the model weighs every other word by how relevant it is and blends them in. Because there is no step-by-step recurrence, the whole sequence can be processed at once during training, and that parallelism is what let models train fast enough on GPUs to scale.
- The block, repeated A transformer is not one big thing; it is one block stacked many times. Each block has a multi-head self-attention layer and a small feed-forward network, each wrapped in a residual connection and layer normalization. Positional information is supplied separately, typically added to the input embeddings, so the model knows word order.
- Three families The original transformer was an encoder-decoder built for translation. Later models specialized into three shapes: encoder-only for understanding text, decoder-only for generating it, and the full encoder-decoder for translation-style tasks. Today's chatbots are overwhelmingly decoder-only.
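The block structure described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not any particular model's implementation: it uses the post-norm layout (residual add, then layer normalization), omits positional encoding and learned normalization parameters, and takes the attention step as a pluggable function. The names `uniform_attention`, `w1`, and `w2` are hypothetical, and the same weights are reused for every block only to keep the example short; real models learn separate weights per block.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def feed_forward(x, w1, w2):
    """Position-wise feed-forward network: expand, apply ReLU, project back."""
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, row))) for row in w1]
    return [sum(h * w for h, w in zip(hidden, row)) for row in w2]

def block(tokens, attend, w1, w2):
    """One block: attention sublayer, then feed-forward sublayer,
    each wrapped in a residual connection followed by layer normalization."""
    attended = attend(tokens)
    tokens = [layer_norm([t + a for t, a in zip(tok, att)])
              for tok, att in zip(tokens, attended)]
    return [layer_norm([t + f for t, f in zip(tok, feed_forward(tok, w1, w2))])
            for tok in tokens]

def uniform_attention(tokens):
    """Hypothetical stand-in for self-attention: every token receives the average of all tokens."""
    n = len(tokens)
    avg = [sum(col) / n for col in zip(*tokens)]
    return [avg[:] for _ in tokens]

# Three tokens, each a 4-dimensional vector (made-up numbers).
tokens = [[1.0, 0.0, 2.0, -1.0], [0.5, 1.5, -0.5, 0.0], [2.0, 2.0, 0.0, 1.0]]
w1 = [[0.1 * (i - j) for j in range(4)] for i in range(6)]   # 4 -> 6 hidden units
w2 = [[0.05 * (i + j) for j in range(6)] for i in range(4)]  # 6 -> 4 outputs

out = tokens
for _ in range(3):  # "one block stacked many times"
    out = block(out, uniform_attention, w1, w2)
print(len(out), len(out[0]))
```

The point of the sketch is the shape of the computation: the input and output of a block have the same dimensions, which is what makes stacking possible.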
Attention
How tokens look at each other.
- Query, Key, Value Each token is projected into three vectors: a query (what it is looking for), a key (what it offers as a match), and a value (the content passed along). A token's output is a weighted sum of values, weighted by how well its query matches each key.
- Scaling and Softmax To score a query against the keys, the transformer takes each dot product, divides by the square root of the key dimension, and applies softmax across the scores so the weights sum to 1. Without the division, large dot products would push softmax into low-gradient regions; scaling keeps learning stable.
- Why Tokens Attend Meaning is contextual: resolving a pronoun or tracking agreement needs information from elsewhere in the sentence. Self-attention relates every position to every other directly, so distant words influence each other in one step.
- Multi-Head Attention Rather than attend once, the model runs several attention computations in parallel. Each head has its own learned projections, so heads can specialize (grammar in one, pronoun reference in another) and combine into a richer representation.
- Where Attention Came From Attention began in 2014 as a fix for translation: instead of compressing a source sentence into one fixed-length vector, the model could soft-search for relevant parts. That soft-search is the direct ancestor of modern attention.
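As a rough illustration of the query, key, value, scaling, and softmax steps described above, here is a single-head scaled dot-product attention in plain Python. The vectors are made up, and the learned projections that would produce them from token embeddings are assumed to have been applied already. A multi-head version would run this several times with different projections and combine the results.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1 (shifted by the max for numerical stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention for one head.
    Returns the output vectors and the attention weights for each query."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score this query against every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Output is the weighted sum of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Two tokens with 2-dimensional queries, keys, and values (illustrative numbers).
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]

outputs, weights = attention(Q, K, V)
print(weights[0])  # the first token puts more weight on the key that matches its query
```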
Embeddings & vectors
Meaning as geometry.
- Meaning as geometry An embedding is a list of numbers, a vector, that represents text as a point in a high-dimensional space where the geometry encodes meaning, so similar items land close together. The numbers are learned, not assigned, so each dimension is a latent feature rather than a human-named label.
- Measuring relatedness To compare two embeddings you measure the angle or distance between their vectors. Cosine similarity, the cosine of the angle between them, is the standard tool: higher means more related, and for length-one vectors it reduces to a dot product.
- Words to sentences Early embeddings were per-word, but real applications need one vector for a whole sentence or passage. Sentence-BERT adapted BERT with siamese and triplet networks to derive sentence embeddings that can be compared directly with cosine similarity, cutting the search for the most similar pair among 10,000 sentences from about 65 hours to about 5 seconds.
- Semantic search & RAG Semantic search embeds both the query and the documents and ranks by vector similarity, so it matches meaning rather than keywords. The same machinery underpins retrieval-augmented generation, which fetches relevant passages from a vector index and conditions a model's answer on them.
- Picking a model Embedding quality varies by task and language, so models are benchmarked. The Massive Text Embedding Benchmark (MTEB) finds that no single method dominates across all tasks, so the right embedding model depends on what you are doing with it.
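To make the comparison and ranking steps above concrete, here is a small sketch of cosine similarity and a similarity-ranked search. The three-dimensional vectors and document names are invented for illustration; real embeddings come from a trained model and typically have hundreds or thousands of dimensions, and a production system would use a vector index rather than a full scan.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot product divided by the product of lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank(query_vec, doc_vecs):
    """Return document names ordered from most to least similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in doc_vecs.items()]
    return [name for _, name in sorted(scored, reverse=True)]

# Hypothetical toy embeddings, not produced by any real model.
docs = {
    "cats": [0.9, 0.1, 0.0],
    "dogs": [0.8, 0.3, 0.1],
    "taxes": [0.0, 0.2, 0.9],
}
query = [1.0, 0.2, 0.0]

ranking = rank(query, docs)
print(ranking)
```

In a retrieval-augmented setup, the top-ranked passages would then be placed in the model's prompt before it answers.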
Context windows
How much a model can hold at once.
- Context Window Basics A context window is all the text a model can reference while producing a response, including the response itself. It acts as working memory and is measured in tokens, not words.
- What Fills the Window The window holds the system prompt, the conversation so far, any documents you paste in, and the response the model writes. Every turn adds to it, so usage grows as a chat continues.
- Why Limits Exist Self-attention has every token look at every other token, so its cost rises with the square of the length. Doubling the context roughly quadruples the work, which is why windows have a ceiling.
- Long Context, Lost Middle Some long-context models now accept a million tokens or more, but a bigger window is not used evenly. Models tend to find information at the start and end of the context better than information buried in the middle.
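The arithmetic behind the points above is simple enough to sketch. The first function counts the query-key scores one attention head computes over a full sequence, which is where the quadratic growth comes from; the second adds up what fills the window. The function names and all token counts are illustrative assumptions, not figures for any specific model.

```python
def attention_score_count(num_tokens):
    """Query-key scores for one head in one layer: every token scores every token."""
    return num_tokens * num_tokens

def tokens_remaining(window, system_prompt, conversation, documents, response_budget):
    """Tokens left in the window after everything that shares it is counted."""
    used = system_prompt + conversation + documents + response_budget
    return window - used

# Doubling the length quadruples the number of scores.
for n in (1_000, 2_000, 4_000):
    print(n, attention_score_count(n))

# Hypothetical budget for an 8,000-token window.
print(tokens_remaining(8_000, system_prompt=500, conversation=3_000,
                       documents=2_000, response_budget=1_000))
```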
Parameters & layers
What a model is made of, and how big.
- What a parameter is A parameter is one adjustable number inside the model. Training nudges all of these numbers until the model predicts text well, so the full set of parameters is where what the model learned actually lives.
- Layers and depth A language model is a stack of repeated layers. Each layer takes the running representation of the text and reworks it a little, and the count of layers is what people mean by depth.
- Reading model sizes The B in a model name is the parameter count in billions, so 7B is seven billion and 70B is seventy billion. That number drives memory: at 16-bit precision each billion parameters needs about 2 GB (about 4 GB at full 32-bit precision), and quantization trims that down.
- Weights vs activations Weights are the model's permanent, learned numbers and they stay the same on every run. Activations are the throwaway numbers computed for the specific text you give it, and that working memory grows with how much text the model is handling.
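A back-of-the-envelope estimate of weight memory follows directly from the parameter count and the bytes used per parameter. The sketch below covers weights only; activations and other runtime memory come on top and depend on how much text is being processed. The helper name and the precision table are assumptions for illustration, and 1 GB is taken as 10^9 bytes.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {
    "fp32": 4.0,   # full precision
    "fp16": 2.0,   # half precision (bf16 is the same size)
    "int8": 1.0,   # 8-bit quantization
    "int4": 0.5,   # 4-bit quantization
}

def weight_memory_gb(params_billion, precision):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billion * 1e9 * BYTES_PER_PARAM[precision]
    return total_bytes / 1e9

for size in (7, 70):
    for precision in ("fp32", "fp16", "int4"):
        print(f"{size}B at {precision}: about {weight_memory_gb(size, precision):.1f} GB")
```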
How text is generated
Sampling the next token.
- One token at a time A language model writes by predicting one token, appending it, then predicting again from the longer text. This step-by-step loop, where each new token depends on everything before it, is called autoregressive generation.
- Logits to probabilities At each step the model produces one raw score, called a logit, for every token in its vocabulary. A function called softmax converts those scores into probabilities that sum to 1, which is the distribution the model then picks from.
- Sampling controls Once the model has a probability distribution, it can take the single most likely token (greedy) or sample from the distribution. Temperature, top-k, and top-p are the knobs that control how adventurous that sampling is.
- When it stops The generation loop has to end somewhere. A model usually stops when it finishes its turn naturally, but you can also force a stop with custom stop sequences or by capping the number of tokens.
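The loop, the softmax step, the sampling knobs, and the stopping rules described above fit together roughly as follows. This is a sketch under stated assumptions: `toy_model` is a hypothetical stand-in that returns logits over a four-token vocabulary, token 3 plays the role of a stop token, and treating a temperature of 0 as greedy decoding is a convention chosen here, not something every API does.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def filter_top_k_top_p(probs, top_k=None, top_p=None):
    """Keep the top_k most likely tokens, then the smallest set whose
    cumulative probability reaches top_p, and renormalize what is left."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]
    if top_p is not None:
        kept, mass = [], 0.0
        for i in order:
            kept.append(i)
            mass += probs[i]
            if mass >= top_p:
                break
        order = kept
    total = sum(probs[i] for i in order)
    return {i: probs[i] / total for i in order}

def generate(next_logits, prompt, max_tokens, stop_token,
             temperature=1.0, top_k=None, top_p=None, rng=None):
    """Autoregressive loop: predict a token, append it, repeat until stop or cap."""
    rng = rng or random.Random(0)
    tokens = list(prompt)
    for _ in range(max_tokens):          # cap on generated tokens
        logits = next_logits(tokens)
        if temperature == 0:             # greedy: take the single most likely token
            token = max(range(len(logits)), key=lambda i: logits[i])
        else:
            dist = filter_top_k_top_p(softmax(logits, temperature), top_k, top_p)
            ids = list(dist)
            token = rng.choices(ids, weights=[dist[i] for i in ids])[0]
        tokens.append(token)
        if token == stop_token:          # stop condition reached
            break
    return tokens

def toy_model(tokens):
    """Hypothetical model: strongly prefers the token after the last one, modulo 4."""
    logits = [0.0, 0.0, 0.0, 0.0]
    logits[(tokens[-1] + 1) % 4] = 5.0
    return logits

print(generate(toy_model, [0], max_tokens=10, stop_token=3, temperature=0))
print(generate(toy_model, [0], max_tokens=10, stop_token=3, temperature=1.0, top_k=2))
```

With greedy decoding the toy model walks straight to the stop token; with sampling enabled the path can vary with the random seed, which is the trade-off the knobs control.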