
Subject 02 · Builds on Foundations

Language Model Architecture

How a language model is built: the transformer, attention, embeddings, context windows, and how text is generated one token at a time.

26 pages across 6 topics

The transformer

The architecture behind modern LLMs.

Before transformers: The transformer is the neural network architecture behind essentially every modern large language model. Before it, recurrent neural networks read text one word at a time, which made training slow and let information from early words fade over long sequences. The transformer drops recurrence so every word can attend to every other word at once.
Self-attention: This is the transformer's key move. For each word, the model scores every other word by relevance and blends their representations in proportion to those scores. Because no position has to wait for the one before it, the whole sequence is processed at once during training, and that parallelism is what let models train fast enough on GPUs to scale.
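The weigh-and-blend step can be sketched in a few lines of plain Python. This is a minimal illustration, not how any particular model implements it. It assumes the standard scaled dot-product form, and it uses the input vectors directly as queries, keys, and values, whereas a real model would first apply learned projection matrices. The names `softmax`, `self_attention`, and `tokens` are made up for this example.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Blend every vector with all the others, weighted by relevance.

    Simplification (an assumption of this sketch): queries, keys, and
    values are the input vectors themselves, with no learned projections.
    """
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Relevance of every position to this one: scaled dot product.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # Weighted blend of all vectors, one coordinate at a time.
        blended = [
            sum(w * v[i] for w, v in zip(weights, vectors))
            for i in range(dim)
        ]
        outputs.append(blended)
        all_weights.append(weights)
    return outputs, all_weights

# Three toy 2-d "word" vectors: the first two are similar, the third is not.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = self_attention(tokens)
```

Each row of `weights` sums to 1, and the first token puts more weight on the similar second token than on the dissimilar third. Nothing in the loop over `vectors` depends on an earlier iteration, which is why real implementations can compute every position in one batched matrix operation.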
The block, repeated: A transformer is not one big thing; it is one block design stacked many times, each copy with its own weights. Each block has a multi-head self-attention layer and a small feed-forward network, each wrapped in a residual connection and layer normalization. Positional information is supplied separately (in the original design, as an encoding added to the input embeddings) so the model knows word order.
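The block structure can be sketched as follows. This is an illustrative toy under several assumptions, not a reference implementation: attention is single-head with no learned projections, normalization is applied after each residual addition (the original "post-norm" ordering; many later models normalize first), positional encoding is left out, and the weights are random placeholders. All function names and sizes here are hypothetical.

```python
import math
import random

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(xs):
    """Single-head self-attention without learned projections (simplified)."""
    dim = len(xs[0])
    mixed = []
    for query in xs:
        weights = softmax(
            [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in xs]
        )
        mixed.append([sum(w * x[i] for w, x in zip(weights, xs)) for i in range(dim)])
    return mixed

def feed_forward(x, w_in, w_out):
    """Per-position network: expand, apply ReLU, project back down."""
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, row))) for row in w_in]
    return [sum(h * w for h, w in zip(hidden, row)) for row in w_out]

def layer_norm(x, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def add_and_norm(x, sublayer_out):
    """Residual connection followed by layer normalization."""
    return layer_norm([a + b for a, b in zip(x, sublayer_out)])

def block(xs, w_in, w_out):
    """One transformer block: attention, then feed-forward, each wrapped."""
    xs = [add_and_norm(x, a) for x, a in zip(xs, attention(xs))]
    return [add_and_norm(x, feed_forward(x, w_in, w_out)) for x in xs]

def make_layer(rng, dim, hidden):
    """Random placeholder weights standing in for trained parameters."""
    w_in = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(hidden)]
    w_out = [[rng.uniform(-1, 1) for _ in range(hidden)] for _ in range(dim)]
    return w_in, w_out

rng = random.Random(0)
layers = [make_layer(rng, dim=4, hidden=8) for _ in range(3)]  # three stacked blocks

out = [[1.0, 0.0, 0.5, -0.5], [0.2, 0.8, -0.3, 0.1], [0.0, 1.0, 0.0, -1.0]]
for w_in, w_out in layers:
    out = block(out, w_in, w_out)  # same shape in, same shape out
```

Because every block maps a sequence of vectors to a sequence of the same shape, stacking is just a loop. Depth comes from repeating one design, with each copy holding its own weights.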
Three families: The original transformer was an encoder-decoder built for translation. Later models specialized into three shapes: encoder-only for understanding text, decoder-only for generating it, and the full encoder-decoder for translation-style tasks. Today's chatbots are overwhelmingly decoder-only.
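One concrete way the encoder-only and decoder-only shapes differ is in which positions each token is allowed to see. The sketch below assumes the usual convention: an encoder attends in both directions, while a decoder applies a causal mask so that a token sees only itself and earlier tokens. The function name `attention_weights` and the scores are invented for illustration.

```python
import math

def attention_weights(scores, causal):
    """Softmax each row of a square score matrix.

    With causal=True, position i may only attend to positions j <= i;
    later positions are masked out with -inf before the softmax.
    """
    rows = []
    for i, row in enumerate(scores):
        visible = [
            s if (not causal or j <= i) else -math.inf
            for j, s in enumerate(row)
        ]
        peak = max(visible)  # always finite: position i can see itself
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows

# Made-up relevance scores for a three-token sequence.
scores = [[0.0, 1.0, 2.0],
          [1.0, 0.0, 1.0],
          [2.0, 1.0, 0.0]]

encoder_style = attention_weights(scores, causal=False)  # sees both directions
decoder_style = attention_weights(scores, causal=True)   # sees only the past
```

In `decoder_style` the first token attends only to itself and no row puts weight on a later position, which is what lets a decoder-only model be trained to predict each next token without seeing it. In `encoder_style` every token draws on the full sequence.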

Attention

How tokens look at each other.

Embeddings & vectors

Meaning as geometry.

Context windows

How much a model can hold at once.

Parameters & layers

What a model is made of, and how big.

How text is generated

Sampling the next token.