RAG & retrieval
What is RAG and what are its two phases?
Retrieval-augmented generation (RAG) lets a model answer from your own data by fetching relevant text and adding it to the prompt. It always runs in two phases: index your documents once offline, then retrieve the closest chunks at question time.
A model knows only what it absorbed during training, and that knowledge is frozen at a cutoff date. Retrieval-augmented generation (RAG) is the standard way around that limit: before the model answers, you fetch relevant text from your own data and add it to the prompt. The term and the method come from Lewis and colleagues in 2020, who paired a model’s built-in (parametric) memory with an external, searchable (non-parametric) memory.[1]
RAG always has two phases, and keeping them separate is the key to understanding it.
Index, once, offline. You take your documents, split them into chunks, turn each chunk into a vector with an embedding model, and store those vectors in an index. This is preparation; it happens before any question.
Retrieve, at question time. When a user asks something, you embed the question with the same embedding model, search the index for the closest chunks, add them to the prompt, and let the language model generate an answer grounded in that text.[2]
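The two phases can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: `embed` is a toy stand-in for a real embedding model (it just counts words), the index is a plain list rather than a vector store, and the names `chunk`, `build_index`, `retrieve`, and `build_prompt`, along with the sample documents, are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a sparse word-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def chunk(text, size=40):
    """Split text into chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def build_index(documents, size=40):
    """Phase 1 (offline): chunk each document, embed each chunk, store both."""
    index = []
    for doc in documents:
        for piece in chunk(doc, size):
            index.append((embed(piece), piece))
    return index


def retrieve(index, question, k=2):
    """Phase 2 (question time): embed the question the same way, rank chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[0]), reverse=True)
    return [text for _, text in ranked[:k]]


def build_prompt(question, chunks):
    """Add the retrieved chunks to the prompt ahead of the question."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Hypothetical sample data, for illustration only.
documents = [
    "Refunds are issued within 14 days of a return request.",
    "Support is available by email on weekdays from 9 to 5.",
    "Shipping to Europe takes five business days.",
]

index = build_index(documents)                 # done once, before any question
question = "How many days do refunds take?"
top = retrieve(index, question, k=1)           # done per question
prompt = build_prompt(question, top)           # this string would go to the model
print(prompt)
```

Note that `build_index` runs once, while `retrieve` and `build_prompt` run for every question, and both phases call the same `embed` function. In a real system the word counts would be replaced by dense vectors from an embedding model and the list by a vector index, but the split between the two phases stays the same.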
RAG building blocks
- FAISS: Meta's open-source library for fast similarity search over vectors; a common local starting point.
- Pinecone: Managed vector database with a widely cited RAG explainer.
- LlamaIndex: Framework for building the indexing and retrieval pipeline around your data.
- MTEB: Benchmark for comparing embedding models before you pick one.