RAG (2020): The Paper That Named Retrieval-Augmented Generation

Quick answer

RAG is the 2020 Facebook AI Research (FAIR) paper that coined “retrieval-augmented generation” and gave it a clean recipe: a parametric seq2seq generator (BART-large) conditioned on passages pulled from a non-parametric memory, a dense vector index of 21M Wikipedia chunks searched with DPR. Fine-tuned end-to-end, it set the state of the art on three open-domain QA benchmarks at the time, and it could update what the model “knows” by swapping the index, with no gradient updates at all. Almost every production RAG stack today descends from this design.

Parametric vs non-parametric memory

The paper’s framing is the part that stuck. A plain language model stores facts in its weights — parametric memory. That is fast but opaque: you cannot point at where an answer came from, and changing a fact means retraining. RAG adds a non-parametric memory: an external index of text you can read, cite, and edit. The generator no longer has to memorize the entire world; it learns how to use retrieved evidence. That split is why the abstract calls out two long-standing problems it addresses — providing provenance for a decision, and updating world knowledge — neither of which a parametric-only model handles cleanly.

How RAG retrieves and generates

The pipeline has two trained components. The retriever is DPR: a BERT-based question encoder embeds the query, and a maximum-inner-product search over the Wikipedia index returns the top-K passages (the paper trains with K of 5 or 10). The generator is BART-large, which reads the query concatenated with a retrieved passage and produces the answer token by token. The retrieved passage is treated as a latent variable and marginalized out over the top-K, so the whole model trains by backpropagation through the generator and the question encoder while the passage encoder and the document index stay fixed.
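The retrieval step can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the 3-dimensional vectors are made up, the search is brute force rather than an approximate index, and the function names `retrieve_top_k` and `retrieval_prior` are hypothetical. It shows inner-product scoring, top-K selection, and the softmax over retrieved scores that serves as the passage distribution p(z | x) used for marginalization.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def retrieve_top_k(query_vec, passage_vecs, k):
    """Brute-force maximum-inner-product search: score every passage, keep the k best."""
    scored = [(dot(query_vec, p), i) for i, p in enumerate(passage_vecs)]
    scored.sort(key=lambda s: s[0], reverse=True)
    return scored[:k]

def retrieval_prior(top_k):
    """Softmax over the top-k scores: p(z | x) restricted to the retrieved passages."""
    m = max(score for score, _ in top_k)  # subtract the max for numerical stability
    exps = [math.exp(score - m) for score, _ in top_k]
    total = sum(exps)
    return [(idx, e / total) for (_, idx), e in zip(top_k, exps)]

# Toy embeddings, invented for illustration only.
passages = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 0.0, 1.0],
]
query = [1.0, 0.5, 0.0]

top = retrieve_top_k(query, passages, k=2)
prior = retrieval_prior(top)
for idx, p in prior:
    print(f"passage {idx}: p(z|x) = {p:.3f}")
```

In a real system the passage vectors would come from a frozen passage encoder and the search would run against an approximate nearest-neighbor index; only the query vector changes as the question encoder is fine-tuned.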

The paper compares two variants. RAG-Sequence uses the same retrieved passage for the entire output and marginalizes over the top-K passages at the sequence level. RAG-Token marginalizes at every token, so each generated token can draw on a different passage, which helps when an answer needs to stitch facts from several documents. In the paper, RAG-Token scores higher on Jeopardy question generation, while RAG-Sequence is simpler and competitive on open-domain QA.
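The difference between the two variants is where the sum over passages sits. RAG-Sequence computes a sum over passages of p(z|x) times the product of token probabilities; RAG-Token computes a product over tokens of the passage-weighted token probability. The sketch below uses invented probabilities for a two-token answer and two passages, and the function names are hypothetical, but the arithmetic follows those two formulas.

```python
import math

def rag_sequence_prob(prior, token_probs):
    """Sum over passages of p(z|x) times the product of that passage's token probabilities."""
    return sum(p_z * math.prod(tokens) for p_z, tokens in zip(prior, token_probs))

def rag_token_prob(prior, token_probs):
    """Product over token positions of the passage-weighted token probability."""
    n_tokens = len(token_probs[0])
    result = 1.0
    for i in range(n_tokens):
        result *= sum(p_z * tokens[i] for p_z, tokens in zip(prior, token_probs))
    return result

# Invented numbers: each row is one passage, each column one answer token.
prior = [0.5, 0.5]
token_probs = [
    [0.9, 0.1],  # passage A supports token 1 well, token 2 poorly
    [0.1, 0.9],  # passage B is the reverse
]

seq = rag_sequence_prob(prior, token_probs)  # 0.5 * 0.09 + 0.5 * 0.09
tok = rag_token_prob(prior, token_probs)     # 0.5 * 0.5
print(f"RAG-Sequence: {seq:.2f}, RAG-Token: {tok:.2f}")
```

When no single passage supports the whole answer, the sequence-level product is small for every passage, while the token-level mixture can take the first token from one passage and the second from another. That is the multi-document case the text describes.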

Key results

Open-domain QA, state-of-the-art at publication: RAG topped three open-domain QA datasets — Natural Questions, TriviaQA, and WebQuestions — beating both parametric-only seq2seq models and task-specific retrieve-and-extract pipelines. Notably it generates answers rather than extracting a span, yet still beats extractive systems.
More factual, specific, diverse generation: on open-ended generation (e.g. Jeopardy question generation), human evaluators judged RAG output more factual and more specific than a BART-only baseline, and automatic metrics showed higher diversity.
Hot-swappable knowledge: the authors replaced the Wikipedia index with one from a different date and the model answered time-sensitive questions correctly against the new index — without any retraining. This is the cleanest demonstration in the paper of why non-parametric memory matters.
Less hallucination on knowledge tasks: grounding generation in retrieved passages reduced the confident-but-wrong failures that a closed-book generator produces.
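The hot-swap result is easy to picture with a toy. In the sketch below, `answer` is a hypothetical stand-in for a frozen model (it just returns the last word of the best-matching passage), word overlap stands in for dense retrieval, and the country and names are invented. The only thing that changes between the two calls is the index.

```python
def overlap(question, passage):
    """Toy relevance score: number of lowercase words shared by question and passage."""
    return len(set(question.lower().split()) & set(passage.lower().split()))

def answer(question, index):
    """Stand-in for a frozen model: return the last word of the best-matching passage.

    Nothing here is trained or updated; only `index` varies between calls.
    """
    best = max(index, key=lambda passage: overlap(question, passage))
    return best.rstrip(".").split()[-1]

# Two snapshots of a made-up knowledge source.
index_old = [
    "The president of Examplia is Alice.",
    "The capital of Examplia is Portville.",
]
index_new = [
    "The president of Examplia is Bob.",
    "The capital of Examplia is Portville.",
]

question = "Who is the president of Examplia"
print(answer(question, index_old))
print(answer(question, index_new))
```

The same function with the same "weights" gives a different, up-to-date answer once the index is replaced, which is the property the paper's index-swap experiment tests at scale.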

Why this is the paper to read

This is the original RAG paper, and reading it is still worth it even though the term now means “stick a vector DB in front of an LLM.” The 2020 version is more rigorous than most modern RAG stacks: retriever and generator are trained jointly, retrieval is a marginalized latent variable, and the knowledge-swap experiment is a real ablation, not a demo. Today’s production RAG — chunk documents, embed, top-K search, stuff into a prompt — is a simplified, training-free cousin of this. If you build RAG systems, this paper tells you which design choices were principled (separating memory, end-to-end training) and which were later dropped for engineering convenience (joint training, marginalization). My read: the industry kept RAG’s interface and threw away its training signal, which is exactly why retrieval quality is the bottleneck in most deployments.
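For contrast, the training-free pipeline mentioned above (chunk, score, top-K, stuff into a prompt) fits in a short sketch. Everything here is an assumption for illustration: word overlap replaces embedding similarity, the prompt template is arbitrary, and no LLM is called. Note that nothing in it is learned, which is the point of the comparison.

```python
def chunk_words(text, size):
    """Split text into consecutive chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def score(question, chunk):
    """Toy stand-in for embedding similarity: count of shared lowercase words."""
    return len(set(question.lower().split()) & set(chunk.lower().split()))

def build_prompt(question, chunks, k):
    """Keep the k best-scoring chunks and paste them ahead of the question."""
    top = sorted(chunks, key=lambda c: score(question, c), reverse=True)[:k]
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(top))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

document = (
    "DPR encodes questions and passages with BERT. "
    "BART generates the answer token by token. "
    "The index holds Wikipedia chunks."
)
chunks = chunk_words(document, size=7)
prompt = build_prompt("What generates the answer", chunks, k=1)
print(prompt)
```

If the scoring function ranks the wrong chunk first, the generator never sees the right evidence and no gradient flows back to correct the ranking.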

Limits and open questions

The setup is dated in ways that matter. The generator is BART-large (~400M parameters), tiny next to today’s LLMs, so absolute scores are far below current systems — read it for the idea, not the leaderboard. Retrieval is capped at a fixed top-K with a frozen index during fine-tuning, so the model cannot recover from a bad retriever: if DPR misses the relevant passage, RAG has nothing to ground on. The Wikipedia-only corpus limits domains. And the elegant end-to-end joint training the paper relies on is expensive and rarely reproduced in practice — most “RAG” today never trains the retriever and generator together, which means the marginalization story does not hold for those systems. Whether jointly training retrieval back into modern LLMs pays off remains genuinely open.

FAQ

What does RAG stand for, and who invented it?

RAG stands for retrieval-augmented generation. The term and the method come from this 2020 paper by Patrick Lewis and colleagues at Facebook AI Research (now Meta AI). It combines a pre-trained seq2seq generator with a neural retriever over an external text index.

How is the original RAG paper different from modern RAG?

The 2020 RAG trains the retriever and generator together and treats the retrieved document as a marginalized latent variable. Modern production RAG usually skips training entirely: it embeds chunks, runs top-K vector search, and pastes results into an LLM prompt. The interface is the same; the training rigor is not.

What is the difference between RAG-Sequence and RAG-Token?

RAG-Sequence conditions the entire generated answer on a single retrieved passage and marginalizes over the top-K passages at the sequence level. RAG-Token marginalizes per output token, so each token can draw on a different passage, which helps when an answer combines facts from multiple documents. In the paper, RAG-Token does better on Jeopardy question generation; RAG-Sequence is simpler and strong on QA.

Can RAG update its knowledge without retraining?

Yes — that is a headline result. Because the knowledge lives in a non-parametric index, swapping the Wikipedia index for a newer one let the model answer time-sensitive questions correctly with no gradient updates. This is why RAG is the standard way to give LLMs fresh or private knowledge today.

One line: separate what the model knows from what it can look up, train it to use the evidence, and you get answers that are more factual and editable. Read the original paper on arXiv.