[25.04.10] Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks - Paper-Reading-Study/2025 GitHub Wiki
Paper Reading Study Notes
General Information
Paper Title: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
Authors: Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela
Published In: NeurIPS 2020
Year: 2020 (version discussed: the arXiv revision of April 12, 2021)
Research Problem: How to enhance large pre-trained language models (parametric memory) by letting them access and incorporate explicit, up-to-date external knowledge (non-parametric memory) for knowledge-intensive tasks, where purely parametric models may hallucinate or lack specific facts.
Key Contributions: Introduced RAG (Retrieval-Augmented Generation) models combining a pre-trained retriever (DPR) with a pre-trained generator (BART). Proposed two variants: RAG-Sequence (same retrieved docs for the whole sequence) and RAG-Token (different docs potentially used per token). Showed state-of-the-art results on several open-domain QA tasks and improved factuality/specificity in generation tasks compared to BART alone.
Methodology/Approach: Uses DPR (Dense Passage Retriever) to find relevant documents (e.g., from Wikipedia) based on the input query. The generator (BART) then conditions on both the original input and the content of the retrieved document(s) to produce the output. The retrieved document is treated as a latent variable, marginalized out using a top-K approximation during generation. Different decoding strategies are needed for RAG-Sequence ("Thorough Decoding") vs. RAG-Token (standard beam search).
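The difference between the two marginalization schemes can be sketched numerically. This is a toy illustration of the two formulas, not the paper's implementation; all probabilities below are made-up placeholders, not model outputs:

```python
# Toy illustration of RAG's two marginalization schemes over top-K retrieved
# documents (K = 2 here). All probabilities are made-up placeholders.

# p(z|x): retriever scores for the two documents (already normalized)
p_doc = [0.7, 0.3]

# p(y_t | x, z, y_<t): generator token probabilities for a 3-token output,
# one row per retrieved document.
p_tok = [
    [0.9, 0.8, 0.7],  # token probs conditioned on document 0
    [0.2, 0.6, 0.5],  # token probs conditioned on document 1
]

def rag_sequence(p_doc, p_tok):
    # RAG-Sequence: the SAME document is used for the whole sequence, so we
    # marginalize once over complete sequences:
    # p(y|x) = sum_z p(z|x) * prod_t p(y_t | x, z, y_<t)
    total = 0.0
    for pz, toks in zip(p_doc, p_tok):
        seq_prob = 1.0
        for pt in toks:
            seq_prob *= pt
        total += pz * seq_prob
    return total

def rag_token(p_doc, p_tok):
    # RAG-Token: a different document may "win" at each position, so we
    # marginalize over documents at EVERY token:
    # p(y|x) = prod_t sum_z p(z|x) * p(y_t | x, z, y_<t)
    total = 1.0
    for t in range(len(p_tok[0])):
        total *= sum(pz * toks[t] for pz, toks in zip(p_doc, p_tok))
    return total

print(rag_sequence(p_doc, p_tok))  # ≈ 0.3708
print(rag_token(p_doc, p_tok))     # ≈ 0.3268
```

RAG-Token factorizes like an ordinary autoregressive model, which is why standard beam search works for it, while RAG-Sequence does not, which is what forces the special "Thorough Decoding" procedure.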
Results: RAG models outperformed strong parametric baselines (such as T5-11B in some settings) and retrieve-and-extract architectures on open-domain QA (Natural Questions, WebQuestions, TriviaQA, CuratedTrec). In human evaluations of Jeopardy question generation, RAG's output was judged more factual and specific than BART's. RAG remained effective even when only the query encoder and generator were fine-tuned, keeping the document index fixed.
Discussion Points
Strengths:
The framing of combining "parametric" and "non-parametric" memory was found novel and insightful (Attendee 2).
Seen as a practical implementation of hybrid AI, combining neural generation with symbolic-like retrieval (Attendee 2).
Ability to synthesize information from multiple documents for generation, not just extract (highlighted by results/heatmaps).
Addresses limitations of pure parametric models (factuality, knowledge updates).
Outperformed baselines significantly in the paper's experiments.
Weaknesses:
Strong skepticism about the fundamental accuracy of the approximate vector search (FAISS mentioned) used by the retriever; Attendee 1 considered it only marginally better than no retrieval at all.
The decoding process for RAG-Sequence ("Thorough Decoding") is complex and computationally intensive, requiring extra forward passes for hypotheses not found in all beams.
Potential for "retrieval collapse" where the retriever learns to ignore the input, especially for less knowledge-grounded tasks like story generation (mentioned in transcript and paper appendix).
Key Questions:
How reliable is the performance of vector search in practice? (Attendee 1 doubts it).
How exactly is the probability of a sequence y under a document z computed during RAG-Sequence decoding if y was never generated in the beam search for z? (Answered during discussion: it requires an additional "forced" forward pass, teacher-forcing y through the generator conditioned on z.)
Is RAG still relevant compared to potentially more advanced knowledge representation/retrieval methods like graph databases or knowledge ontologies (Palantir mentioned as an example)?
Connections to Other Work:
Relates to the debate between pure LLM approaches vs. hybrid systems incorporating symbolic components (LeCun/Chollet mentioned by Attendee 2).
Builds upon dense retrieval methods like DPR and vector search libraries like FAISS.
Contrasted with potentially more structured knowledge approaches like graph databases/ontologies (e.g., Palantir).
Notes and Reflections
Interesting Insights:
The parametric/non-parametric memory distinction is a useful conceptual framework.
RAG can be viewed as a step towards more robust, verifiable, and updatable AI by grounding generation in explicit knowledge.
The discussion highlighted the non-trivial complexity hidden behind the high-level concept, especially in the RAG-Sequence decoding.
Attendee 1's persistent skepticism about vector search accuracy provides a counterpoint to its widespread adoption.
Lessons Learned:
Understanding the difference between RAG-Token and RAG-Sequence (especially their decoding methods) is key.
Combining pre-trained components (retriever, generator) can be powerful but requires careful handling of their interaction (e.g., marginalization, decoding).
Evaluating the retrieval component's quality is as important as the generator's.
Future Directions:
Exploring alternative retrieval mechanisms beyond dense vectors (e.g., graph-based).
Investigating methods to prevent retrieval collapse.
Comparing RAG's performance and efficiency against more recent knowledge-grounding techniques.
Joint pre-training of retriever and generator components (mentioned in paper).