RAG is a contract between memory and honesty
Retrieval-augmented generation works when search, chunking, and abstention are engineered—not when a vector database is sprinkled on top of a chatbot.
The pipeline is the product
Ingestion splits documents into chunks; embeddings place them in a vector space; queries retrieve neighbors; prompts assemble context; the model generates. Weakness anywhere propagates: a near-miss chunk yields fluent hallucination; stale indexes surface outdated policy; duplicated chunks amplify noise. Hybrid search (lexical + semantic) and metadata filters often beat pure embedding k-NN for enterprise content.
Windows and redundancy
Retrieved text must fit the context window alongside instructions and user messages. That forces prioritization, deduplication, and sometimes hierarchical summarization—ties to token budgets and latency.
When to refuse
Strong systems expose low-confidence retrieval and train policies to say “I cannot find that in your corpus.” This overlaps with alignment work ( RLHF) but is fundamentally an information-design problem—also relevant to automation bias when UIs hide uncertainty.
Agents complicate trust boundaries
Tool-using systems may fetch live data—see agents—blurring lines between static corpora and mutable APIs; injection risks rise.
Access control and row-level security
Enterprise corpora carry permission boundaries; retrieval must respect ACLs so users never see snippets from documents they should not access—embedding indexes must store security metadata alongside vectors. Failures here are compliance incidents, not mere quality bugs—see privacy.
Answer-time vs index-time enrichment
Some teams enrich chunks with summaries or entity tags at index time; others enrich at query time with lightweight classifiers. Trade-offs involve latency, staleness, and maintenance—coordinate with latency budgets.