In our previous articles, we explored how LLMs are trained on vast datasets and how the transformer architecture helps them predict the next word (token). We then learned about prompt engineering and discovered how the input we provide significantly influences an LLM's output quality.
But here's a critical limitation: What happens when we need an LLM to respond about information it wasn't trained on?
Imagine asking ChatGPT about events that happened after its training cutoff, or about your company's internal documentation that it has never seen.

In such cases, the LLM might generate responses that are inaccurate or entirely fabricated. This phenomenon is known as hallucination: the AI confidently provides incorrect information.
To address this limitation and enhance the accuracy of LLM responses, we can use a technique called Retrieval-Augmented Generation (RAG). RAG combines the strengths of information retrieval systems with generative models to provide more reliable and contextually relevant answers.
Think of RAG as giving an AI assistant access to a constantly updated library of information, rather than relying solely on its training memory.
The process begins by converting textual data into numerical representations called embeddings. These aren't just random numbers—they capture the semantic meaning of words, phrases, or entire documents in high-dimensional space.
Example: The words "car" and "automobile" would have very similar embeddings, even though they're spelled differently, because they have similar meanings.
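To make this concrete, here is a minimal sketch of generating embeddings and comparing them, assuming the open-source sentence-transformers library and the all-MiniLM-L6-v2 model (any embedding model works the same way):

```python
# Minimal embedding sketch (assumes: pip install sentence-transformers).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Each string becomes a 384-dimensional vector capturing its meaning.
embeddings = model.encode(["car", "automobile", "banana"])

# Cosine similarity: semantically close words score near 1, unrelated words much lower.
print(util.cos_sim(embeddings[0], embeddings[1]))  # "car" vs "automobile" -> high
print(util.cos_sim(embeddings[0], embeddings[2]))  # "car" vs "banana"     -> low
```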
These embeddings are stored in a vector database—a specialized database designed to handle high-dimensional data efficiently. Popular options include Pinecone, Weaviate, and Chroma.
Why vectors? They enable lightning-fast similarity searches. Instead of searching through text word-by-word, the system can find semantically similar content in milliseconds.
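As a rough illustration, here is how documents might be loaded into Chroma (one of the options above). The collection name and document text are made up for this example; Chroma's built-in default embedding function turns the text into vectors automatically:

```python
# Storing documents in a vector database (assumes: pip install chromadb).
import chromadb

client = chromadb.Client()  # in-memory instance, fine for experimenting
collection = client.create_collection("support_docs")

# Chroma embeds each document with its default embedding function and stores the vectors.
collection.add(
    documents=[
        "To reset your password, open Settings > Security and click 'Reset password'.",
        "Invoices can be downloaded from the Billing page of your account.",
    ],
    ids=["doc-1", "doc-2"],
)
```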
When you ask a question, RAG first converts your query into an embedding, then searches the vector database for the most similar embeddings. This retrieval step gives the model access to up-to-date, relevant information it was never trained on.
Example: Query "How do I reset my password?" → Retrieves company documentation about password reset procedures.
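Continuing the Chroma sketch above, the retrieval step looks roughly like this: the query is embedded, compared against the stored vectors, and the closest document comes back as context:

```python
# Retrieval: embed the query and find the nearest stored document.
results = collection.query(
    query_texts=["How do I reset my password?"],
    n_results=1,
)

# Chroma returns one list of matches per query; take the top match for our single query.
retrieved_context = results["documents"][0][0]
print(retrieved_context)
# -> "To reset your password, open Settings > Security and click 'Reset password'."
```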
The retrieved information is then used to augment your original prompt, providing the LLM with additional context. This is like giving the AI relevant reference materials before asking it to answer.
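In code, the augmentation step is often just prompt assembly. Here is a simple sketch that reuses the retrieved_context from the retrieval example above; the exact template wording is up to you:

```python
# Augmentation: place the retrieved text into the prompt alongside the question.
question = "How do I reset my password?"

augmented_prompt = (
    "Answer the question using only the context below. "
    "If the context does not contain the answer, say so.\n\n"
    f"Context:\n{retrieved_context}\n\n"
    f"Question: {question}"
)
```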
Finally, the LLM generates a response based on the augmented input. By incorporating relevant retrieved data, the LLM produces outputs that are more reliable and less prone to hallucinations.
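The final step sends the augmented prompt to an LLM. The sketch below assumes the openai Python client and an example model name; any chat-capable model can be swapped in:

```python
# Generation: the LLM answers using the retrieved context in the prompt.
from openai import OpenAI

llm = OpenAI()  # reads the OPENAI_API_KEY environment variable
response = llm.chat.completions.create(
    model="gpt-4o-mini",  # example model name, swap in whatever you use
    messages=[{"role": "user", "content": augmented_prompt}],
)
print(response.choices[0].message.content)
```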
[Figure: RAG architecture diagram]
Let's see how RAG transforms AI assistance with a concrete example:
Lawyer's Question: "What's the current precedent for fair use in copyright cases involving AI-generated art?"
Without RAG: The system generates a response based only on its training data, which might cite outdated precedents, miss rulings issued after its training cutoff, or even fabricate plausible-sounding case citations.

With RAG:

Step 1: Retrieval. The system searches an up-to-date legal knowledge base and retrieves the most relevant recent court decisions and commentary on fair use and AI-generated art.

Step 2: Generation. The system generates a response that summarizes and cites the retrieved authorities, grounding its answer in current precedent rather than in what it memorized during training.
Result: The lawyer receives accurate, current information based on the most recent legal authorities, not outdated training data.
RAG is particularly effective for domains where information changes frequently (news, law, medicine), for answering questions over proprietary or internal documents, and for customer-support scenarios like the password-reset example above.
At its core, RAG extends LLM capabilities by providing external, updatable memory that complements the model's inherent language understanding abilities. It's like giving an AI assistant access to a constantly updated, searchable library of information.
This is part of our series "Understanding What's Behind AI Chatbot." Check out our previous articles on Large Language Models and Prompt Engineering to build your AI knowledge foundation.