Install this specific skill from the multi-skill repository:

`npx skills add qodex-ai/ai-agent-skills --skill "rag-agent-builder"`
# Description
Build Retrieval-Augmented Generation (RAG) applications that combine LLM capabilities with external knowledge sources. Covers vector databases, embeddings, retrieval strategies, and response generation. Use when building document Q&A systems, knowledge base applications, enterprise search, or combining LLMs with custom data.
# SKILL.md
name: rag-agent-builder
description: Build Retrieval-Augmented Generation (RAG) applications that combine LLM capabilities with external knowledge sources. Covers vector databases, embeddings, retrieval strategies, and response generation. Use when building document Q&A systems, knowledge base applications, enterprise search, or combining LLMs with custom data.
RAG Agent Builder
Build powerful Retrieval-Augmented Generation (RAG) applications that enhance LLM capabilities with external knowledge sources, enabling accurate, contextualized AI responses.
Quick Start
Get started with RAG implementations in the examples and utilities:
- Examples: See the `examples/` directory for complete implementations:
  - `basic_rag.py` - Simple chunk-embed-retrieve-generate pipeline
  - `retrieval_strategies.py` - Hybrid search, reranking, and filtering
  - `agentic_rag.py` - Agent-controlled retrieval with iterative refinement
- Utilities: See the `scripts/` directory for helper modules:
  - `embedding_management.py` - Embedding generation, normalization, and caching
  - `vector_db_manager.py` - Vector database abstraction and factory
  - `rag_evaluation.py` - Retrieval and answer quality metrics
Overview
RAG systems combine three key components:
1. Document Retrieval - Find relevant information from knowledge bases
2. Context Integration - Pass retrieved context to the LLM
3. Response Generation - Generate answers grounded in the retrieved information
This skill covers building production-ready RAG applications with various frameworks and approaches.
Core Concepts
What is RAG?
RAG augments LLM knowledge with external data:
- Without RAG: LLM relies on training data (may be outdated or limited)
- With RAG: LLM uses real-time, custom knowledge + training knowledge
When to Use RAG
- Document Q&A: Answer questions about PDFs, books, reports
- Knowledge Base Search: Query internal documentation, wikis
- Enterprise Search: Search proprietary company data
- Context-Specific Assistants: Customer support, HR assistants
- Fact-Heavy Applications: Legal docs, medical records, financial data
When RAG Might Not Be Needed
- General knowledge questions (ChatGPT-like)
- Real-time data that changes constantly (use tools instead)
- Very simple lookup tasks (use database queries)
Architecture Patterns
Basic RAG Pipeline
Documents → Chunks → Embeddings → Vector DB
                                      ↓
User Question → Embedding → Retrieval → LLM → Answer
                                ↑         ↑
                            Vector DB   Context
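The pipeline above can be sketched in a few lines of Python. This is a minimal illustration, not code from the bundled `examples/` scripts; it assumes the `sentence-transformers` and `chromadb` packages are installed, and `call_llm` stands in for whatever chat-completion client you use.

```python
# Minimal RAG pipeline sketch: chunk -> embed -> store -> retrieve -> generate.
# Requires: pip install sentence-transformers chromadb
import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
collection = chromadb.Client().create_collection("docs")

# Index: embed each chunk and store it with an id.
chunks = ["RAG combines retrieval with generation.", "Embeddings map text to vectors."]
collection.add(
    ids=[f"chunk-{i}" for i in range(len(chunks))],
    documents=chunks,
    embeddings=model.encode(chunks).tolist(),
)

# Query: embed the question, retrieve the top-k chunks, build a grounded prompt.
question = "What does RAG combine?"
results = collection.query(query_embeddings=model.encode([question]).tolist(), n_results=2)
context = "\n".join(results["documents"][0])
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
# answer = call_llm(prompt)  # call_llm is a placeholder for your chat-completion client
```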
Advanced RAG Patterns
1. Agentic RAG
- Agent decides what to retrieve and when
- Can refine queries iteratively
- Better for complex reasoning
2. Hierarchical RAG
- Multi-level document structure
- Search at different levels of detail
- More flexible organization
3. Hybrid Search RAG
- Combines keyword search (BM25) + semantic search (embeddings)
- Captures both exact matches and meaning
- Better for mixed query types
4. Corrective RAG (CRAG)
- Evaluates retrieved documents for relevance
- Retrieves additional sources if needed
- Ensures high-quality context
Implementation Components
1. Document Processing
Chunking Strategies:
# Simple fixed-size chunks
chunks = split_text(doc, chunk_size=1000, overlap=100)
# Semantic chunks (group by meaning)
chunks = semantic_chunking(doc, max_tokens=512)
# Hierarchical chunks (different levels)
chapters = split_by_heading(doc)
chunks = split_each_chapter(chapters, size=1000)
Key Considerations:
- Chunk size affects retrieval quality and cost
- Overlap helps maintain context between chunks
- Semantic chunking preserves meaning better
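The `split_text` call in the snippet above is a placeholder. A minimal, character-based implementation of fixed-size chunking with overlap could look like this sketch (token-based splitting follows the same pattern):

```python
def split_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into fixed-size chunks whose boundaries overlap by `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

# 2500 characters -> chunks starting at positions 0, 900, 1800
print(len(split_text("x" * 2500)))  # 3
```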
2. Embedding Generation
Popular Embedding Models:
- OpenAI: text-embedding-3-small, text-embedding-3-large
- Open Source: all-MiniLM-L6-v2, all-mpnet-base-v2
- Domain-Specific: Domain-trained embeddings for specialized knowledge
Best Practices:
- Use consistent embedding model for retrieval and queries
- Store embeddings with normalized vectors
- Update embeddings when documents change
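As one concrete way to apply these practices, the open-source models listed above can be used through the sentence-transformers package. This sketch assumes that package is installed and normalizes vectors so a plain dot product equals cosine similarity:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def embed(texts: list[str]) -> np.ndarray:
    """Embed texts and L2-normalize them, so a dot product equals cosine similarity."""
    return np.asarray(model.encode(texts, normalize_embeddings=True))

corpus_vecs = embed(["RAG grounds answers in retrieved context."])
query_vec = embed(["What grounds RAG answers?"])[0]
scores = corpus_vecs @ query_vec  # cosine similarities, thanks to normalization
```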
3. Vector Databases
Popular Options:
- Pinecone: Managed, serverless, easy to scale
- Weaviate: Open-source, self-hosted, flexible
- Milvus: Open-source, high performance
- Chroma: Lightweight, good for prototypes
- Qdrant: Production-grade, high-performance
Selection Criteria:
- Scale requirements (data volume, queries per second)
- Latency needs (real-time vs batch)
- Cost considerations
- Deployment preferences (managed vs self-hosted)
4. Retrieval Strategies
Retrieval Methods:
# Similarity search (most common)
results = vector_db.query(question_embedding, k=5)
# Hybrid search (keyword + semantic)
keyword_results = bm25.search(question, k=3)
semantic_results = vector_db.query(embedding, k=3)
results = combine_and_rank(keyword_results, semantic_results)
# Reranking (improve relevance)
retrieved = initial_retrieval(query)
reranked = rerank_by_relevance(retrieved, query)
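One common way to implement the `combine_and_rank` step above is reciprocal rank fusion (RRF). The sketch below assumes the `rank_bm25` package; the semantic ranking is stubbed in where a vector-store query would normally go:

```python
from rank_bm25 import BM25Okapi

documents = {
    "d1": "how to reset a password in the admin console",
    "d2": "embedding models map text to dense vectors",
    "d3": "passwords must be rotated every 90 days",
}
doc_ids = list(documents)
bm25 = BM25Okapi([text.split() for text in documents.values()])

def keyword_search(query: str, k: int = 3) -> list[str]:
    """BM25 keyword ranking over the small in-memory corpus."""
    scores = bm25.get_scores(query.split())
    ranked = sorted(zip(doc_ids, scores), key=lambda pair: pair[1], reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

def reciprocal_rank_fusion(*rankings: list[str], k: int = 60) -> list[str]:
    """Merge ranked lists: each doc scores the sum of 1 / (k + rank) across lists."""
    fused: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

keyword_ranking = keyword_search("password policy", k=2)
semantic_ranking = ["d3", "d1"]  # stand-in for a vector-store query on the same question
results = reciprocal_rank_fusion(keyword_ranking, semantic_ranking)
```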
Retrieval Parameters:
- k (number of results): Balance between context and relevance
- Similarity threshold: Filter out low-relevance results
- Diversity: Return varied results vs best matches
5. Context Integration
Context Window Management:
# Fit retrieved documents into the context window
def prepare_context(retrieved_docs, max_tokens=3000):
    context = ""
    for doc in retrieved_docs:
        if len(tokenize(context + doc)) <= max_tokens:
            context += doc
        else:
            break
    return context
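The `tokenize` helper above is a placeholder. With OpenAI-family models, token counting can be done with tiktoken, as in this sketch:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def prepare_context(retrieved_docs: list[str], max_tokens: int = 3000) -> str:
    """Greedily pack retrieved documents until the token budget is exhausted."""
    packed, used = [], 0
    for doc in retrieved_docs:
        doc_tokens = len(enc.encode(doc))
        if used + doc_tokens > max_tokens:
            break
        packed.append(doc)
        used += doc_tokens
    return "\n\n".join(packed)
```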
Prompt Design:
You are a helpful assistant. Answer the question based on the provided context.
Context:
{retrieved_documents}
Question: {user_question}
Answer:
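Filling that template and calling a chat model is straightforward. This sketch assumes the `openai` Python package (v1+) with an `OPENAI_API_KEY` in the environment; the model name is only an example:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT_TEMPLATE = """You are a helpful assistant. Answer the question based on the provided context.

Context:
{retrieved_documents}

Question: {user_question}

Answer:"""

def answer(user_question: str, retrieved_documents: str) -> str:
    prompt = PROMPT_TEMPLATE.format(
        retrieved_documents=retrieved_documents,
        user_question=user_question,
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # example model name; use whatever your stack provides
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```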
6. Response Generation
Generation Strategies:
- Direct Generation: LLM answers from context
- Summarization: Summarize multiple retrieved docs first
- Fact-Grounding: Ensure answer cites sources
- Iterative Refinement: Refine based on user feedback
Implementation Patterns
Pattern 1: Basic RAG
Simplest RAG implementation:
1. Split documents into chunks
2. Generate embeddings for each chunk
3. Store in vector database
4. Retrieve top-k similar chunks for query
5. Pass to LLM with context
Pros: Simple, fast, works well for straightforward QA
Cons: May miss relevant context, no refinement
Pattern 2: Agentic RAG
Agent controls retrieval:
1. Agent receives user question
2. Decides whether to retrieve documents
3. Formulates retrieval query (may differ from original)
4. Retrieves relevant documents
5. Can iterate or use tools
6. Generates final answer
Pros: Better for complex questions, iterative improvement
Cons: More complex, higher costs
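A minimal version of that control loop, with the retrieval decision delegated to the LLM (every helper here, `llm_decide`, `retrieve`, and `llm_answer`, is an illustrative placeholder rather than part of this skill's scripts):

```python
def agentic_answer(question: str, max_steps: int = 3) -> str:
    """Let the agent decide whether to retrieve before answering (sketch)."""
    notes: list[str] = []
    for _ in range(max_steps):
        # Ask the model for the next action given the question and notes so far,
        # e.g. {"action": "search", "query": "..."} or {"action": "answer"}.
        decision = llm_decide(question, notes)
        if decision["action"] == "answer":
            break
        # The retrieval query may differ from the user's original wording.
        notes.extend(retrieve(decision["query"], k=4))
    return llm_answer(question, notes)
```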
Pattern 3: Corrective RAG (CRAG)
Validates retrieved documents:
1. Retrieve documents for question
2. Grade each document for relevance
3. If poor relevance:
- Try different retrieval strategy
- Expand search scope
- Retrieve from different sources
4. Generate answer from validated context
Pros: Higher quality answers, adapts to failures
Cons: More API calls, slower
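A sketch of the validation step; `retrieve`, `grade_relevance` (typically an LLM call returning a 0-1 score), and the `web_search` fallback are illustrative placeholders:

```python
def corrective_retrieve(question: str, k: int = 5, threshold: float = 0.7) -> list[str]:
    """Keep only documents graded as relevant; broaden the search if too few survive."""
    candidates = retrieve(question, k=k)                 # primary vector retrieval
    graded = [(doc, grade_relevance(question, doc)) for doc in candidates]
    relevant = [doc for doc, score in graded if score >= threshold]
    if len(relevant) < 2:                                # poor retrieval: fall back
        relevant += web_search(question, k=k)
    return relevant
```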
Popular Frameworks
LangChain
from langchain.document_loaders import PyPDFLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Pinecone
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA
# Load documents
loader = PyPDFLoader("document.pdf")
docs = loader.load()
# Create RAG chain (assumes an existing Pinecone index and API keys in the environment)
embeddings = OpenAIEmbeddings()
vectorstore = Pinecone.from_documents(docs, embeddings, index_name="your-index")
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(),
    chain_type="stuff",
    retriever=vectorstore.as_retriever()
)
answer = qa.run("What is the document about?")
LlamaIndex
from llama_index import GPTVectorStoreIndex, SimpleDirectoryReader
# Load documents
documents = SimpleDirectoryReader("./data").load_data()
# Create index
index = GPTVectorStoreIndex.from_documents(documents)
# Query
response = index.as_query_engine().query("What is the main topic?")
CrewAI with RAG
from crewai import Agent, Task, Crew
from tools import retrieval_tool  # your own tool wrapping the knowledge-base retriever
researcher = Agent(
    role="Research Assistant",
    goal="Research topics using the knowledge base",
    backstory="You answer questions using the indexed documents.",
    tools=[retrieval_tool]
)
research_task = Task(
    description="Research the topic: {topic}",
    expected_output="A short, well-sourced summary of the topic",
    agent=researcher
)
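To actually run the agent, assemble a Crew and kick it off with the task inputs; a short sketch (check parameter names against your installed CrewAI version):

```python
# Assemble and run the crew; {topic} in the task description is filled from inputs.
crew = Crew(agents=[researcher], tasks=[research_task])
result = crew.kickoff(inputs={"topic": "retrieval-augmented generation"})
print(result)
```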
Best Practices
Document Preparation
- ✅ Clean and normalize text (remove headers, footers)
- ✅ Preserve document structure when possible
- ✅ Add metadata (source, date, category)
- ✅ Handle PDFs with OCR if scanned
- ✅ Test chunk sizes for your domain
Embedding Strategy
- ✅ Use same embedding model for indexing and queries
- ✅ Fine-tune embeddings for domain-specific needs
- ✅ Normalize embeddings for consistency
- ✅ Monitor embedding quality metrics
Retrieval Optimization
- ✅ Tune k (number of results) for your use case
- ✅ Use reranking for quality improvement
- ✅ Implement relevance filtering
- ✅ Monitor retrieval precision and recall
- ✅ Cache frequently retrieved documents
Generation Quality
- ✅ Include source citations in answers
- ✅ Prompt LLM to indicate confidence
- ✅ Ask to cite specific documents
- ✅ Generate summaries for long contexts
- ✅ Validate answers against context
Monitoring & Evaluation
- ✅ Track retrieval metrics (precision, recall, MRR)
- ✅ Monitor answer quality and relevance
- ✅ Log failed retrievals for improvement
- ✅ Collect user feedback
- ✅ Iterate based on failures
Common Challenges & Solutions
Challenge: Irrelevant Retrieval
Solutions:
- Improve chunking strategy
- Better embedding model
- Add document metadata to queries
- Implement reranking
- Use hybrid search
Challenge: Context Too Large
Solutions:
- Reduce chunk size
- Retrieve fewer results (smaller k)
- Summarize retrieved context
- Use hierarchical retrieval
- Filter by relevance score
Challenge: Missing Information
Solutions:
- Increase k (retrieve more)
- Improve embedding model
- Better preprocessing
- Use multiple search strategies
- Add document hierarchy
Challenge: Slow Performance
Solutions:
- Use managed vector database
- Cache embeddings
- Batch process documents
- Optimize chunk size
- Use smaller embedding model for speed
Evaluation Metrics
Retrieval Metrics:
- Precision: % of retrieved docs that are relevant
- Recall: % of relevant docs that are retrieved
- MRR (Mean Reciprocal Rank): Average of 1/rank of the first relevant result, across queries
- NDCG (Normalized DCG): Quality of ranking
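A small sketch computing these for a single query, given the retrieved document ids and the ground-truth relevant ids (MRR is the mean of the reciprocal rank over a query set):

```python
def retrieval_metrics(retrieved: list[str], relevant: set[str]) -> dict[str, float]:
    """Precision, recall, and reciprocal rank for one query."""
    hits = [doc_id for doc_id in retrieved if doc_id in relevant]
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    reciprocal_rank = 0.0
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            reciprocal_rank = 1.0 / rank
            break
    return {"precision": precision, "recall": recall, "reciprocal_rank": reciprocal_rank}

# Example: 2 of 3 retrieved docs are relevant; the first relevant doc is at rank 1.
print(retrieval_metrics(["d1", "d4", "d2"], {"d1", "d2", "d5"}))
```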
Answer Quality Metrics:
- Relevance: Does answer address the question?
- Correctness: Is the answer factually accurate?
- Grounding: Is answer supported by context?
- User Satisfaction: Would user find answer helpful?
Advanced Techniques
1. Query Expansion
# Expand query with related terms
expanded_query = query + " " + synonym_expansion(query)
results = retrieve(expanded_query)
2. Document Compression
# Compress retrieved docs before passing to LLM
compressed = compress_documents(retrieved_docs, query)
context = format_context(compressed)
3. Active Retrieval
# Iteratively refine retrieval based on LLM output
query = user_question
for _ in range(max_iterations):
    results = retrieve(query)
    answer = generate_with_context(results)
    if answer_complete(answer):
        break
    query = refine_query(answer)
4. Multi-Modal RAG
# Retrieve both text and images
text_results = text_retriever.query(question)
image_results = image_retriever.query(question)
context = combine_multimodal(text_results, image_results)
Resources & References
Key Papers
- "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (Lewis et al.)
- "REALM: Retrieval-Augmented Language Model Pre-Training" (Guu et al.)
Frameworks
- LangChain: https://python.langchain.com/
- LlamaIndex: https://www.llamaindex.ai/
- Haystack: https://haystack.deepset.ai/
Vector Databases
- Pinecone: https://www.pinecone.io/
- Weaviate: https://weaviate.io/
- Qdrant: https://qdrant.tech/
Embedding Models
- OpenAI: https://platform.openai.com/docs/guides/embeddings
- Hugging Face: https://huggingface.co/models?pipeline_tag=sentence-similarity
Next Steps
- Choose your stack: Decide on framework (LangChain, LlamaIndex, etc.)
- Prepare documents: Process and chunk your knowledge base
- Select embeddings: Choose embedding model for your domain
- Pick vector DB: Select storage solution for scale
- Build pipeline: Implement retrieval and generation
- Evaluate: Test on sample questions and iterate
- Monitor: Track quality metrics in production
# Supported AI Coding Agents
This skill is compatible with the SKILL.md standard and works with all major AI coding agents.
Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.