What's the difference between RAG and fine-tuning?
RAG retrieves external documents at query time to augment the LLM's context; fine-tuning updates the model's weights with domain data. RAG is faster to implement, cheaper, and better for frequently changing data (news, product catalogs, customer data). Fine-tuning is better when you need consistent style/behavior changes or have 100+ domain-specific examples. In 2026: 80% of production AI apps use RAG, 15% use fine-tuning, 5% use both. Rule of thumb: start with RAG because it's 10x faster to ship, and switch to fine-tuning only if RAG can't deliver the consistency or behavior you need.
Why do my RAG results have terrible quality?
Quality depends on five things. (1) Chunking strategy: chunks that are too large add noise; too small, and meaning fragments. Test chunk sizes of 256-512 tokens with 10-20% overlap. (2) Embedding model choice: OpenAI text-embedding-3-large beats older models. (3) Hybrid search: vector-only misses exact matches; add BM25 keyword search. (4) Metadata filtering: don't vector-search everything; filter by date/category first. (5) Reranking: use a lightweight reranker (CrossEncoder from sentence-transformers) to re-score retrieved docs, as in the sketch below. Most failures are chunking issues. Fix chunking first before blaming the embedding model.
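A minimal reranking sketch for step (5), assuming candidates were already retrieved by your vector/hybrid search; the model name is one of the standard sentence-transformers cross-encoders, not a requirement:

```python
from sentence_transformers import CrossEncoder

# A cross-encoder scores each (query, passage) pair jointly: too slow for first-pass
# retrieval, but accurate for re-scoring a short candidate list.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    # Score every retrieved chunk against the query, then keep the best top_k.
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]

# Usage: retrieve ~20-50 candidates with hybrid search, rerank, keep the top 5 for the prompt.
```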
How do I evaluate whether my RAG system works?
Separate evaluation into three parts: (1) Retrieval quality: are the right documents returned? Measure Hit Rate and MRR (see the snippet below). (2) Generation quality: given those documents, is the answer correct? Use LLM-as-judge (GPT scores your answer). (3) End-to-end: does the user get a useful answer? Use human evaluation plus production metrics (thumbs up/down). Don't optimize for one metric in isolation: a 100% retrieval score means nothing if generation is hallucinating. Use a framework like RAGAS to track all three.
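Hit Rate and MRR are simple enough to compute yourself while wiring up a full framework. A minimal sketch, assuming you have labeled (query, relevant document id) pairs and your own `retrieve` function (both hypothetical here):

```python
def hit_rate_and_mrr(eval_set, retrieve, k=10):
    """eval_set: list of (query, relevant_doc_id); retrieve(query, k) -> ranked list of doc ids."""
    hits, reciprocal_ranks = 0, []
    for query, relevant_id in eval_set:
        ranked_ids = retrieve(query, k)
        if relevant_id in ranked_ids:
            hits += 1
            # Rank is 1-based: a hit at position 1 contributes 1.0, position 2 contributes 0.5, ...
            reciprocal_ranks.append(1.0 / (ranked_ids.index(relevant_id) + 1))
        else:
            reciprocal_ranks.append(0.0)
    return hits / len(eval_set), sum(reciprocal_ranks) / len(eval_set)

# Usage: hit_rate, mrr = hit_rate_and_mrr(labeled_pairs, my_retriever, k=10)
```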
How should I chunk documents for RAG?
Chunking strategy has the biggest impact on RAG quality. Start with semantic chunking (split on paragraph/section boundaries, not arbitrary token counts). For structured data, chunk by table/entity. For long documents, use recursive chunking (split at the paragraph level, then sentence, then token). For code, chunk by function/class. Overlap chunks by 10-20% of their tokens so context isn't lost at boundaries. Test on your domain: chunks that are too large miss specificity, chunks that are too small fragment meaning. The typical winning range is 256-512 tokens per chunk. Use a tool like LangChain's `RecursiveCharacterTextSplitter` or build a custom chunker for your domain, like the sketch below.
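A sketch of such a custom chunker: it packs whole paragraphs up to a size budget and carries a short tail forward as overlap. Word counts stand in for tokens here; swap in a real tokenizer for production use:

```python
def chunk_document(text: str, max_words: int = 400, overlap_words: int = 60) -> list[str]:
    """Greedy paragraph-boundary chunking with overlap. Note: a single paragraph larger
    than max_words passes through unsplit in this sketch."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], []
    for para in paragraphs:
        if current and len(" ".join(current + [para]).split()) > max_words:
            chunks.append("\n\n".join(current))
            # Carry the last ~overlap_words words forward so boundary context isn't lost.
            tail = " ".join(" ".join(current).split()[-overlap_words:])
            current = [tail]
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```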
Should I use vector-only search or hybrid search?
Always hybrid. Vector-only (semantic) search is great for finding similar documents but misses exact matches. Example: if a user asks 'what is the price of plan A?', vector search might return 'plan B pricing' as similar, but it's wrong. Hybrid = vector search (semantic) + BM25 (keyword), then fuse the results. Fusion strategies: (1) simple averaging of scores, (2) RRF (reciprocal rank fusion), (3) a weighted combination (e.g. 60% vector, 40% keyword). Hybrid typically improves quality 20-30%. The cost is negligible: query a vector index and a keyword index in parallel and merge the results in under 10 ms. A sketch of RRF fusion follows.
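A sketch of strategy (2), reciprocal rank fusion, over the two ranked id lists; k=60 is the commonly used damping constant, not something your stack mandates:

```python
def reciprocal_rank_fusion(vector_ids: list[str], keyword_ids: list[str], k: int = 60) -> list[str]:
    """Fuse two ranked result lists: each doc scores sum(1 / (k + rank)) over the lists
    it appears in, so documents ranked highly by both searches rise to the top."""
    scores: dict[str, float] = {}
    for ranked in (vector_ids, keyword_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Usage: fused = reciprocal_rank_fusion(vector_hits, bm25_hits)[:10]
```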
How do I deploy RAG to production?
Three layers: (1) Vector DB: use managed services (Pinecone, Weaviate Cloud) rather than self-hosting, for uptime. (2) LLM: OpenAI API (easiest), Cohere (cheaper), or self-hosted Llama via a serving provider. (3) Orchestration: LangChain/LlamaIndex handle the prompt + retrieval + generation flow. Observability is critical: log every query, retrieval score, and generated answer. Use Langfuse or Arize to track quality. Add rate limiting (10-100 req/s per user). For latency: keep retrieval fast and stream generation; ~800ms end-to-end is achievable. Cache popular queries, since roughly 20% of queries are duplicates. Monitor for drift: if retrieval quality drops more than 10%, alert and investigate. A minimal caching-and-logging sketch follows.
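A minimal sketch of the caching and logging points above; `retrieve` and `generate` are placeholders for your own retrieval and LLM calls, and a real deployment would use Redis and a tracing tool like Langfuse instead of a dict and stdout:

```python
import hashlib
import json
import time

_cache: dict[str, str] = {}  # in-memory stand-in for a shared cache

def _cache_key(query: str) -> str:
    # Normalize lightly so trivially different phrasings of duplicate queries collide.
    return hashlib.sha256(query.strip().lower().encode()).hexdigest()

def answer_query(query: str, retrieve, generate) -> str:
    key = _cache_key(query)
    if key in _cache:                      # duplicates (~20% of traffic) served from cache
        return _cache[key]
    start = time.time()
    docs = retrieve(query)                 # expected: list of (chunk_text, score)
    answer = generate(query, [text for text, _ in docs])
    # Log everything needed to debug quality later: query, scores, answer, latency.
    print(json.dumps({
        "query": query,
        "retrieval_scores": [round(score, 3) for _, score in docs],
        "answer": answer,
        "latency_ms": round((time.time() - start) * 1000),
    }))
    _cache[key] = answer
    return answer
```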
What's agentic RAG and when should I use it?
Agentic RAG = the LLM decides what to retrieve and how many times. Instead of one retrieval step, the agent can: (1) break the question into sub-questions, (2) retrieve for each sub-query, (3) decide if more retrieval is needed, (4) reason over all results. Example: user asks 'What's our strategy for enterprise vs SMB pricing?' Agent: query 'enterprise pricing' → read answer → query 'SMB pricing' → compare → synthesize. It's more powerful but slower (multiple round-trips) and more expensive (more LLM calls). Use agentic RAG for complex, multi-step questions. Use simple RAG for straightforward lookups. In 2026: agentic RAG is becoming standard but still 2-3x slower.
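A rough sketch of that loop; `decompose`, `retrieve`, and `synthesize` are hypothetical helpers you'd implement with your own LLM and search stack:

```python
def agentic_rag(question: str, decompose, retrieve, synthesize, max_rounds: int = 3) -> str:
    """decompose(question) -> sub-questions; retrieve(q) -> chunks;
    synthesize(question, evidence) -> (answer, follow_up_questions)."""
    sub_questions = decompose(question)            # e.g. ["enterprise pricing", "SMB pricing"]
    evidence: list[str] = []
    for _ in range(max_rounds):
        for sub_q in sub_questions:
            evidence.extend(retrieve(sub_q))       # one retrieval per sub-question
        answer, follow_ups = synthesize(question, evidence)  # LLM reasons over all evidence
        if not follow_ups:                         # agent decides no more retrieval is needed
            return answer
        sub_questions = follow_ups                 # otherwise loop with the new sub-queries
    return answer                                  # give up after max_rounds to bound cost/latency
```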