What's the difference between RAG and fine-tuning?
RAG retrieves external documents at query time to augment the LLM's context; fine-tuning updates the model's weights with domain data. RAG is faster to implement, cheaper, and better for frequently changing data (news, product catalogs, customer data). Fine-tuning is better when you need consistent style/behavior changes or have 100+ domain-specific examples. In 2026: 80% of production AI apps use RAG, 15% use fine-tuning, 5% use both. Rule of thumb: start with RAG because it's 10x faster to ship, and switch to fine-tuning only if RAG can't deliver the consistency or behavior you need.
Why do my RAG results have terrible quality?
Quality depends on five things. (1) Chunking strategy: chunks that are too large add noise; too small, and meaning fragments. Test chunk sizes of 256-512 tokens with 10-20% overlap. (2) Embedding model choice: OpenAI text-embedding-3-large beats older models. (3) Hybrid search: vector-only misses exact matches; add BM25 keyword search. (4) Metadata filtering: don't vector-search everything; filter by date/category first. (5) Reranking: use a lightweight reranker (CrossEncoder from sentence-transformers) to re-score retrieved docs, as in the sketch below. Most failures are chunking issues. Fix chunking first before blaming the embedding model.
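A minimal reranking sketch for step (5), assuming candidates were already retrieved by your vector/hybrid search; the model name is one of the standard sentence-transformers cross-encoders, not a requirement:

```python
from sentence_transformers import CrossEncoder

# A cross-encoder scores each (query, passage) pair jointly: too slow for first-pass
# retrieval, but accurate for re-scoring a short candidate list.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    # Score every retrieved chunk against the query, then keep the best top_k.
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]

# Usage: retrieve ~20-50 candidates with hybrid search, rerank, keep the top 5 for the prompt.
```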
How do I evaluate whether my RAG system works?
Separate evaluation into three parts: (1) Retrieval quality: are the right documents returned? Measure Hit Rate and MRR (see the snippet below). (2) Generation quality: given those documents, is the answer correct? Use LLM-as-judge (GPT scores your answer). (3) End-to-end: does the user get a useful answer? Use human evaluation plus production metrics (thumbs up/down). Don't optimize for one metric in isolation: a 100% retrieval score means nothing if generation is hallucinating. Use a framework like RAGAS to track all three.
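Hit Rate and MRR are simple enough to compute yourself while wiring up a full framework. A minimal sketch, assuming you have labeled (query, relevant document id) pairs and your own `retrieve` function (both hypothetical here):

```python
def hit_rate_and_mrr(eval_set, retrieve, k=10):
    """eval_set: list of (query, relevant_doc_id); retrieve(query, k) -> ranked list of doc ids."""
    hits, reciprocal_ranks = 0, []
    for query, relevant_id in eval_set:
        ranked_ids = retrieve(query, k)
        if relevant_id in ranked_ids:
            hits += 1
            # Rank is 1-based: a hit at position 1 contributes 1.0, position 2 contributes 0.5, ...
            reciprocal_ranks.append(1.0 / (ranked_ids.index(relevant_id) + 1))
        else:
            reciprocal_ranks.append(0.0)
    return hits / len(eval_set), sum(reciprocal_ranks) / len(eval_set)

# Usage: hit_rate, mrr = hit_rate_and_mrr(labeled_pairs, my_retriever, k=10)
```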
How should I chunk documents for RAG?
Chunking strategy has the biggest impact on RAG quality. Start with semantic chunking (split on paragraph/section boundaries, not arbitrary token counts). For structured data, chunk by table/entity. For long documents, use recursive chunking (split at the paragraph level, then sentence, then token). For code, chunk by function/class. Overlap chunks by 10-20% of their tokens so context isn't lost at boundaries. Test on your domain: chunks that are too large miss specificity, chunks that are too small fragment meaning. The typical winning range is 256-512 tokens per chunk. Use a tool like LangChain's `RecursiveCharacterTextSplitter` or build a custom chunker for your domain, like the sketch below.
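A sketch of such a custom chunker: it packs whole paragraphs up to a size budget and carries a short tail forward as overlap. Word counts stand in for tokens here; swap in a real tokenizer for production use:

```python
def chunk_document(text: str, max_words: int = 400, overlap_words: int = 60) -> list[str]:
    """Greedy paragraph-boundary chunking with overlap. Note: a single paragraph larger
    than max_words passes through unsplit in this sketch."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], []
    for para in paragraphs:
        if current and len(" ".join(current + [para]).split()) > max_words:
            chunks.append("\n\n".join(current))
            # Carry the last ~overlap_words words forward so boundary context isn't lost.
            tail = " ".join(" ".join(current).split()[-overlap_words:])
            current = [tail]
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```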
Should I use vector-only search or hybrid search?
Always hybrid. Vector-only (semantic) search is great for finding similar documents but misses exact matches. Example: if a user asks 'what is the price of plan A?', vector search might return 'plan B pricing' as similar, but it's wrong. Hybrid = vector search (semantic) + BM25 (keyword), then fuse the results. Fusion strategies: (1) simple averaging of scores, (2) RRF (reciprocal rank fusion), (3) a weighted combination (e.g. 60% vector, 40% keyword). Hybrid typically improves quality 20-30%. The cost is negligible: query a vector index and a keyword index in parallel and merge the results in under 10 ms. A sketch of RRF fusion follows.
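A sketch of strategy (2), reciprocal rank fusion, over the two ranked id lists; k=60 is the commonly used damping constant, not something your stack mandates:

```python
def reciprocal_rank_fusion(vector_ids: list[str], keyword_ids: list[str], k: int = 60) -> list[str]:
    """Fuse two ranked result lists: each doc scores sum(1 / (k + rank)) over the lists
    it appears in, so documents ranked highly by both searches rise to the top."""
    scores: dict[str, float] = {}
    for ranked in (vector_ids, keyword_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Usage: fused = reciprocal_rank_fusion(vector_hits, bm25_hits)[:10]
```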
How do I deploy RAG to production?
Three layers: (1) Vector DB: use managed services (Pinecone, Weaviate Cloud) rather than self-hosting, for uptime. (2) LLM: OpenAI API (easiest), Cohere (cheaper), or self-hosted Llama via a serving provider. (3) Orchestration: LangChain/LlamaIndex handle the prompt + retrieval + generation flow. Observability is critical: log every query, retrieval score, and generated answer. Use Langfuse or Arize to track quality. Add rate limiting (10-100 req/s per user). For latency: keep retrieval fast and stream generation; ~800ms end-to-end is achievable. Cache popular queries, since roughly 20% of queries are duplicates. Monitor for drift: if retrieval quality drops more than 10%, alert and investigate. A minimal caching-and-logging sketch follows.
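A minimal sketch of the caching and logging points above; `retrieve` and `generate` are placeholders for your own retrieval and LLM calls, and a real deployment would use Redis and a tracing tool like Langfuse instead of a dict and stdout:

```python
import hashlib
import json
import time

_cache: dict[str, str] = {}  # in-memory stand-in for a shared cache

def _cache_key(query: str) -> str:
    # Normalize lightly so trivially different phrasings of duplicate queries collide.
    return hashlib.sha256(query.strip().lower().encode()).hexdigest()

def answer_query(query: str, retrieve, generate) -> str:
    key = _cache_key(query)
    if key in _cache:                      # duplicates (~20% of traffic) served from cache
        return _cache[key]
    start = time.time()
    docs = retrieve(query)                 # expected: list of (chunk_text, score)
    answer = generate(query, [text for text, _ in docs])
    # Log everything needed to debug quality later: query, scores, answer, latency.
    print(json.dumps({
        "query": query,
        "retrieval_scores": [round(score, 3) for _, score in docs],
        "answer": answer,
        "latency_ms": round((time.time() - start) * 1000),
    }))
    _cache[key] = answer
    return answer
```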
What's agentic RAG and when should I use it?
Agentic RAG = the LLM decides what to retrieve and how many times. Instead of one retrieval step, the agent can: (1) break the question into sub-questions, (2) retrieve for each sub-query, (3) decide if more retrieval is needed, (4) reason over all results. Example: user asks 'What's our strategy for enterprise vs SMB pricing?' Agent: query 'enterprise pricing' → read answer → query 'SMB pricing' → compare → synthesize. It's more powerful but slower (multiple round-trips) and more expensive (more LLM calls). Use agentic RAG for complex, multi-step questions. Use simple RAG for straightforward lookups. In 2026: agentic RAG is becoming standard but still 2-3x slower.
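A rough sketch of that loop; `decompose`, `retrieve`, and `synthesize` are hypothetical helpers you'd implement with your own LLM and search stack:

```python
def agentic_rag(question: str, decompose, retrieve, synthesize, max_rounds: int = 3) -> str:
    """decompose(question) -> sub-questions; retrieve(q) -> chunks;
    synthesize(question, evidence) -> (answer, follow_up_questions)."""
    sub_questions = decompose(question)            # e.g. ["enterprise pricing", "SMB pricing"]
    evidence: list[str] = []
    for _ in range(max_rounds):
        for sub_q in sub_questions:
            evidence.extend(retrieve(sub_q))       # one retrieval per sub-question
        answer, follow_ups = synthesize(question, evidence)  # LLM reasons over all evidence
        if not follow_ups:                         # agent decides no more retrieval is needed
            return answer
        sub_questions = follow_ups                 # otherwise loop with the new sub-queries
    return answer                                  # give up after max_rounds to bound cost/latency
```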