βΆSystem prompt vs user prompt β when do I use each?
System prompt (defined once per conversation): sets role, tone, constraints, output format. User prompt (per query): the actual task/question. Example: system='You are a Python expert who writes secure, efficient code'; user='Write a function to parse CSV files'. System prompt persists across turns, user prompts change. Pro tip: put guardrails in system prompt, task details in user prompt. System is cheap to reuse; user prompt should vary per query.
βΆChain-of-thought vs ReAct vs multi-turn β what's the difference?
Chain-of-thought (CoT): prompt model to show reasoning steps before answering. 'Let's think step by step.' Improves accuracy on math/logic 30-40%. ReAct (Reasoning + Action): interleave reasoning with tool calls β model decides what tools to use. Multi-turn: conversation history preserved. Use CoT for reasoning tasks, ReAct for tool use, multi-turn for dialogue. CoT costs 2x tokens but 40% accuracy gain. ReAct adds latency (multiple API calls) but solves novel tasks.
βΆRAG vs fine-tuning β when do I retrieve and when do I retrain?
RAG (Retrieval-Augmented Generation): fetch relevant docs, insert into prompt at query time. Fast to update (just add docs), works with any model, costs ~10% more tokens. Fine-tuning: retrain model on your data, slower/expensive ($100-1000s), locked to one model, updates take hours. Use RAG for: document QA, real-time data, frequently changing facts. Use fine-tuning for: style mimicry, domain-specific reasoning, latency-critical. RAG+fine-tune hybrid: retrieve context, fine-tuned model reads it.
βΆHow do I evaluate prompt quality objectively?
Three metrics: (1) accuracy β does it solve the task? Use evals framework (Promptfoo, Braintrust, Evals CLI). (2) Latency β tokens/sec, cost per request. (3) Consistency β same input, same output? Run 10x with temperature=0. Common eval patterns: classification (exact match), generation (BLEU/ROUGE), reasoning (trace-based assertions). Never trust 'feels better'. Build a test harness with 20-50 examples, measure before/after prompt change. Tool: Promptfoo makes this 10 lines of YAML.
βΆWhat agentic patterns exist and when do I use each?
ReAct (reasoning + action loops): agent reasons, decides action, observes result, repeats. Use for: multi-step tasks, tool use. Plan-execute: agent plans steps, then executes. Use for: complex workflows, need visibility into plan. Tree-of-thought: explore multiple reasoning paths, prune low-value branches. Use for: hard reasoning, when wrong answer is costly. Hierarchical agents: manager agent delegates to specialist sub-agents. Use for: modular systems. Most robust: ReAct + tool use + human-in-the-loop for review steps.
βΆJailbreaks, prompt injection, adversarial prompts β how do I defend against them?
Jailbreak: bypass safety guardrails via clever phrasing ('assume you're a character whoβ¦'). Defense: use system prompt with firm boundaries ('You will notβ¦' is weaker than 'Your role preventsβ¦'). Prompt injection: user input pollutes instructions. Defense: (1) separate user input from system instructions (use API parameters, not concatenation), (2) XML tags to mark input boundaries, (3) validation on output. Adversarial: user tries to trigger wrong behavior. Defense: test with adversarial examples in evals, log failures, add guardrails. Rule: never trust user input in the prompt directly β always escape or parameterize.
βΆHow do I choose between GPT-4, Claude, Llama, and specialized models?
GPT-4: strongest reasoning, best for code/math, most expensive ($0.03-$0.06/1K tokens). Claude 3.5 Sonnet: excellent context window (200k), strong summarization, middle cost (~$0.003/1K). Llama 3.1: open, runs locally, weaker reasoning. Specialized: medical LLMs for healthcare, legal LLMs for contracts. Rule: start with Claude or GPT-4o for prototyping, measure evals, switch to cheaper model if it matches baseline. Avoid 'best model' fallacy β context window + latency often matter more than raw reasoning. Use OpenRouter or LiteLLM to swap easily.