▶ Transformers vs RNNs/LSTMs – why did transformers win?
RNNs (LSTM, GRU) process sequences one token at a time, so training can't be parallelized across the sequence, and they struggle to retain long-range context. Transformers (BERT, GPT) process all tokens in parallel via self-attention, capture long-range dependencies directly, and train 10-100x faster. "Attention Is All You Need" (Vaswani et al., 2017) introduced the architecture; by 2020, transformers were the industry standard. Learning RNNs = learning history; building production NLP = transformers only.
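The "all tokens in parallel" claim is visible in code: self-attention is a handful of matrix multiplications over the whole sequence at once, with no loop over time steps. A minimal NumPy sketch (random weights, not a trained model):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a whole sequence at once.

    X: (seq_len, d_model) token embeddings. Every row is processed in
    parallel by the matrix multiplications below -- no recurrence.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv            # project all tokens at once
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)             # all-pairs token affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                          # context-mixed representations

rng = np.random.default_rng(0)
seq_len, d_model = 6, 8
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (6, 8): one context vector per token
```

Contrast with an RNN, which would need a sequential loop of six dependent steps to produce the same six output rows.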
▶ BERT vs GPT – which should I learn first?
BERT = bidirectional, pretrained on masked language modeling, best for classification/understanding tasks (sentiment, entity extraction, Q&A). GPT = unidirectional (left-to-right), pretrained on causal language modeling, best for generation (chatbots, summarization, translation). Learn BERT first (conceptually simpler, https://huggingface.co/course/chapter1), then GPT. In 2026: fine-tune BERT for custom classifiers, use GPT for chat/generation via API.
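The bidirectional vs unidirectional distinction comes down to the attention mask: BERT lets every token attend to every position, while GPT masks out future positions so generation can proceed left to right. A toy NumPy illustration (masks only, not library internals):

```python
import numpy as np

seq_len = 4

# BERT-style (bidirectional): every token may attend to every position,
# so masked-LM pretraining can use both left and right context.
bert_mask = np.ones((seq_len, seq_len), dtype=int)

# GPT-style (causal): token i may only attend to positions <= i,
# which is what makes left-to-right generation possible.
gpt_mask = np.tril(np.ones((seq_len, seq_len), dtype=int))

print(bert_mask)
print(gpt_mask)
```

Row i of each mask says which positions token i may look at; the strictly zero upper triangle in the GPT mask is the entire difference.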
▶ Fine-tuning vs RAG (Retrieval-Augmented Generation) vs prompt engineering – when to use each?
Prompt engineering = free, fast, no training (try first). RAG = retrieve relevant documents from a vector database, feed to LLM context, good for knowledge-intensive tasks (Q&A over company docs), no retraining. Fine-tuning = expensive (GPUs), slow (hours-days), but fits the model to your data/style. Pick: (1) try prompt engineering, (2) if context window insufficient, add RAG, (3) if LLM still fails, fine-tune. 80% of use cases = prompt engineering + RAG.
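The RAG step can be sketched in a few lines: embed the documents, retrieve the top-k most similar to the query, and stuff them into the prompt. Here a toy bag-of-words embedding stands in for a real embedding model (e.g. sentence-transformers); the structure, not the embedding, is the point:

```python
import numpy as np

def tokenize(text):
    return [w.strip(".,?!") for w in text.lower().split()]

docs = [
    "Refunds are processed within 5 business days.",
    "Our office is closed on public holidays.",
    "Support is available 24/7 via chat.",
]
vocab = sorted({w for d in docs for w in tokenize(d)})

def embed(text):
    """Toy bag-of-words embedding over the corpus vocabulary.
    A real pipeline would call an embedding model here."""
    toks = tokenize(text)
    v = np.array([float(toks.count(w)) for w in vocab])
    n = np.linalg.norm(v)
    return v / n if n else v

doc_vecs = np.stack([embed(d) for d in docs])

def retrieve(query, k=1):
    sims = doc_vecs @ embed(query)  # cosine similarity: vectors are unit-norm
    return [docs[i] for i in np.argsort(sims)[::-1][:k]]

question = "How long do refunds take?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)
```

Swapping `embed` for a real encoder and `docs`/`doc_vecs` for a vector database gives the production shape; the LLM then answers from `prompt` with no retraining.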
▶ How do I host/deploy an LLM myself vs using an API?
API (OpenAI, Anthropic) = $0.01-0.1 per 1k tokens, no infra to manage. Self-host small LLMs (Llama 7B, Mistral) = $0.50-5/hour GPU, latency 100-500ms, full control, privacy. Rule: API for prototyping and scaling (chat, content generation), self-host for privacy-critical apps (healthcare, finance) or if volume > 10M tokens/month. 2026 trend: smaller specialized models (Mistral AI) on your infrastructure, not giant models via API.
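The 10M tokens/month rule is just crossover arithmetic. A sketch using illustrative numbers from the ranges above (top of the per-token range, a $1.50/hour always-on GPU; real vendor pricing varies):

```python
def monthly_api_cost(tokens_per_month, price_per_1k=0.10):
    """Pay-per-token API cost (illustrative price, not current vendor pricing)."""
    return tokens_per_month / 1000 * price_per_1k

def monthly_selfhost_cost(gpu_price_per_hour=1.50, hours=730):
    """Always-on single-GPU cost for a 730-hour month."""
    return gpu_price_per_hour * hours

for volume in (1e6, 10e6, 100e6):
    api = monthly_api_cost(volume)
    hosted = monthly_selfhost_cost()
    better = "API" if api < hosted else "self-host"
    print(f"{volume/1e6:>5.0f}M tokens/month: "
          f"API ${api:,.0f} vs self-host ${hosted:,.0f} -> {better}")
```

With these assumed prices the break-even lands around 11M tokens/month, consistent with the >10M rule of thumb; rerun with your actual quotes before deciding.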
▶ Vector databases and embeddings – Pinecone vs Weaviate vs building DIY?
Embeddings = convert text to numbers (typically 384-1536 dims), enable semantic search. Pinecone = managed (easiest, $0.10-1/month), Weaviate = self-host (free, complex), DIY = index with NumPy (only for <10k docs). For production: Pinecone if budget available, Weaviate if on-premise required, DIY only for prototypes. A common default stack: sentence-transformers (all-MiniLM-L6-v2, 384-dim) for encoding and cosine similarity for retrieval.
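The DIY tier really is ~20 lines of NumPy. A brute-force cosine-similarity index sketch (random vectors stand in for real embeddings; in practice they would come from a sentence-transformers encoder):

```python
import numpy as np

class DIYIndex:
    """Brute-force cosine-similarity index; fine below ~10k docs."""

    def __init__(self, dim):
        self.dim = dim
        self.vecs = np.empty((0, dim))
        self.ids = []

    def add(self, doc_id, vec):
        vec = np.asarray(vec, dtype=float)
        self.vecs = np.vstack([self.vecs, vec / np.linalg.norm(vec)])
        self.ids.append(doc_id)            # store unit vectors + ids

    def query(self, vec, k=3):
        vec = np.asarray(vec, dtype=float)
        vec = vec / np.linalg.norm(vec)
        sims = self.vecs @ vec             # dot product == cosine (unit vecs)
        top = np.argsort(sims)[::-1][:k]
        return [(self.ids[i], float(sims[i])) for i in top]

rng = np.random.default_rng(42)
index = DIYIndex(dim=384)                  # all-MiniLM-L6-v2 outputs 384 dims
for i in range(100):
    index.add(f"doc-{i}", rng.normal(size=384))
query_vec = rng.normal(size=384)
hits = index.query(query_vec, k=3)
```

Past a few tens of thousands of documents, the O(n) scan per query is what Pinecone/Weaviate replace with approximate nearest-neighbor indexes.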
▶ Multilingual NLP – how hard is it to support multiple languages?
Monolingual models (English BERT) fail on other languages. Solutions: (1) multilingual BERT (mBERT, XLM-RoBERTa) for 100+ langs, lower quality, (2) language-specific models (French BERT, German BERT) for top langs, better quality, (3) translate to English (lossy but works). Recommendation: mBERT for MVP, switch to a language-specific model for each supported language in production. Don't try to build a language-universal model from scratch; use existing multilingual checkpoints from Hugging Face.
▶ How do I evaluate NLP models – metrics beyond accuracy?
Classification: precision/recall/F1 (for imbalanced data), ROC-AUC (ranking quality). Generation (summarization, translation): ROUGE (recall-oriented n-gram overlap), BLEU (n-gram precision), human evaluation (expensive). Token classification (NER): micro/macro F1 (per-token). Semantic similarity: cosine similarity, correlation with human judgments. NEVER use accuracy alone for text tasks – almost all are imbalanced. For LLMs: use LLM-as-judge (ask GPT to score quality) for generation tasks.
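The "never accuracy alone" warning in one toy computation: on a 90% negative dataset, predicting all-negative scores 90% accuracy yet F1 = 0. Pure Python, binary labels:

```python
def precision_recall_f1(y_true, y_pred):
    """Binary precision/recall/F1 from scratch (positive class = 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Imbalanced example: 10 positives, 90 negatives. The all-negative
# "model" looks great by accuracy and useless by F1.
y_true = [1] * 10 + [0] * 90
y_pred = [0] * 100
acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
p, r, f1 = precision_recall_f1(y_true, y_pred)
print(acc, p, r, f1)  # 0.9 0.0 0.0 0.0
```

In practice you would call a library (e.g. scikit-learn) rather than hand-roll this, but the arithmetic above is all those functions do for the binary case.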