▶Batch inference vs real-time API serving – when do I use each?
Batch: large volumes of data processed together (daily/hourly), latency-tolerant, cost-optimized (GPU time shared across many predictions). Use for analytics, predictions over data dumps, daily email personalization. Real-time API: sub-second responses, one prediction per request, cost scales with throughput. Use for chatbots, recommendations, fraud detection. Hybrid: real-time for the hot path (user-facing), batch for the cold path (reports, nightly jobs). Latency SLA <100ms → API; SLA of a minute or more → batch.
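A minimal sketch of both patterns against one model artifact, assuming a scikit-learn-style model saved with joblib and FastAPI for the hot path; the paths, feature names, and `/predict` route are illustrative, not a fixed API.

```python
# Sketch of both serving patterns with one model artifact. Assumes a
# scikit-learn-style model saved with joblib; paths, feature names, and the
# route are illustrative.
import joblib
import pandas as pd
from fastapi import FastAPI

model = joblib.load("model.joblib")
FEATURES = ["feature_a", "feature_b"]  # hypothetical feature columns

# Batch (cold path): score a whole dump at once; latency-tolerant, cost-amortized.
def run_batch(input_path: str, output_path: str) -> None:
    df = pd.read_parquet(input_path)            # e.g. last night's event dump
    df["score"] = model.predict(df[FEATURES])   # one model load, many rows
    df.to_parquet(output_path)

# Real-time (hot path): one prediction per request, sub-second SLA.
app = FastAPI()

@app.post("/predict")
def predict(payload: dict) -> dict:
    row = pd.DataFrame([payload])[FEATURES]
    return {"score": float(model.predict(row)[0])}
```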
▶GPU vs CPU serving – cost and performance trade-offs?
GPU: 10-100x faster for ML, $1-5 per hour. CPU: 10-100x cheaper, 10-100x slower. Decision: if latency <100ms or throughput >10k req/sec → GPU likely wins despite cost. For CPU-friendly models (linear, tree-based, <100KB), CPU is often sufficient. Quantization (int8) can make models that would otherwise need a GPU run fast enough on CPU. Batch mode: GPU amortizes cost across many predictions. Real-time: GPU cost per inference matters.
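Where CPU is borderline, int8 dynamic quantization is usually the cheapest experiment. A sketch using PyTorch's `quantize_dynamic` on a stand-in model; the layer sizes are illustrative and the actual speedup has to be measured on your own model and hardware.

```python
# Sketch: int8 dynamic quantization to push a GPU-sized model onto CPU.
# The model here is a stand-in; measure latency before/after on real inputs.
import time
import torch
import torch.nn as nn

model = nn.Sequential(                       # stand-in for your real model
    nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512)
).eval()

quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8    # only Linear layers get int8 weights
)

x = torch.randn(1, 512)
for name, m in [("fp32", model), ("int8", quantized)]:
    with torch.no_grad():
        start = time.perf_counter()
        for _ in range(100):
            m(x)
    # total seconds * 1000 ms / 100 iterations = ms per request
    print(name, f"{(time.perf_counter() - start) * 10:.2f} ms/request")
```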
▶How do I A/B test model versions in production?
Canary deployments: route 5-10% of traffic to v2, 90% to v1, monitor metrics. If v2 wins, shift traffic gradually (10% → 25% → 50% → 100%). Roll back in <5 min on regression. Shadow traffic: send 100% to both, serve v1 but log v2's predictions, compare metrics offline first. Multi-armed bandit: shift traffic automatically toward the better-performing version; segment-based routing (new users on v2, power users on v1) is a related option. All require feature flags + a request-routing layer (Seldon, Istio, Lambda aliases).
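A hypothetical sketch of the routing piece: deterministic per-user bucketing behind a flag, so a given user always lands on the same version and the canary percentage changes through config alone. The percentage and version names are illustrative; in practice this usually lives in the proxy or mesh layer.

```python
# Sketch of a canary router: stable per-user bucketing via hashing, so the
# same user always sees the same model version during the experiment.
import hashlib

CANARY_PERCENT = 10  # start at 5-10%, then 25 -> 50 -> 100 if metrics hold

def route(user_id: str) -> str:
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "model-v2" if bucket < CANARY_PERCENT else "model-v1"

# Rollback = set CANARY_PERCENT to 0 in config; no redeploy needed.
```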
▶Blue-green model deployment – zero-downtime switching?
Deploy v2 alongside v1 (both active, but only v1 receives traffic). Run smoke tests against v2 in parallel. Once healthy, switch the load balancer to v2 instantly. If issues appear, roll back to v1 in <1 min. Requires: two independent model servers, a shared database/cache layer, config-based routing. Container orchestration (Kubernetes) handles this natively. Cost: 2x infrastructure during the switchover. Alternatives: canary (slower, cheaper) or shadow (no downtime but delayed feedback).
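A minimal sketch of config-driven blue/green routing done in application code (in practice the load balancer or mesh does this); the backend URLs and config file are assumptions for illustration.

```python
# Sketch of config-based blue/green switching: the router reads the active
# color from config on every request, so cutover (and rollback) is a config
# change, not a redeploy. URLs and config source are illustrative.
import json
import urllib.request

BACKENDS = {
    "blue": "http://model-blue:8080/predict",    # current production (v1)
    "green": "http://model-green:8080/predict",  # new version (v2), smoke-tested first
}

def active_backend(config_path: str = "routing.json") -> str:
    with open(config_path) as f:
        return BACKENDS[json.load(f)["active"]]  # {"active": "blue"} or "green"

def predict(payload: bytes) -> bytes:
    req = urllib.request.Request(
        active_backend(), data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=2) as resp:
        return resp.read()
```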
▶Model versioning strategies – which approach for production?
Semantic versioning (v1.0.0 = major.minor.patch): major = breaking changes, minor = new features, patch = bug fixes. Model registry (MLflow, Hugging Face Hub): one source of truth for artifacts + metrics + lineage. Container tagging (myrepo/ml-model:v1.0.0-sha256): immutable, reproducible. For critical models: include training date, dataset hash, and hyperparams in the metadata. Rollback strategy: always keep the N-1 version alive; switch via config, not redeployment.
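A small sketch of the lineage-metadata point: hash the training data and write it alongside the artifact so a registry entry or container tag can be traced back to exactly what produced it. The field names and file paths are illustrative.

```python
# Sketch: record lineage metadata (training date, dataset hash, hyperparams)
# next to the model artifact; any registry or container tag can reference it.
import hashlib
import json
from datetime import datetime, timezone

def file_sha256(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

metadata = {
    "model_version": "1.3.0",                          # illustrative version
    "trained_at": datetime.now(timezone.utc).isoformat(),
    "dataset_sha256": file_sha256("train.parquet"),    # illustrative path
    "hyperparams": {"learning_rate": 0.05, "max_depth": 8},
}

with open("model_metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```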
▶What are latency budgets and how do I measure them?
Latency budget = how the SLA is split across stages: if the SLA is 'respond in 100ms', allocate 10ms to preprocessing, 70ms to model inference, 20ms to postprocessing + network. Profile each stage: `timeit(preprocess)`, TensorBoard for the inference-time breakdown, `perf` for system calls. Monitor p50/p95/p99 in production (not just the mean). For streaming/real-time: hold p99 < SLA (tail latency matters). If over budget: quantize the model, prune layers, batch requests, or upgrade hardware.
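A sketch of per-stage timing with percentile reporting; the three stage functions are placeholders standing in for real preprocessing, inference, and postprocessing, and the loop stands in for real traffic.

```python
# Sketch: measure per-stage latency and report p50/p95/p99 instead of the mean.
import time
import numpy as np

def preprocess(x): time.sleep(0.002); return x    # placeholder ~2 ms stage
def infer(x): time.sleep(0.015); return x         # placeholder ~15 ms stage
def postprocess(x): time.sleep(0.003); return x   # placeholder ~3 ms stage

def timed(fn, *args):
    start = time.perf_counter()
    out = fn(*args)
    return out, (time.perf_counter() - start) * 1000  # elapsed ms

samples = {"preprocess": [], "inference": [], "postprocess": []}
for _ in range(200):                               # stand-in for real requests
    x, t1 = timed(preprocess, None)
    y, t2 = timed(infer, x)
    _, t3 = timed(postprocess, y)
    for stage, t in zip(samples, (t1, t2, t3)):
        samples[stage].append(t)

for stage, ts in samples.items():
    p50, p95, p99 = np.percentile(ts, [50, 95, 99])
    print(f"{stage}: p50={p50:.1f}ms  p95={p95:.1f}ms  p99={p99:.1f}ms")
```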
▶Auto-scaling spikes – handling traffic surges without crashing?
Horizontal scaling: add replicas when load spikes (Kubernetes HPA, AWS Auto Scaling groups behind an ALB). Trigger metrics: CPU >70%, memory >80%, or a custom metric (request queue length). Cold-start mitigation: keep a minimum number of replicas warm. Vertical scaling: bigger instances if the bottleneck is a single process (but you hit the ceiling fast). Queue pattern: enqueue requests, process asynchronously, return a job ID for polling. For ML: GPU scaling is slower than CPU (minutes to provision), so min replicas must cover baseline + ~50% headroom. Circuit breaker: reject requests gracefully when capacity is exhausted (return an error, not a timeout).
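A sketch of the queue + circuit-breaker combination using FastAPI and an in-process asyncio queue (a real deployment would usually put the queue outside the process, e.g. SQS or Redis); the queue size, endpoint, and job-ID handling are illustrative.

```python
# Sketch: a bounded queue absorbs bursts; when it is full the server sheds
# load with an immediate 503 instead of letting requests pile up and time out.
import asyncio
import uuid
from fastapi import FastAPI, HTTPException

app = FastAPI()
queue: asyncio.Queue = asyncio.Queue(maxsize=100)  # bound = capacity you provisioned

@app.on_event("startup")
async def start_worker() -> None:
    async def worker():
        while True:
            job_id, payload = await queue.get()
            # run inference here and store the result keyed by job_id
            queue.task_done()
    asyncio.create_task(worker())

@app.post("/predict")
async def predict(payload: dict) -> dict:
    job_id = str(uuid.uuid4())
    try:
        queue.put_nowait((job_id, payload))            # fail fast instead of blocking
    except asyncio.QueueFull:                          # circuit breaker: shed load
        raise HTTPException(status_code=503, detail="at capacity, retry later")
    return {"job_id": job_id, "status": "queued"}      # client polls for the result
```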