▶Batch inference vs real-time API serving – when do I use each?
Batch: large volumes of data processed together (daily/hourly), latency-tolerant, cost-optimized (GPU time shared across many predictions). Use for analytics, predictions over data dumps, daily email personalization. Real-time API: sub-second responses, one prediction per request, cost scales with throughput. Use for chatbots, recommendations, fraud detection. Hybrid: real-time for the hot path (user-facing), batch for the cold path (reports, nightly jobs). Latency SLA <100ms → API; SLA of a minute or more → batch.
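A minimal sketch of both patterns against one model artifact, assuming a scikit-learn-style model saved with joblib and FastAPI for the hot path; the paths, feature names, and `/predict` route are illustrative, not a fixed API.

```python
# Sketch of both serving patterns with one model artifact. Assumes a
# scikit-learn-style model saved with joblib; paths, feature names, and the
# route are illustrative.
import joblib
import pandas as pd
from fastapi import FastAPI

model = joblib.load("model.joblib")
FEATURES = ["feature_a", "feature_b"]  # hypothetical feature columns

# Batch (cold path): score a whole dump at once; latency-tolerant, cost-amortized.
def run_batch(input_path: str, output_path: str) -> None:
    df = pd.read_parquet(input_path)            # e.g. last night's event dump
    df["score"] = model.predict(df[FEATURES])   # one model load, many rows
    df.to_parquet(output_path)

# Real-time (hot path): one prediction per request, sub-second SLA.
app = FastAPI()

@app.post("/predict")
def predict(payload: dict) -> dict:
    row = pd.DataFrame([payload])[FEATURES]
    return {"score": float(model.predict(row)[0])}
```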
▶GPU vs CPU serving – cost and performance trade-offs?
GPU: 10-100x faster for ML, $1-5 per hour. CPU: 10-100x cheaper, 10-100x slower. Decision: if latency <100ms or throughput >10k req/sec → GPU likely wins despite cost. For CPU-friendly models (linear, tree-based, <100KB), CPU is often sufficient. Quantization (int8) can make models that would otherwise need a GPU run fast enough on CPU. Batch mode: GPU amortizes cost across many predictions. Real-time: GPU cost per inference matters.
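Where CPU is borderline, int8 dynamic quantization is usually the cheapest experiment. A sketch using PyTorch's `quantize_dynamic` on a stand-in model; the layer sizes are illustrative and the actual speedup has to be measured on your own model and hardware.

```python
# Sketch: int8 dynamic quantization to push a GPU-sized model onto CPU.
# The model here is a stand-in; measure latency before/after on real inputs.
import time
import torch
import torch.nn as nn

model = nn.Sequential(                       # stand-in for your real model
    nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512)
).eval()

quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8    # only Linear layers get int8 weights
)

x = torch.randn(1, 512)
for name, m in [("fp32", model), ("int8", quantized)]:
    with torch.no_grad():
        start = time.perf_counter()
        for _ in range(100):
            m(x)
    # total seconds * 1000 ms / 100 iterations = ms per request
    print(name, f"{(time.perf_counter() - start) * 10:.2f} ms/request")
```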
▶How do I A/B test model versions in production?
Canary deployments: route 5-10% of traffic to v2, 90% to v1, monitor metrics. If v2 wins, shift traffic gradually (10% → 25% → 50% → 100%). Roll back in <5 min on regression. Shadow traffic: send 100% to both, serve v1 but log v2's predictions, compare metrics offline first. Multi-armed bandit: shift traffic automatically toward the better-performing version; segment-based routing (new users on v2, power users on v1) is a related option. All require feature flags + a request-routing layer (Seldon, Istio, Lambda aliases).
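A hypothetical sketch of the routing piece: deterministic per-user bucketing behind a flag, so a given user always lands on the same version and the canary percentage changes through config alone. The percentage and version names are illustrative; in practice this usually lives in the proxy or mesh layer.

```python
# Sketch of a canary router: stable per-user bucketing via hashing, so the
# same user always sees the same model version during the experiment.
import hashlib

CANARY_PERCENT = 10  # start at 5-10%, then 25 -> 50 -> 100 if metrics hold

def route(user_id: str) -> str:
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "model-v2" if bucket < CANARY_PERCENT else "model-v1"

# Rollback = set CANARY_PERCENT to 0 in config; no redeploy needed.
```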
▶Blue-green model deployment – zero-downtime switching?
Deploy v2 alongside v1 (both active, but only v1 receives traffic). Run smoke tests against v2 in parallel. Once healthy, switch the load balancer to v2 instantly. If issues appear, roll back to v1 in <1 min. Requires: two independent model servers, a shared database/cache layer, config-based routing. Container orchestration (Kubernetes) handles this natively. Cost: 2x infrastructure during the switchover. Alternatives: canary (slower, cheaper) or shadow (no downtime but delayed feedback).
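A minimal sketch of config-driven blue/green routing done in application code (in practice the load balancer or mesh does this); the backend URLs and config file are assumptions for illustration.

```python
# Sketch of config-based blue/green switching: the router reads the active
# color from config on every request, so cutover (and rollback) is a config
# change, not a redeploy. URLs and config source are illustrative.
import json
import urllib.request

BACKENDS = {
    "blue": "http://model-blue:8080/predict",    # current production (v1)
    "green": "http://model-green:8080/predict",  # new version (v2), smoke-tested first
}

def active_backend(config_path: str = "routing.json") -> str:
    with open(config_path) as f:
        return BACKENDS[json.load(f)["active"]]  # {"active": "blue"} or "green"

def predict(payload: bytes) -> bytes:
    req = urllib.request.Request(
        active_backend(), data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=2) as resp:
        return resp.read()
```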
▶Model versioning strategies – which approach for production?
Semantic versioning (v1.0.0 = major.minor.patch): major = breaking changes, minor = new features, patch = bug fixes. Model registry (MLflow, Hugging Face Hub): one source of truth for artifacts + metrics + lineage. Container tagging (myrepo/ml-model:v1.0.0-sha256): immutable, reproducible. For critical models: include training date, dataset hash, and hyperparams in the metadata. Rollback strategy: always keep the N-1 version alive; switch via config, not redeployment.
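A small sketch of the lineage-metadata point: hash the training data and write it alongside the artifact so a registry entry or container tag can be traced back to exactly what produced it. The field names and file paths are illustrative.

```python
# Sketch: record lineage metadata (training date, dataset hash, hyperparams)
# next to the model artifact; any registry or container tag can reference it.
import hashlib
import json
from datetime import datetime, timezone

def file_sha256(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

metadata = {
    "model_version": "1.3.0",                          # illustrative version
    "trained_at": datetime.now(timezone.utc).isoformat(),
    "dataset_sha256": file_sha256("train.parquet"),    # illustrative path
    "hyperparams": {"learning_rate": 0.05, "max_depth": 8},
}

with open("model_metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```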
▶What are latency budgets and how do I measure them?
Latency budget = how the SLA is split across stages: if the SLA is 'respond in 100ms', allocate 10ms to preprocessing, 70ms to model inference, 20ms to postprocessing + network. Profile each stage: `timeit(preprocess)`, TensorBoard for the inference-time breakdown, `perf` for system calls. Monitor p50/p95/p99 in production (not just the mean). For streaming/real-time: hold p99 < SLA (tail latency matters). If over budget: quantize the model, prune layers, batch requests, or upgrade hardware.
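A sketch of per-stage timing with percentile reporting; the three stage functions are placeholders standing in for real preprocessing, inference, and postprocessing, and the loop stands in for real traffic.

```python
# Sketch: measure per-stage latency and report p50/p95/p99 instead of the mean.
import time
import numpy as np

def preprocess(x): time.sleep(0.002); return x    # placeholder ~2 ms stage
def infer(x): time.sleep(0.015); return x         # placeholder ~15 ms stage
def postprocess(x): time.sleep(0.003); return x   # placeholder ~3 ms stage

def timed(fn, *args):
    start = time.perf_counter()
    out = fn(*args)
    return out, (time.perf_counter() - start) * 1000  # elapsed ms

samples = {"preprocess": [], "inference": [], "postprocess": []}
for _ in range(200):                               # stand-in for real requests
    x, t1 = timed(preprocess, None)
    y, t2 = timed(infer, x)
    _, t3 = timed(postprocess, y)
    for stage, t in zip(samples, (t1, t2, t3)):
        samples[stage].append(t)

for stage, ts in samples.items():
    p50, p95, p99 = np.percentile(ts, [50, 95, 99])
    print(f"{stage}: p50={p50:.1f}ms  p95={p95:.1f}ms  p99={p99:.1f}ms")
```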
▶Auto-scaling spikes – handling traffic surges without crashing?
Horizontal scaling: add replicas when load spikes (Kubernetes HPA, AWS Auto Scaling groups behind an ALB). Trigger metrics: CPU >70%, memory >80%, or a custom metric (request queue length). Cold-start mitigation: keep a minimum number of replicas warm. Vertical scaling: bigger instances if the bottleneck is a single process (but you hit the ceiling fast). Queue pattern: enqueue requests, process asynchronously, return a job ID for polling. For ML: GPU scaling is slower than CPU (minutes to provision), so min replicas must cover baseline + ~50% headroom. Circuit breaker: reject requests gracefully when capacity is exhausted (return an error, not a timeout).
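A sketch of the queue + circuit-breaker combination using FastAPI and an in-process asyncio queue (a real deployment would usually put the queue outside the process, e.g. SQS or Redis); the queue size, endpoint, and job-ID handling are illustrative.

```python
# Sketch: a bounded queue absorbs bursts; when it is full the server sheds
# load with an immediate 503 instead of letting requests pile up and time out.
import asyncio
import uuid
from fastapi import FastAPI, HTTPException

app = FastAPI()
queue: asyncio.Queue = asyncio.Queue(maxsize=100)  # bound = capacity you provisioned

@app.on_event("startup")
async def start_worker() -> None:
    async def worker():
        while True:
            job_id, payload = await queue.get()
            # run inference here and store the result keyed by job_id
            queue.task_done()
    asyncio.create_task(worker())

@app.post("/predict")
async def predict(payload: dict) -> dict:
    job_id = str(uuid.uuid4())
    try:
        queue.put_nowait((job_id, payload))            # fail fast instead of blocking
    except asyncio.QueueFull:                          # circuit breaker: shed load
        raise HTTPException(status_code=503, detail="at capacity, retry later")
    return {"job_id": job_id, "status": "queued"}      # client polls for the result
```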