
Model Deployment & Serving

TIER 2 · Tech

Salary impact: High
Time to learn: 6 months
Difficulty: Hard
Careers: 11
TL;DR

Model Deployment is the bridge between data science and production: taking trained models and serving them via REST APIs, batch systems, or edge devices with sub-100ms latency and auto-scaling. Career path: ML Engineer L1 (FastAPI/Docker, $115-140k) → Senior (TorchServe/TensorFlow Serving, $140-175k) → Staff (multi-model A/B testing, real-time streaming, $175-225k) over 5-7 months. Critical for ML careers: 60% of ML projects fail at deployment. Involves model serialization (ONNX, pickle), containerization (Docker, Kubernetes), and cloud platforms (AWS SageMaker, Vertex AI, Replicate). Lives next to MLOps, DevOps, and Kubernetes.

What is Model Deployment & Serving

Deploy ML models to production via APIs, batch inference, or edge devices, optimizing for latency, throughput, and cost. A critical skill bridging data science and engineering. Learning curve: Medium-Hard (ML + backend + infrastructure).
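The TL;DR mentions model serialization (pickle, ONNX) as the first step of deployment: a trained model is saved as an artifact, then loaded once at server startup and reused per request. A minimal sketch with the stdlib `pickle` module, using a toy hand-rolled linear model as a stand-in for a real trained model:

```python
import pickle

# Toy stand-in for a trained model (y = 2x + 1). In practice this would be
# a scikit-learn or PyTorch model; the serialize/load pattern is the same.
class LinearModel:
    def __init__(self, weight, bias):
        self.weight = weight
        self.bias = bias

    def predict(self, x):
        return self.weight * x + self.bias

model = LinearModel(weight=2.0, bias=1.0)

# Serialize the trained model (on disk this would be a model.pkl artifact).
blob = pickle.dumps(model)

# At serving time: load the artifact once at startup, reuse for every request.
served = pickle.loads(blob)
print(served.predict(3.0))  # 7.0
```

Note that pickle is Python-only and unsafe to load from untrusted sources; ONNX exists precisely to make the artifact portable across runtimes.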

🔧 TOOLS & ECOSYSTEM
BentoML · Triton Inference Server · TorchServe · TensorFlow Serving · Seldon · FastAPI · Modal · Replicate · AWS SageMaker · Vertex AI · Cog by Replicate · KServe

💰 Salary by region

Region | Junior | Mid | Senior
USA | $115k | $165k | $235k
UK | £70k | £110k | £150k
EU | €75k | €120k | €165k
Canada | C$120k | C$180k | C$240k

❓ FAQ

Batch inference vs real-time API serving — when do I use each?
Batch: large volumes processed together (hourly/daily), latency-tolerant, cost-optimized (GPU time is shared across the whole job). Use for analytics, scoring data dumps, daily email personalization. Real-time API: sub-second responses, one prediction per request, capacity provisioned for peak throughput. Use for chatbots, recommendations, fraud detection. Hybrid: real-time for the hot path (user-facing), batch for the cold path (reports, nightly jobs). Rule of thumb: latency SLA under 100ms → API; SLA of a minute or more → batch.
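The cost advantage of batch mode comes from amortizing per-request overhead: a batch job walks the whole dataset in GPU-sized chunks instead of paying startup and network cost once per record. A minimal chunking sketch (the batch size of 4 is illustrative):

```python
def batch_iter(items, batch_size):
    """Yield successive fixed-size batches from a list of pending requests."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

# A nightly batch job scores all pending records in chunks sized to the GPU,
# rather than paying per-request overhead thousands of times.
pending = list(range(10))
batches = list(batch_iter(pending, batch_size=4))
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```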
GPU vs CPU serving — cost and performance trade-offs?
GPU: 10-100x faster for ML inference, roughly $1-5 per hour. CPU: far cheaper per hour, but correspondingly slower. Decision: if latency must stay under 100ms or throughput exceeds ~10k req/sec, GPU likely wins despite the cost. For CPU-friendly models (linear, tree-based, under ~100KB), CPU is often sufficient. Quantization (int8) can make GPU-sized models run acceptably fast on CPU. Batch mode: the GPU amortizes its cost across many predictions. Real-time: per-inference GPU cost matters.
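The GPU-vs-CPU decision above reduces to cost per inference at your achievable throughput, not cost per hour. A back-of-envelope sketch; the hourly prices and throughputs below are assumed for illustration, not quotes:

```python
def cost_per_million(hourly_price_usd, throughput_req_per_sec):
    """USD cost to serve one million inferences at full utilization."""
    seconds_needed = 1_000_000 / throughput_req_per_sec
    return hourly_price_usd * seconds_needed / 3600

# Assumed example: a $2.50/hr GPU sustaining 2,000 req/s
# vs a $0.10/hr CPU instance sustaining 50 req/s.
gpu = cost_per_million(2.50, 2000)
cpu = cost_per_million(0.10, 50)
print(round(gpu, 3), round(cpu, 3))  # 0.347 0.556
```

With these (assumed) numbers the "expensive" GPU is cheaper per prediction, which is why the answer above says GPU often wins at high throughput despite the sticker price.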
How do I A/B test model versions in production?
Canary deployments: route 5-10% of traffic to v2 and 90% to v1, then monitor metrics. If v2 wins, shift traffic gradually (10% → 25% → 50% → 100%); roll back in under 5 minutes on regression. Shadow traffic: send 100% of requests to both, serve v1 but log v2's outputs, and compare offline metrics first. Multi-armed bandit: contextual routing by user segment (e.g., new users on v2, power users on v1). All of these require feature flags plus a request-routing layer (Seldon, Istio, Lambda aliases).
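At its core, canary routing is weighted random selection over model versions, with the weights held in config so traffic can be shifted without a redeploy. A minimal sketch (in production this logic lives in the routing layer, e.g. Seldon or Istio, not in application code):

```python
import random

def route(weights, rng):
    """Pick a model version according to canary traffic weights (summing to 1)."""
    r = rng.random()
    cumulative = 0.0
    for version, weight in weights.items():
        cumulative += weight
        if r < cumulative:
            return version
    return version  # guard against float rounding at the boundary

rng = random.Random(42)
counts = {"v1": 0, "v2": 0}
for _ in range(10_000):
    counts[route({"v1": 0.9, "v2": 0.1}, rng)] += 1
# Roughly a 90/10 split; promoting v2 means editing the weights, not redeploying.
print(counts)
```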
Blue-green model deployment — zero-downtime switching?
Deploy v2 (green) alongside v1 (blue); both run, but only v1 receives traffic. Run smoke tests against v2 in parallel. Once it is healthy, switch the load balancer to v2 instantly; if issues appear, roll back to v1 in under a minute. Requires: two independent model servers, a shared database/cache layer, and config-based routing. Container orchestration (Kubernetes) handles this natively. Cost: 2x infrastructure during the switchover. Alternatives: canary (slower, cheaper) or shadow (no downtime, but delayed feedback).
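The "switch the load balancer instantly" step is just flipping an active pointer between two live backends; rollback flips it back. A toy sketch of that routing idea (the lambdas stand in for real model servers):

```python
class BlueGreenRouter:
    """Config-based routing: flip the active pointer, never redeploy."""

    def __init__(self, blue_model, green_model):
        self.models = {"blue": blue_model, "green": green_model}
        self.active = "blue"  # v1 serves traffic; v2 (green) idles warm

    def predict(self, x):
        return self.models[self.active](x)

    def switch(self):
        # Instant cutover; the old color stays running for fast rollback.
        self.active = "green" if self.active == "blue" else "blue"

router = BlueGreenRouter(blue_model=lambda x: ("v1", x),
                         green_model=lambda x: ("v2", x))
print(router.predict(5))  # ('v1', 5)
router.switch()
print(router.predict(5))  # ('v2', 5)
```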
Model versioning strategies — which approach for production?
Semantic versioning (v1.0.0 = major.minor.patch): major for breaking changes, minor for new features, patch for bug fixes. Model registry (MLflow, Hugging Face Hub): one source of truth for artifacts, metrics, and lineage. Container tagging (myrepo/ml-model:v1.0.0-sha256): immutable and reproducible. For critical models, include the training date, dataset hash, and hyperparameters in the metadata. Rollback strategy: always keep the N-1 version alive; switch versions through config, not redeployment.
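The "dataset hash + hyperparams in metadata" advice can be sketched with the stdlib: hash a snapshot of the training data and store it next to the version and hyperparameters, so any artifact can be traced back to exactly what produced it. The `model_card` helper and its fields are illustrative, not a registry API:

```python
import hashlib
import json

def model_card(version, trained_at, dataset_bytes, hyperparams):
    """Reproducibility metadata to store alongside the model artifact."""
    return {
        "version": version,
        "trained_at": trained_at,
        "dataset_sha256": hashlib.sha256(dataset_bytes).hexdigest(),
        "hyperparams": hyperparams,
    }

card = model_card("v1.0.0", "2024-06-01",
                  b"training-data-snapshot",
                  {"lr": 0.001, "epochs": 10})
print(json.dumps(card, indent=2))
```

In practice a registry like MLflow records this lineage for you; the point is that the dataset hash makes "which data trained this model?" answerable after the fact.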
What are latency budgets and how do I measure them?
Latency budget = your SLA broken down by stage: if the contract is 'respond in 100ms', allocate e.g. 10ms to preprocessing, 70ms to model inference, and 20ms to postprocessing + network. Profile each stage: time preprocessing with `timeit`, use TensorBoard for the inference-time breakdown, and `perf` for system calls. Monitor p50/p95/p99 in production, not just the mean. For streaming/real-time, hold p99 below the SLA (tail latency matters). If over budget: quantize the model, prune layers, batch inference, or upgrade hardware.
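Why p99 rather than the mean: a few slow requests can hide behind a healthy average. A sketch of nearest-rank percentiles over recorded latencies, with a synthetic sample whose mean looks fine while the tail busts a 100ms SLA:

```python
import math

def percentile(samples_ms, p):
    """Nearest-rank percentile of recorded latencies, in milliseconds."""
    ordered = sorted(samples_ms)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

# 100 synthetic samples: most requests are fast, 5% are not.
latencies = [20] * 50 + [40] * 45 + [150] * 5
p50, p95, p99 = (percentile(latencies, p) for p in (50, 95, 99))
print(p50, p95, p99)  # 20 40 150

sla_ms = 100
print(p99 <= sla_ms)  # False -- the average looks healthy, the tail does not
```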
Auto-scaling spikes — handling traffic surges without crashing?
Horizontal scaling: add replicas when load spikes (Kubernetes HPA, AWS Auto Scaling groups behind an ALB). Trigger metrics: CPU >70%, memory >80%, or a custom metric such as request-queue length. Cold-start mitigation: keep a minimum number of replicas warm. Vertical scaling: bigger instances if the bottleneck is a single process (but you hit a ceiling fast). Queue pattern: enqueue requests, process asynchronously, return a job ID for polling. For ML specifically, GPU capacity provisions far more slowly than CPU (minutes), so min replicas must cover baseline traffic plus ~50% headroom. Circuit breaker: reject requests gracefully when capacity is exhausted (return an error, not a timeout).
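The circuit-breaker idea above is a small piece of state: track in-flight requests and fail fast once capacity is hit, instead of letting requests queue up and time out. A minimal single-threaded sketch (a real server would guard the counter with a lock or use a semaphore):

```python
class CircuitBreaker:
    """Shed load once in-flight requests hit capacity: fail fast, don't time out."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.in_flight = 0

    def try_acquire(self):
        if self.in_flight >= self.capacity:
            return False  # reject immediately with a clear error response
        self.in_flight += 1
        return True

    def release(self):
        self.in_flight -= 1

breaker = CircuitBreaker(capacity=2)
results = [breaker.try_acquire() for _ in range(3)]
print(results)  # [True, True, False] -- the third request is shed, not queued

breaker.release()
print(breaker.try_acquire())  # True -- capacity freed, traffic flows again
```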
