
MLOps (ML Operations)

⬢ TIER 2 · Tech
Salary impact: High
Time to learn: 8 months
Difficulty: Hard
Careers: 5
TL;DR

An MLOps engineer bridges machine learning and DevOps: automated training pipelines, model versioning, reproducible deployments, continuous monitoring, and retraining workflows. Career path over 6-9 months: Practitioner (experiment tracking, basic CI/CD, $120-145k) → Senior (feature stores, model serving, A/B testing, $145-180k) → Staff (distributed training, Kubernetes ML, multi-model serving, $180-260k). An estimated 87% of ML projects never reach production; MLOps is the discipline that closes that gap. Projected $126B market by 2025. Used by Netflix, Uber, and Airbnb for production ML systems.

What is MLOps (ML Operations)

MLOps bridges machine learning and production systems. While DevOps automates code deployment (build → test → release), MLOps automates the full ML lifecycle: data pipelines → training → evaluation → deployment → monitoring → retraining. The critical difference: ML models degrade over time (data drift, concept drift) and require continuous monitoring, not just one-time deployment. MLOps engineers own experiment tracking (MLflow, Weights & Biases), feature pipelines (Feast, Tecton), model serving (FastAPI, Ray Serve, KServe), and monitoring systems that detect model degradation and trigger retraining. In 2026, 87% of ML projects still fail to reach production—MLOps is the discipline that closes that gap. The market recognizes this: MLOps engineers command $120–260k salaries depending on seniority and company. Tools like Kubeflow, Apache Airflow, and Seldon Core are industry standard; mastery of them is non-negotiable for any ML platform team.
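To make the experiment-tracking piece concrete, here is a minimal MLflow sketch: it logs hyperparameters, an evaluation metric, and a versioned model artifact for a toy scikit-learn run. The experiment name, dataset, and hyperparameters are illustrative assumptions, not a prescribed setup.

```python
# Minimal experiment-tracking sketch with MLflow (illustrative; assumes a
# scikit-learn model and the default local ./mlruns tracking directory).
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1_000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

mlflow.set_experiment("churn-model")  # experiment name is an example
with mlflow.start_run():
    params = {"n_estimators": 200, "max_depth": 8}
    model = RandomForestClassifier(**params, random_state=42).fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))

    mlflow.log_params(params)                 # hyperparameters
    mlflow.log_metric("accuracy", acc)        # evaluation metric
    mlflow.sklearn.log_model(model, "model")  # versioned model artifact
```

Every run is then comparable in the MLflow UI, which is the first step toward reproducible deployments.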

🔧 TOOLS & ECOSYSTEM
MLflow · Kubeflow · Apache Airflow · DVC · BentoML · Seldon Core · Triton Inference Server · Google Vertex AI · AWS SageMaker · Weights & Biases · Feast · Evidently

💰 Salary by region

Region    Junior    Mid       Senior
USA       $120k     $165k     $220k
UK        £75k      £105k     £160k
EU        €80k      €115k     €175k
Canada    C$125k    C$170k    C$265k

❓ FAQ

MLOps vs DevOps — what's the difference?
DevOps automates software deployment: code → build → test → release. MLOps extends this for ML: data → train → evaluate → deploy → monitor → retrain. The key difference: ML models degrade over time (data drift, concept drift) and require continuous monitoring + retraining, not just deployments. DevOps engineers own infrastructure; MLOps engineers own the model lifecycle.
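To show what the extended lifecycle looks like in practice, here is a sketch of the data → train → evaluate → deploy chain as an Airflow DAG (Airflow 2.4+ style). The task bodies, DAG id, and weekly schedule are placeholder assumptions.

```python
# Sketch of an ML retraining pipeline as an Airflow DAG.
# Task bodies are placeholders; wire in your own training code.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_features(): ...
def train_model(): ...
def evaluate_model(): ...
def deploy_model(): ...


with DAG(
    dag_id="weekly_retraining",      # hypothetical name
    start_date=datetime(2026, 1, 1),
    schedule="@weekly",              # assumes Airflow 2.4+
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_features", python_callable=extract_features)
    train = PythonOperator(task_id="train_model", python_callable=train_model)
    evaluate = PythonOperator(task_id="evaluate_model", python_callable=evaluate_model)
    deploy = PythonOperator(task_id="deploy_model", python_callable=deploy_model)

    extract >> train >> evaluate >> deploy  # the DevOps-style chain, plus ML stages
```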
How do I detect and handle model drift?
Model drift = prediction accuracy drops without code changes. Detect via: (1) Monitor actual labels vs predictions (post-hoc), (2) Track feature distributions (input drift), (3) Monitor prediction confidence (uncertainty drift). Tools: Evidently, WhyLabs, Arize. Response: retrain on recent data, A/B test new model, trigger alerts. For real-time: use Weights & Biases or custom monitoring dashboards.
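A minimal way to implement input-drift detection without committing to any of the tools above is a per-feature two-sample Kolmogorov-Smirnov test against the training reference; Evidently and similar tools wrap the same idea with presets and dashboards. The p-value threshold and the "last 7 days" window below are assumptions.

```python
# Minimal input-drift check: compare each numeric feature's live distribution
# against the training reference with a two-sample KS test.
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp


def drifted_features(reference: pd.DataFrame, current: pd.DataFrame,
                     p_threshold: float = 0.01) -> list[str]:
    """Return numeric columns whose distribution shifted significantly."""
    flagged = []
    for col in reference.select_dtypes(include=np.number).columns:
        _, p_value = ks_2samp(reference[col].dropna(), current[col].dropna())
        if p_value < p_threshold:
            flagged.append(col)
    return flagged


# Example response: alert or kick off retraining when drift is detected.
# if drifted_features(train_df, last_7_days_df):
#     trigger_retraining_pipeline()   # hypothetical hook into your scheduler
```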
What's a feature store and why do I need one?
A feature store (Feast, Tecton, Hopsworks) is a centralized registry of ML features: reusable transformations plus the computed values they produce, shared across models and teams. Why: (1) it avoids training-serving skew (the same feature logic runs in both paths), (2) it enables feature reuse across models, (3) it manages feature freshness and SLAs. For a startup with one or two models, skip it; at 3+ models it pays for itself in prevented bugs and faster iteration.
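The core guarantee can be illustrated in a few lines: register each transformation once and have both the offline (training) and online (serving) paths read from the same registry. This toy sketch omits everything a real feature store adds (storage, point-in-time joins, freshness SLAs); the feature name and columns are hypothetical.

```python
# Toy illustration of the feature-store guarantee: one registered
# transformation feeds both training and serving, so the logic cannot diverge.
import pandas as pd

FEATURE_REGISTRY = {}


def feature(name):
    def register(fn):
        FEATURE_REGISTRY[name] = fn
        return fn
    return register


@feature("avg_order_value_90d")                     # hypothetical feature
def avg_order_value_90d(orders: pd.DataFrame) -> float:
    recent = orders[orders["ts"] >= orders["ts"].max() - pd.Timedelta(days=90)]
    return float(recent["amount"].mean())


def build_training_row(orders: pd.DataFrame) -> dict:   # offline path
    return {name: fn(orders) for name, fn in FEATURE_REGISTRY.items()}


def build_serving_row(orders: pd.DataFrame) -> dict:    # online path reuses the same functions
    return {name: fn(orders) for name, fn in FEATURE_REGISTRY.items()}
```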
Online vs offline serving — which should I use?
Offline: batch scoring on a schedule (e.g. nightly). Fast, cheap, no SLA pressure. Use for: recommendations, reports, ETL. Online: serve predictions in real time via API. Use for: user-facing rankings, fraud detection, real-time personalization. Most mature systems use both: online for user-facing, offline for batch analytics.
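A rough sketch of both modes, assuming a scikit-learn-style model object and pydantic v2: an online FastAPI endpoint for per-request scoring, and an offline batch job that scores a Parquet file on a schedule. Model loading, feature names, and file paths are placeholders.

```python
# Online vs offline serving sketch (illustrative only).
import pandas as pd
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = ...  # placeholder, e.g. joblib.load("model.pkl")

FEATURE_COLS = ["age", "spend_30d"]  # hypothetical features


class Features(BaseModel):
    age: float
    spend_30d: float


@app.post("/predict")                       # online: low latency, per request
def predict(features: Features):
    row = pd.DataFrame([features.model_dump()])  # pydantic v2
    score = model.predict_proba(row[FEATURE_COLS])[0, 1]
    return {"score": float(score)}


def batch_score(input_path: str, output_path: str):  # offline: run on a schedule
    df = pd.read_parquet(input_path)
    df["score"] = model.predict_proba(df[FEATURE_COLS])[:, 1]
    df.to_parquet(output_path)
```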
How do I set up A/B testing for ML models?
Route traffic: X% to model A (baseline), Y% to model B (new). Measure: conversion, engagement, latency, cost. Tools: Seldon Core, Ray Serve, custom Flask/FastAPI logic. Duration: min 2 weeks for 100+ conversions. Canary deploys are safer: start at 5% new model, ramp to 100% if metrics hold. Track via PostHog or custom dashboards.
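A hand-rolled version of the traffic split looks roughly like this; Seldon Core or Ray Serve handle the routing, metrics, and rollback for you. The split percentages, model objects, and logging hook are placeholders.

```python
# Minimal weighted traffic split between a baseline and a candidate model.
import random

MODELS = {"baseline": ..., "candidate": ...}            # loaded model objects (placeholders)
TRAFFIC_SPLIT = {"baseline": 0.90, "candidate": 0.10}   # canary-style 10% to the new model


def log_event(**fields):
    """Placeholder: send variant + prediction to your metrics store (e.g. PostHog)."""
    ...


def route(features):
    # Weighted random assignment; sticky per-user hashing is common in practice.
    variant = random.choices(list(TRAFFIC_SPLIT), weights=list(TRAFFIC_SPLIT.values()))[0]
    prediction = MODELS[variant].predict([features])[0]
    log_event(variant=variant, prediction=prediction)   # feed conversion/latency metrics
    return prediction
```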
How do I manage GPU costs in ML pipelines?
GPUs are expensive (~$3/hour on-demand). Strategies: (1) Spot instances (70% discount, preemption risk), (2) Batch jobs overnight (cheaper off-peak), (3) Model quantization (use smaller models), (4) Multi-GPU per job to amortize startup, (5) Kubernetes autoscaling (scale down idle clusters). Monitor via SageMaker, Vertex AI dashboards. For training: Spot is safe. For serving: mix on-demand + Spot with traffic spilling.
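A back-of-the-envelope comparison using the rough numbers above ($3/GPU-hour on-demand, ~70% Spot discount, some preemption overhead); all figures are illustrative assumptions, not quotes from any provider.

```python
# Rough training-job cost comparison: on-demand vs Spot (illustrative numbers).
ON_DEMAND_PER_GPU_HR = 3.00
SPOT_DISCOUNT = 0.70
PREEMPTION_OVERHEAD = 1.15   # assume ~15% wasted work from restarts

gpu_hours = 8 * 24           # e.g. 8 GPUs for a 24-hour training run

on_demand_cost = gpu_hours * ON_DEMAND_PER_GPU_HR
spot_cost = gpu_hours * ON_DEMAND_PER_GPU_HR * (1 - SPOT_DISCOUNT) * PREEMPTION_OVERHEAD

print(f"on-demand: ${on_demand_cost:,.0f}  spot: ${spot_cost:,.0f}")
# on-demand: $576  spot: $199  -> Spot is usually the default for training
```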
What's training-serving skew and how do I prevent it?
Training-serving skew: model is trained with features computed one way, but served with features computed differently. Example: training uses 90-day average, serving uses 30-day average. Result: model works offline, fails in production. Fix: (1) Use a feature store to guarantee same logic, (2) Unit test feature pipelines, (3) Monitor input distributions post-deploy, (4) Freeze training code and replicate it exactly in serving code.
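One cheap guard is a unit test asserting that the serving-path feature code reproduces the training pipeline's output on the same input. The two functions below are hypothetical stand-ins for your own code paths; ideally they are literally the same function.

```python
# Guarding against training-serving skew with a unit test (pytest-style).
import pandas as pd
import pandas.testing as pdt


def compute_features_training(df: pd.DataFrame) -> pd.DataFrame:
    # e.g. the 90-day rolling average used to build the training set
    return df.assign(avg_90d=df["amount"].rolling(90, min_periods=1).mean())


def compute_features_serving(df: pd.DataFrame) -> pd.DataFrame:
    # If this ever drifts from the training logic (say, a 30-day window),
    # the test below fails before the bug reaches production.
    return compute_features_training(df)


def test_no_training_serving_skew():
    sample = pd.DataFrame({"amount": [10.0, 20.0, 30.0, 40.0]})
    pdt.assert_frame_equal(
        compute_features_training(sample), compute_features_serving(sample)
    )
```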
