▶ Istio vs Linkerd: which service mesh should I pick?
Istio: feature-rich (traffic shifting, WebAssembly extensions, multi-cluster), but heavier (complex configuration, Envoy sidecar overhead of roughly 50MB RAM per pod). Best for enterprises with multiple teams needing fine-grained policies. Linkerd: lightweight (~10MB per pod thanks to its purpose-built Rust micro-proxy), simpler, excellent out-of-the-box observability, and faster to learn. Best for fast-moving teams and cost-conscious deployments. Consul: bridges service discovery (Consul) and mesh (Consul Connect), good for hybrid cloud/VM environments. Rule of thumb: Linkerd if <500 services and small teams; Istio if >1000 services or heavy security/policy requirements; Consul if you're migrating from Consul discovery.
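Istio's traffic shifting, for example, is declared per-route. A minimal sketch, assuming a hypothetical `checkout` service with `v1`/`v2` subsets (names and weights are illustrative, and a matching DestinationRule defining the subsets is assumed):

```yaml
# Hypothetical canary: send 10% of traffic to checkout v2.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: checkout
spec:
  hosts:
    - checkout
  http:
    - route:
        - destination:
            host: checkout
            subset: v1
          weight: 90
        - destination:
            host: checkout
            subset: v2
          weight: 10
```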
▶ What is mTLS and why does a service mesh automate it?
mTLS = mutual TLS between services: the client verifies the server's certificate and the server verifies the client's. Without a mesh you manually manage certificates, rotate keys, and update trust stores across hundreds of services = nightmare. A service mesh automates all of it: issues short-lived certificates (days, not years), rotates them transparently, and has the sidecar proxies terminate and originate TLS on every hop, with no app code changes. Cost: small added latency (2-5ms per hop) and baseline CPU/memory (sidecar overhead).
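In Istio, turning on strict mTLS for the whole mesh is a single resource. A sketch (applying it in the root namespace `istio-system` makes it mesh-wide):

```yaml
# Reject any plaintext traffic between meshed workloads.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT
```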
▶ When should I NOT use a service mesh?
Don't use a service mesh for: (1) <100 services (overhead > value), (2) single-datacenter, simple monoliths being split into microservices (learn microservices patterns first), (3) real-time trading or other latency-critical paths with <5ms budgets (the sidecar adds 2-10ms), (4) cost-sensitive deployments with tight RAM/CPU budgets (each pod carries a 50MB-100MB sidecar), (5) teams that haven't mastered Kubernetes yet (a mesh assumes solid Kubernetes fundamentals). Start with raw Kubernetes Ingress + Prometheus + Jaeger, and graduate to a mesh only when you hit multi-cluster or mTLS-at-scale pain.
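The "start simple" alternative is just a plain Ingress in front of your services. A minimal sketch, assuming an nginx ingress controller and a hypothetical `web` Service:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web
spec:
  ingressClassName: nginx
  rules:
    - host: app.example.com   # illustrative hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web
                port:
                  number: 80
```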
▶ How does sidecar injection work and what's the performance overhead?
Service mesh uses Kubernetes admission webhooks: when a pod is deployed to a mesh-enabled namespace, the API server calls the mesh's mutating webhook, which injects an Envoy sidecar container into the pod spec. Overhead: ~50MB RAM per pod (Istio/Envoy), ~10MB (Linkerd), plus 2-10ms latency per request (every hop traverses the sidecar proxy). Mitigation: Istio's ambient mode (alpha since 1.15, GA in 1.24) replaces per-pod sidecars with a shared per-node ztunnel proxy, cutting per-pod overhead roughly in half at the cost of a node-level failure domain. Profile your workload: measure p99 latency before/after the mesh in staging.
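Marking a namespace as mesh-enabled is just a label that the admission webhook watches. A sketch for Istio (Linkerd uses the `linkerd.io/inject: enabled` annotation instead):

```yaml
# Pods created in this namespace get an Envoy sidecar injected.
apiVersion: v1
kind: Namespace
metadata:
  name: payments   # illustrative name
  labels:
    istio-injection: enabled
```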
▶ How does certificate rotation work in a service mesh?
The mesh (e.g., Istio) runs a built-in certificate authority (istiod): it issues short-lived workload certificates (valid ~24h by default) and delivers them to the sidecar over the SDS API (older Istio releases mounted them at /etc/certs/). The sidecar picks up renewed certs in place, with no pod restart. Rotation is transparent: the mesh handles certificate signing, renewal, and distribution. For external services (outside the mesh), use a ServiceEntry plus a certificate secret (manual, but rare). Never rotate workload certs by hand; the mesh does it. Just set the rotation interval in mesh config (in Istio, the requested cert lifetime is controlled by the proxy's `SECRET_TTL` setting).
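In Istio, that cert lifetime can be tuned mesh-wide through the proxy metadata. A sketch using the IstioOperator API (12h is an illustrative value; 24h is the default):

```yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    defaultConfig:
      proxyMetadata:
        SECRET_TTL: 12h0m0s   # lifetime the agent requests for workload certs
```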
▶ What is ambient mode and should I use it?
Ambient mode = sidecar-free service mesh (alpha since Istio 1.15, GA in 1.24). Instead of per-pod Envoy sidecars, a lightweight per-node ztunnel daemon handles L4 concerns (mTLS, telemetry), with optional waypoint proxies for L7 policy. Benefits: far less memory per pod, simpler observability (no per-pod sidecar to monitor), faster onboarding (no pod restarts to join the mesh). Drawbacks: traffic redirection depends on the istio-cni node agent (eBPF is an optional mode), and not every sidecar-mode feature works in ambient yet. Roadmap: ambient is the direction Istio is heading, but as of April 2026, production-grade use cases still favor sidecar mode.
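Enrolling a namespace in ambient mode is also label-driven, with no sidecar injection and no pod restarts. A sketch:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: payments   # illustrative name
  labels:
    istio.io/dataplane-mode: ambient   # traffic is captured by the node's ztunnel
```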
▶ How do I debug traffic routing and observability in a service mesh?
Use Kiali (visualization dashboard): a live graph of pods, services, and connections. Check Prometheus metrics: `istio_requests_total`, `istio_request_duration_milliseconds`. Enable distributed tracing: Jaeger captures end-to-end request traces. Debug commands: `istioctl analyze`, `istioctl proxy-config routes <pod>`, and `istioctl authn tls-check` on older releases (newer releases inspect mTLS status with `istioctl proxy-config secret <pod>` or `istioctl experimental describe pod <pod>`). For canary deployments: monitor error rate and latency during the traffic shift, and roll back if the error rate exceeds your threshold. Common issues: mismatched namespaces (different mesh labels), missing ServiceEntry (external service not in the mesh), certificate mismatch (enable proxy debug logging, e.g. `istioctl proxy-config log <pod> --level debug`).
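The canary guardrail above can be encoded as a Prometheus alerting rule on `istio_requests_total`. A sketch, assuming a hypothetical `checkout` service and a 1% error-rate threshold:

```yaml
groups:
  - name: canary-guardrails
    rules:
      - alert: CanaryHighErrorRate
        # 5xx fraction over the last 5 minutes, sustained for 2 minutes.
        expr: |
          sum(rate(istio_requests_total{destination_service=~"checkout.*", response_code=~"5.."}[5m]))
            /
          sum(rate(istio_requests_total{destination_service=~"checkout.*"}[5m])) > 0.01
        for: 2m
```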