▶ REST vs gRPC vs GraphQL for inter-service communication — when do I pick which?
REST: easy to understand, HTTP caching, wide tooling, but verbose payloads and one request/response per call. gRPC: binary protocol (Protobuf) over HTTP/2, streaming, payloads that are much smaller and faster to serialize than JSON (the exact speedup is workload-dependent), but browsers can't speak it natively (you need a grpc-web proxy or gateway) and debugging binary traffic is harder. GraphQL: clients query exactly the fields they need (no over-fetching), but it adds query parsing/validation overhead and isn't great for high-throughput, low-latency services. Use REST for public APIs and simple internal services, gRPC for high-throughput, latency-sensitive service meshes, GraphQL for flexible client queries (BFF/frontend). Most modern microservice systems use both: gRPC inside, REST outside.
▶ Synchronous vs asynchronous communication — how do I decide?
Synchronous (REST, gRPC): caller waits, immediate feedback, request/response is natural. Downsides: tight coupling, cascading failures, harder to scale. Asynchronous (message queues, pub/sub): fire-and-forget, decoupled, resilient to failures, enables event-driven architectures. Downsides: eventual consistency, harder to debug, ordering challenges. Rule: default to async for anything that doesn't require immediate feedback (orders, events, notifications). Use sync only for low-latency critical paths. Most apps use both: async for order processing and events, sync for payment/auth.
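A minimal sketch of the two styles side by side. The in-process `queue.Queue` stands in for a real broker (Kafka, RabbitMQ, SQS), and the function and event names are illustrative, not from any real API:

```python
import queue

def charge_card_sync(amount: int) -> str:
    """Synchronous style: the caller blocks until it gets an answer.
    Imagine an HTTP/gRPC round trip to a payment service here."""
    return f"charged {amount}"

# Stand-in for a message broker topic.
order_events: "queue.Queue[dict]" = queue.Queue()

def place_order_async(order_id: str) -> None:
    """Asynchronous style: publish an event and return immediately.
    Whoever consumes the queue does the work later."""
    order_events.put({"type": "OrderPlaced", "order_id": order_id})

# Sync path: payment/auth needs an immediate answer.
receipt = charge_card_sync(999)

# Async path: order fulfillment can happen whenever a consumer picks it up.
place_order_async("o-42")
event = order_events.get()
print(receipt, event["type"])
```

The sync caller can't proceed until the payment answer arrives; the async publisher is done the moment the event is enqueued, which is exactly why the async side decouples but also why failures surface later.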
▶ How do I handle retries, idempotency, and exactly-once delivery?
Idempotency is essential: message handlers must be safe to call 2+ times with the same input (same result). Use unique identifiers (idempotency keys) or make the operation naturally idempotent (e.g. upsert or set-to-value instead of increment). Retries: exponential backoff (1s, 2s, 4s, ...) with jitter to prevent thundering herd. Dead-letter queues (DLQ) for messages that fail after N retries. Exactly-once delivery is hard: most systems offer at-least-once delivery plus idempotent handlers. Some brokers offer stronger guarantees (e.g. Kafka transactions, RabbitMQ Streams publisher deduplication) for closer-to-exactly-once semantics.
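Both techniques above fit in a few lines. This is a hedged sketch: the in-memory `processed` set stands in for a durable store (a DB table or Redis), and the handler and key names are illustrative:

```python
import random

# Durable store of already-seen idempotency keys (in-memory stand-in).
processed: set = set()

def handle_payment(idempotency_key: str, amount: int) -> bool:
    """Returns True only on the first delivery of a given key;
    redeliveries are safely ignored, so at-least-once is fine."""
    if idempotency_key in processed:
        return False  # duplicate delivery: no-op
    processed.add(idempotency_key)
    # ... actually move the money exactly once here ...
    return True

def backoff_delays(base: float = 1.0, retries: int = 4) -> list:
    """Exponential backoff with full jitter: each attempt waits a random
    amount in [0, base * 2**n), so retrying clients don't stampede."""
    return [random.uniform(0, base * 2 ** n) for n in range(retries)]

assert handle_payment("key-1", 100) is True   # first delivery processed
assert handle_payment("key-1", 100) is False  # redelivery is a no-op
```

Full jitter (random in the whole window rather than a fixed doubled delay) is what actually breaks up the thundering herd when many consumers retry at once.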
▶ What's the saga pattern and when do I use it?
Saga pattern is for distributed transactions across microservices (you can't hold ACID locks across databases). Two flavors: orchestration (central saga coordinator) and choreography (event-driven, each service listens/reacts). Example: Order → Reserve Inventory → Process Payment → Ship. If payment fails, compensating transactions undo the completed steps (Release Inventory, etc.). Orchestration is easier to understand and debug but adds a central point of failure. Choreography is more resilient but harder to trace. Use it when you need multi-step transactional flows across service boundaries.
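The orchestration flavor can be sketched as a loop over (step, compensation) pairs, rolling back completed compensations in reverse on failure. The step names are the ones from the order example; the functions are hypothetical stand-ins for calls to real services:

```python
def run_saga(steps, fail_at=None):
    """steps: list of (step_name, compensation_name) pairs.
    Runs steps in order; if one fails, runs the compensations of the
    already-completed steps in reverse. Returns the log of actions."""
    log, completed_compensations = [], []
    for name, compensation in steps:
        if name == fail_at:  # simulate this step failing
            log.append(f"{name} FAILED")
            for comp in reversed(completed_compensations):
                log.append(comp)  # roll back in reverse order
            return log
        log.append(name)
        completed_compensations.append(compensation)
    return log

steps = [
    ("reserve_inventory", "release_inventory"),
    ("process_payment", "refund_payment"),
    ("ship_order", "cancel_shipment"),
]

print(run_saga(steps, fail_at="process_payment"))
# inventory is reserved, payment fails, release_inventory compensates
```

Note the reverse order of compensations: later steps may depend on earlier ones, so undo runs last-in, first-out.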
▶ How do I ensure schema evolution without breaking consumers?
Use versioning in your contract: API versions (v1/v2 routes), message envelope versions, or backward-compatible schema evolution rules for Protobuf/Avro. Producers can add optional fields; consumers ignore unknown fields. Avoid removing or renaming fields; deprecate instead (in Protobuf, reserve the old field number). Use a schema registry (e.g. Confluent Schema Registry for Kafka) to track versions and enforce compatibility. AsyncAPI/OpenAPI specs document contracts explicitly. Always test producers and consumers together in integration tests.
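The "consumers ignore unknown fields" rule in miniature, using plain dicts as a stand-in for a versioned message envelope (the field names are illustrative, not a real schema):

```python
def parse_order_v1(msg: dict) -> dict:
    """A v1 consumer reads only the fields it knows about; any extra
    fields a newer producer added are simply ignored."""
    return {"order_id": msg["order_id"], "amount": msg["amount"]}

v1_msg = {"version": 1, "order_id": "o-1", "amount": 100}
v2_msg = {"version": 2, "order_id": "o-2", "amount": 200,
          "currency": "EUR"}  # new optional field added by the producer

# The old consumer handles both message versions without a deploy.
assert parse_order_v1(v1_msg) == {"order_id": "o-1", "amount": 100}
assert parse_order_v1(v2_msg) == {"order_id": "o-2", "amount": 200}
```

This is exactly the behavior Protobuf and Avro deserializers give you for free, which is why "add optional, never remove or rename" keeps old consumers working.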
▶ What's event sourcing and when is it worth it?
Event sourcing: instead of storing state, store an immutable log of all events that led to that state. Replay events to rebuild state. Advantages: perfect audit trail, temporal queries (what was the order status at 3pm?), no dual-writes. Disadvantages: complex (eventual consistency, snapshots, event upcasting), huge log size, different query patterns. Use for domains with strong audit needs (finance, healthcare, event tracking) or where history matters. Don't use for everything β most apps are fine with CRUD + change logs.
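The core idea, store events and fold them into state, fits in a short sketch. The event types and fields are illustrative:

```python
# Append-only event log: the source of truth, never mutated.
events = [
    {"type": "OrderPlaced", "item": "book", "qty": 2},
    {"type": "QtyChanged", "qty": 3},
    {"type": "OrderShipped"},
]

def rebuild(log):
    """Replay the event log from the start to reconstruct current state."""
    state = {}
    for e in log:
        if e["type"] == "OrderPlaced":
            state = {"item": e["item"], "qty": e["qty"], "status": "placed"}
        elif e["type"] == "QtyChanged":
            state["qty"] = e["qty"]
        elif e["type"] == "OrderShipped":
            state["status"] = "shipped"
    return state

print(rebuild(events))      # current state from full history
print(rebuild(events[:2]))  # temporal query: state before shipping
```

Replaying a prefix of the log is the temporal query from the paragraph above ("what was the order status at 3pm?"); snapshots exist precisely because replaying millions of events from the start gets expensive.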
▶ How do I implement circuit breakers and prevent cascading failures?
Circuit breaker pattern: if a service fails N times in X seconds, 'open' the circuit (fail fast, don't call it). After Y seconds, 'half-open' and try again. Libraries: Resilience4j (Java), Polly (.NET), Brakes (Node). Use with timeouts + bulkheads (thread pools) to isolate failures. Combine with retries (with backoff) and fallbacks. Example: if PaymentService is down, reject orders instead of hanging + piling up requests. Monitor circuit state in dashboards. Most production systems use Resilience4j or service mesh (Istio) for circuit breaking.
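A minimal sketch of the three states (closed, open, half-open) described above. The thresholds and the injectable clock are illustrative; in production you'd reach for one of the libraries named above rather than hand-rolling this:

```python
import time

class CircuitBreaker:
    """Closed: calls pass through. Open: fail fast without calling.
    Half-open: after reset_after seconds, allow one trial call."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed
        self.clock = clock     # injectable for testing

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # trip the circuit
            raise
        self.failures = 0  # success closes the circuit again
        return result
```

While open, `call` raises immediately instead of hanging on a dead dependency, which is what stops the pile-up of in-flight requests that causes cascading failures.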