▶Kafka vs RabbitMQ β which should I use?
Kafka: a persistent, replayable log with very high throughput (trillions of events/day at the largest deployments), topic-based pub/sub, distributed by default. RabbitMQ: a traditional message queue with immediate delivery, lower latency for small workloads, and simpler operations. Use Kafka for data pipelines (logging, event streaming, analytics); use RabbitMQ for task queues (job processing, microservice messaging). Kafka is the de facto standard for event-driven architectures at scale.
▶What is exactly-once semantics and why is it hard?
Exactly-once = every event processed exactly one time, no duplicates and no loss. Hard because: (1) distributed systems can fail mid-processing, (2) state must be stored atomically with the consumer offset, (3) retries must be idempotent. Kafka Streams handles this via transactional writes: the result and the offset commit go into a single transaction. Cost: additional latency plus state store overhead. Enable with `processing.guarantee: exactly_once_v2` (not the deprecated `exactly_once` v1).
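A minimal config sketch for enabling EOS v2; the `application.id` and `bootstrap.servers` values are placeholders, and string keys are used here instead of the `StreamsConfig` constants so the snippet has no Kafka dependency:

```java
import java.util.Properties;

public class ExactlyOnceConfig {
    // Build a minimal Kafka Streams configuration with exactly-once v2 enabled.
    public static Properties build() {
        Properties props = new Properties();
        props.put("application.id", "orders-pipeline");       // hypothetical app id
        props.put("bootstrap.servers", "localhost:9092");     // placeholder broker
        props.put("processing.guarantee", "exactly_once_v2"); // EOS v2 (Kafka 3.0+)
        // With EOS, commits happen at transaction boundaries, so the commit
        // interval also bounds the latency added by transactional writes.
        props.put("commit.interval.ms", "100");
        return props;
    }
}
```

In real code you would pass this `Properties` object straight to the `KafkaStreams` constructor.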
▶How do I handle late-arriving data and out-of-order events?
Windowing with a grace period delays window closure to catch late events: `TimeWindows.ofSizeAndGrace(Duration.ofMinutes(1), Duration.ofMinutes(5))` keeps each one-minute window open for five extra minutes (the older `.until()` API is deprecated). Out-of-order: use event timestamps (not processing time) via a custom `TimestampExtractor`. For critical accuracy: wider windows (hourly vs per-minute) and state stores for deduplication. Because aggregation is keyed by event time, replaying historical data reproduces the same windowed results.
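The grace-period rule reduces to plain arithmetic. This is an illustrative sketch of what size-and-grace windowing decides internally, not Kafka Streams code; the class and method names are made up for the example:

```java
public class GraceWindowing {
    // Align an event's *event-time* timestamp to its tumbling window start.
    public static long windowStart(long eventTimeMs, long windowSizeMs) {
        return eventTimeMs - (eventTimeMs % windowSizeMs);
    }

    // A window [start, start + size) stays open until observed stream time
    // passes start + size + grace; events arriving after that are dropped
    // as "too late" and typically counted in a late-record metric.
    public static boolean accepted(long eventTimeMs, long streamTimeMs,
                                   long windowSizeMs, long graceMs) {
        long windowEnd = windowStart(eventTimeMs, windowSizeMs) + windowSizeMs;
        return streamTimeMs < windowEnd + graceMs;
    }
}
```

E.g. an event stamped at 65s belongs to the [60s, 120s) one-minute window; with a 5-minute grace it is still accepted while stream time is below 420s.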
▶Kafka Streams vs Apache Flink β when do I pick which?
Kafka Streams: a library embedded in your app, stateful, runs next to your data, simpler for Kafka-native pipelines, Java/Scala only. Flink: a framework with a dedicated cluster, higher operational overhead, but first-class SQL, polyglot APIs (Java/Python/SQL), and stronger complex event processing (CEP). Kafka Streams wins if you control the app and the pipeline is Kafka-to-Kafka; Flink wins if you need a shared cluster serving multiple teams and use cases.
▶How do I evolve schemas without breaking consumers?
Use Schema Registry with Avro/Protobuf and compatibility checks enabled. `BACKWARD` means a new schema can read old data, `FORWARD` means old consumers can read new data, `FULL` means both. Add new fields as optional with defaults. Never rename or remove fields without a deprecation window. Test schema changes against `.avsc` files locally before pushing to production.
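As a sketch, here is a hypothetical `UserEvent` Avro schema after a compatible evolution: the `referrer` field is new, nullable, and carries a default, so new readers fill in the default when decoding old records (BACKWARD) and old readers simply ignore the extra field in new records (FORWARD):

```json
{
  "type": "record",
  "name": "UserEvent",
  "fields": [
    {"name": "user_id",  "type": "string"},
    {"name": "action",   "type": "string"},
    {"name": "referrer", "type": ["null", "string"], "default": null}
  ]
}
```

Removing `referrer` later, or renaming `action`, would break one direction or the other; hence the deprecation window.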
▶When should I NOT use Kafka?
Don't use Kafka for: (1) request-response patterns (use HTTP/gRPC), (2) sub-10ms latency requirements (broker round-trips alone cost 5-10ms), (3) small data volumes (Kafka's overhead plus cluster cost doesn't pay off under ~10k events/sec), (4) transactional consistency across services (use distributed transactions or sagas instead), (5) single-machine deployments (use SQLite + polling).
▶How do I monitor consumer lag and detect problems?
Consumer lag = log-end offset minus committed offset, per partition. Monitor via Confluent Control Center, Prometheus (scraping JMX metrics), or the Kafka Admin API. Alerting: lag growing for over an hour = investigate. Check: (1) are the consumers running, (2) is processing stuck (check app logs), (3) is the topic still receiving data. Consumer parallelism is capped at the partition count; more partitions = more parallelism. Use a dedicated lag-monitoring tool (Burrow, Kafka Exporter) for multi-cluster visibility.
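The lag formula above, as a self-contained sketch. In production the two offset maps would typically come from `AdminClient.listOffsets()` and `AdminClient.listConsumerGroupOffsets()`; here they are plain inputs so the arithmetic stands alone:

```java
import java.util.Map;

public class LagMonitor {
    // Per-partition lag: log-end offset minus committed offset, clamped at
    // zero (a commit can briefly race ahead of a stale end-offset read).
    public static long partitionLag(long logEndOffset, long committedOffset) {
        return Math.max(0L, logEndOffset - committedOffset);
    }

    // Total lag for a consumer group: sum over partitions. A partition with
    // no committed offset yet is treated as fully lagged from offset 0.
    public static long totalLag(Map<Integer, Long> endOffsets,
                                Map<Integer, Long> committed) {
        long total = 0;
        for (Map.Entry<Integer, Long> e : endOffsets.entrySet()) {
            long c = committed.getOrDefault(e.getKey(), 0L);
            total += partitionLag(e.getValue(), c);
        }
        return total;
    }
}
```

Alert on the trend (total lag growing steadily), not a single snapshot: a momentary spike during a traffic burst is normal.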