▶Queue vs Pub/Sub — what's the difference and when do I use each?
Queues (RabbitMQ, SQS): one-to-one message delivery. Producer sends → single consumer processes → message is gone. Use for: tasks (send email, process payment), load balancing, FIFO ordering. Pub/Sub (Kafka, SNS, Google Cloud Pub/Sub): one-to-many. Producer publishes → every subscriber receives its own copy independently. Use for: events (user signed up, order placed), analytics, multiple systems reacting. Hybrid: SNS + SQS fan-out (publish events to an SNS topic, route to 10 SQS queues for different consumers).
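The two delivery models above can be sketched with a minimal in-process broker (all names here are hypothetical, not a real client API): a queue hands each message to exactly one consumer, while pub/sub gives every subscriber its own copy.

```python
from collections import defaultdict, deque

# Hypothetical in-process sketch contrasting the two models; not a real broker.
class Broker:
    def __init__(self):
        self.queue = deque()                  # queue: one shared buffer
        self.subscribers = defaultdict(list)  # pub/sub: one inbox per subscriber

    def send(self, msg):          # queue semantics: each message consumed once
        self.queue.append(msg)

    def receive(self):
        return self.queue.popleft() if self.queue else None

    def subscribe(self, name):    # pub/sub semantics: register an inbox
        return self.subscribers[name]

    def publish(self, msg):       # every subscriber gets an independent copy
        for inbox in self.subscribers.values():
            inbox.append(msg)

broker = Broker()
broker.send("send-welcome-email")            # task -> exactly one worker
task = broker.receive()

analytics = broker.subscribe("analytics")    # event -> every subscriber
billing = broker.subscribe("billing")
broker.publish("user.signed_up")
```

After the publish, both `analytics` and `billing` hold their own copy of `"user.signed_up"`, while the task was consumed exactly once.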
▶How do I guarantee exactly-once message processing?
Exactly-once is hard: (1) Kafka: idempotent producer (enable.idempotence=true) + transactions, with consumers reading at isolation.level=read_committed (costs throughput). (2) At-least-once + idempotent consumer (safer, simpler): process the message, write the result and the offset to the DB in the same transaction, skip duplicates on redelivery. RabbitMQ: ack after processing, not before. SQS: rely on the visibility timeout and delete the message only after processing. Real answer: exactly-once with <100ms latency costs 2-3x more than at-least-once; pick at-least-once + idempotency for 99% of cases.
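The idempotent-consumer pattern from option (2) can be sketched like this: the broker may redeliver (at-least-once), so each message carries a unique ID and the consumer skips IDs it has already processed. A plain set stands in here for the DB table that would be written in the same transaction as the side effect.

```python
# Sketch of at-least-once delivery + an idempotent consumer (illustrative only).
processed_ids = set()
results = []

def handle(message):
    msg_id, payload = message["id"], message["payload"]
    if msg_id in processed_ids:
        return "skipped-duplicate"   # duplicate redelivery: no double charge
    results.append(payload)          # the side effect (e.g. charge a card)
    processed_ids.add(msg_id)        # recorded "in the same transaction"
    return "processed"

# The broker redelivers message 1; the consumer stays correct anyway:
deliveries = [{"id": 1, "payload": "charge $10"},
              {"id": 1, "payload": "charge $10"},   # duplicate delivery
              {"id": 2, "payload": "charge $5"}]
outcomes = [handle(m) for m in deliveries]
```

The duplicate delivery is detected and skipped, so the card is charged once per message ID even though the broker delivered it twice.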
▶What's a dead-letter queue (DLQ) and why do I need it?
A queue for messages that failed processing after N retries. Flow: (1) Consumer receives the message, (2) Processing fails, (3) Message is re-queued, (4) After 3 retries, the message goes to the DLQ. The DLQ holds failed messages for debugging: log them, alert ops, replay them manually later, or send them to Slack. Without a DLQ, failed messages are either lost (dropped) or redelivered forever, blocking the queue (a poison message). Always set up a DLQ for production.
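The retry-then-DLQ flow above can be simulated in a few lines (an in-process sketch; real brokers move the message via a redrive/dead-letter policy rather than consumer code):

```python
from collections import deque

MAX_RETRIES = 3
main_queue, dlq = deque(), deque()

def consume(queue, dead_letter_queue, process):
    while queue:
        msg = queue.popleft()
        try:
            process(msg["body"])
        except Exception:
            msg["retries"] += 1
            if msg["retries"] >= MAX_RETRIES:
                dead_letter_queue.append(msg)   # park for debugging/replay
            else:
                queue.append(msg)               # re-queue for another attempt

main_queue.append({"body": "bad-payload", "retries": 0})
main_queue.append({"body": "ok", "retries": 0})

seen = []
def process(body):
    if body == "bad-payload":
        raise ValueError("cannot parse")       # poison message
    seen.append(body)

consume(main_queue, dlq, process)
```

The poison message fails three times and lands in the DLQ with its retry count attached, while the healthy message is processed normally and the main queue drains instead of halting.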
▶How do message ordering guarantees work across different systems?
RabbitMQ: FIFO per queue (not distributed). Kafka: FIFO per partition (messages with the same partition key land on the same partition, so they stay ordered). SQS: FIFO queues cost more and cap throughput but guarantee ordering; standard queues are best-effort. Redis Streams: FIFO by design. For multi-consumer scenarios: Kafka scales ordering via partitions (different keys go to different partitions, each partition consumed by one member of the consumer group), while RabbitMQ needs a single consumer for strict FIFO. Lesson: ordering + horizontal scaling requires a partitioning strategy.
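The key-based partitioning idea can be sketched as follows. Kafka's default partitioner hashes the key bytes with murmur2; Python's built-in `hash()` stands in here, and the point is the invariant, not the hash function: same key → same partition → per-key order, even with one consumer per partition.

```python
# Sketch of Kafka-style key partitioning (hash() is a stand-in for murmur2).
NUM_PARTITIONS = 4
partitions = [[] for _ in range(NUM_PARTITIONS)]

def partition_for(key):
    return hash(key) % NUM_PARTITIONS

def produce(key, value):
    partitions[partition_for(key)].append(value)

for i in range(3):
    produce("order-42", f"event-{i}")  # same key -> same partition, ordered
produce("order-99", "other-event")     # different key may land elsewhere

order_42_partition = partitions[partition_for("order-42")]
```

All three events for `order-42` sit in one partition in publish order; total ordering across keys is deliberately given up in exchange for parallelism.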
▶What is backpressure and how do I handle it?
Backpressure: the producer publishes faster than the consumer can process, the queue fills up, and memory/disk is exhausted. Solutions: (1) Consumer pull model (SQS, Kafka) — the consumer asks for N messages when it is ready, (2) Batch processing — consume 100, process in parallel, (3) Auto-scaling — add consumers when queue depth > threshold, (4) Rate limiting — the producer publishes slower. Use prefetch limits (RabbitMQ: basic.qos prefetch=10; Kafka: max.poll.records — batch.size is a producer-side setting) to avoid overwhelming the consumer. Monitor: queue depth / consumer lag (Kafka lag = latest offset minus consumer offset).
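A bounded buffer makes the backpressure signal concrete (an in-process sketch using Python's stdlib `queue`, not a broker client): when the producer outruns the consumer, `put_nowait` raises `queue.Full`, and that exception is the signal to shed load, block, or slow down.

```python
import queue

# Sketch of pull-based backpressure with a bounded buffer.
buf = queue.Queue(maxsize=5)       # cap stands in for prefetch/buffer limits

dropped = 0
for i in range(8):                 # producer is faster than the consumer
    try:
        buf.put_nowait(i)
    except queue.Full:
        dropped += 1               # backpressure: here we shed load;
                                   # blocking or retrying also works

PREFETCH = 3                       # consumer pulls at most N messages at once
batch = [buf.get_nowait() for _ in range(min(PREFETCH, buf.qsize()))]
```

Three of the eight messages are rejected at the buffer instead of exhausting memory, and the consumer pulls a bounded batch when it is ready — the same shape as a prefetch limit.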
▶How do I choose the right message queue for my architecture?
RabbitMQ: complex routing (header and topic exchanges), priorities, 10k+ msgs/sec, comfortable for a traditional ops team. Kafka: high throughput (1M+ msgs/sec), event streaming, analytics, retention of days or longer, rich tooling but a steep ops curve. SQS: AWS-native, serverless, simple, pay-per-request. Redis Streams: single-node, sub-millisecond latency, fits a team already running Redis. NATS: lightweight, IoT/edge. Heuristic: SQS if <100k msgs/day and AWS-only, RabbitMQ if complex topology, Kafka if >1M msgs/day or events need history.
▶How do I monitor and alert on queue health?
Metrics: (1) Queue depth (pending messages), (2) Consumer lag (how far behind the consumer is), (3) Message age (oldest unprocessed message), (4) Error rate (failed messages), (5) Throughput (msgs/sec). Tools: Kafka UI, RabbitMQ management UI, CloudWatch (SQS), Datadog. Alerts: lag > 10 min OR queue depth > 1M OR error rate > 1%. Set up dashboards per queue type. For Kafka: use Burrow or Confluent Control Center for offset tracking. For RabbitMQ: enable the management plugin (rabbitmq_management).
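The lag metric and the alert rule can be written down directly. A sketch of both, where lag is Kafka-style (log-end offset minus committed offset, summed over partitions); the 100k-message lag threshold is an assumption, since the "lag > 10 min" rule above is time-based and would use message age instead:

```python
# Sketch of queue-health checks; thresholds are illustrative assumptions.
def total_lag(end_offsets, committed_offsets):
    """Sum of per-partition lag: log-end offset minus committed offset."""
    return sum(end_offsets[p] - committed_offsets.get(p, 0) for p in end_offsets)

def should_alert(lag_messages, queue_depth, error_rate,
                 max_lag=100_000, max_depth=1_000_000, max_error_rate=0.01):
    return (lag_messages > max_lag
            or queue_depth > max_depth
            or error_rate > max_error_rate)

end = {0: 1_200, 1: 950}        # latest offset per partition
committed = {0: 1_000, 1: 950}  # consumer's committed offset per partition
lag = total_lag(end, committed) # 200 messages behind
```

Any one threshold tripping should page — a deep queue with zero lag growth and a shallow queue with exploding error rate are both unhealthy, just differently.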