Design a Chat App (WhatsApp/Messenger)

Design 1:1 and group messaging with delivery receipts, online presence, and offline message storage.

Required building blocks
WebSockets
Message Queue
Wide-Column Store
Load Balancer
Pub/Sub
Nice to have
Sharding
Object Storage (Blob)
Canonical answer

Persistent WebSocket per client; message queue for offline delivery; wide-column store keyed by (conversation_id, ts) for message history; object storage for media attachments.

Capacity estimation
  • 1B registered users, 500M DAU; avg 40 messages/user/day → 20B messages/day ≈ 230K writes/sec on average, ~600K/sec at peak.
  • Persistent WS: 500M DAU, ~20% concurrent → 100M open sockets across ~50K gateway nodes (≈2K conns/node).
  • Message storage: 20B × ~200 B ≈ 4 TB/day, ~1.5 PB/year in wide-column store.
  • Media: 10% of msgs carry an attachment averaging 500 KB → 2B × 500 KB ≈ 1 PB/day to object storage; CDN egress dominates bandwidth.
  • Presence: 100M heartbeats every 30s → ~3.3M ops/sec into Redis presence cluster.
  • Group fan-out: avg group size 20 → effective write amplification 1.05x (most msgs are 1:1).
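The arithmetic behind these bullets can be re-derived in a few lines. This is only a sanity-check sketch using decimal units (1 TB = 10^12 B); the variable names are illustrative, and the implied peak factor is back-computed from the ~600K/sec figure rather than stated anywhere in the estimate.

```python
# Back-of-envelope check of the capacity bullets (decimal units).
SECONDS_PER_DAY = 86_400

dau = 500_000_000
msgs_per_user_per_day = 40
msgs_per_day = dau * msgs_per_user_per_day             # 20 billion
avg_writes_per_sec = msgs_per_day / SECONDS_PER_DAY    # roughly 231K
implied_peak_factor = 600_000 / avg_writes_per_sec     # roughly 2.6x average

concurrent_sockets = dau * 20 // 100                   # 20% of DAU = 100 million
gateway_nodes = 50_000
conns_per_node = concurrent_sockets // gateway_nodes   # 2,000

msg_bytes = 200
text_bytes_per_day = msgs_per_day * msg_bytes          # 4e12 B = 4 TB
text_bytes_per_year = text_bytes_per_day * 365         # 1.46e15 B, about 1.5 PB

media_msgs_per_day = msgs_per_day * 10 // 100          # 10% of messages = 2 billion
media_bytes_per_day = media_msgs_per_day * 500_000     # 1e15 B = 1 PB

heartbeat_ops_per_sec = concurrent_sockets / 30        # roughly 3.3 million
```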
Architecture
Mobile / Desktop
       │  (WSS)
       ▼
  Edge LB / TLS terminator
       │
       ▼
  WS Gateway Cluster ◀──→ Redis Presence (user_id → node_id)
       │
       ▼
  Chat Service ──→ Kafka (partitioned by conversation_id)
       │              │
       │              ▼
       │         Persist Worker ─→ Cassandra (messages)
       │              │
       │              └─→ Push Worker ─→ APNs / FCM (offline)
       ▼
  Media Service ─→ S3 / Object Store ─→ CDN
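One way to read the diagram is as a routing decision per recipient: if presence maps the user to a gateway node, the message goes to that node's socket; otherwise it falls through to the push path. The sketch below assumes that reading. `Router`, `gateway_outbox`, and `push_queue` are hypothetical in-memory stand-ins for the Redis presence lookup, the WS gateway, and the APNs/FCM push worker.

```python
from dataclasses import dataclass, field


@dataclass
class Router:
    # user_id -> node_id; stands in for the Redis presence lookup (assumption).
    presence: dict
    # node_id -> list of (user_id, message) frames destined for that gateway.
    gateway_outbox: dict = field(default_factory=dict)
    # (user_id, message) pairs handed to the push worker for offline users.
    push_queue: list = field(default_factory=list)

    def deliver(self, sender_id, member_ids, message):
        """Route one message to every conversation member except the sender."""
        for user_id in member_ids:
            if user_id == sender_id:
                continue
            node_id = self.presence.get(user_id)
            if node_id is not None:
                # Recipient is online: forward to the gateway holding the socket.
                self.gateway_outbox.setdefault(node_id, []).append((user_id, message))
            else:
                # Recipient is offline: fall back to a push notification.
                self.push_queue.append((user_id, message))
```

In this sketch a group message is simply a call with a longer `member_ids` list, which is where the fan-out cost in the capacity estimate would show up.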
API
  • WS /connect (auth header) — bidirectional message + receipt frames
  • POST /messages { conversation_id, body, client_msg_id } → { message_id, ts }
  • GET /conversations/:id/messages?before=ts → { messages[] }
  • POST /media (multipart) → { media_id, url } (then attach to message)
  • POST /receipts { message_id, status: delivered|read } → 204
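The `client_msg_id` field in POST /messages is presumably there so that a retried send can be deduplicated on the server. A minimal sketch of that idea, assuming the dedupe key is (sender_id, client_msg_id); `MessageAPI` and its in-memory dict are hypothetical, and a real service would keep this state in a shared store with an expiry.

```python
import itertools
import time


class MessageAPI:
    def __init__(self):
        # (sender_id, client_msg_id) -> response already returned for that send.
        self._seen = {}
        self._ids = itertools.count(1)

    def post_message(self, sender_id, conversation_id, body, client_msg_id):
        """Return the same {message_id, ts} for a retry of the same client send."""
        key = (sender_id, client_msg_id)
        if key in self._seen:
            # Retry of a send we already accepted: do not create a duplicate.
            return self._seen[key]
        response = {"message_id": next(self._ids), "ts": time.time()}
        self._seen[key] = response
        return response
```

With this shape, a client that times out and resends gets back the original `message_id` instead of posting the message twice.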
Data model
messages (wide-column, partition key = conversation_id, clustering key = ts):
  conversation_id (PK) : uuid
  ts                   : timeuuid
  sender_id            : uuid
  body                 : string
  media_id             : uuid?
  status               : enum(sent|delivered|read)
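Partitioning by conversation_id and clustering by ts is what makes GET /conversations/:id/messages?before=ts a single-partition range read. The sketch below imitates that layout in memory, assuming integer timestamps in place of timeuuid; `MessageStore` and `page_before` are illustrative names, not part of any driver API.

```python
import bisect
from collections import defaultdict


class MessageStore:
    def __init__(self):
        # conversation_id -> rows kept sorted by ts, like a clustering order.
        self._partitions = defaultdict(list)

    def append(self, conversation_id, ts, sender_id, body):
        # Insert into the partition while preserving ts order.
        bisect.insort(self._partitions[conversation_id], (ts, sender_id, body))

    def page_before(self, conversation_id, before_ts, limit=50):
        """Return up to `limit` rows with ts < before_ts, newest first."""
        rows = self._partitions.get(conversation_id, [])
        # (before_ts,) sorts ahead of any full row sharing that ts,
        # so the slice below excludes rows with ts == before_ts.
        end = bisect.bisect_left(rows, (before_ts,))
        return rows[max(0, end - limit):end][::-1]
```

A client scrolling back through history would pass the oldest ts it has seen as `before_ts` on each request.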

conversations (KV):
  conversation_id (PK) : uuid
  type                 : enum(direct|group)
  member_ids           : list<uuid>
  last_msg_ts          : timestamp

presence (Redis, TTL 60s):
  user_id              : string
  node_id              : string
  last_seen            : timestamp
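With a 30s heartbeat and a 60s TTL, a client can miss one heartbeat before it stops appearing online. The sketch below models that expiry in memory with an explicit clock so the behaviour is easy to follow; `PresenceRegistry` is a hypothetical stand-in for the Redis presence keys, not a Redis client.

```python
class PresenceRegistry:
    TTL_SECONDS = 60  # matches the TTL on the presence record

    def __init__(self):
        # user_id -> (node_id, last_seen)
        self._entries = {}

    def heartbeat(self, user_id, node_id, now):
        """Record (or refresh) which gateway node holds the user's socket."""
        self._entries[user_id] = (node_id, now)

    def lookup(self, user_id, now):
        """Return the user's node_id, or None if absent or expired."""
        entry = self._entries.get(user_id)
        if entry is None:
            return None
        node_id, last_seen = entry
        if now - last_seen >= self.TTL_SECONDS:
            # Entry outlived its TTL: treat the user as offline.
            del self._entries[user_id]
            return None
        return node_id
```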
Concept blurbs
WebSockets
Persistent bidirectional connection for low-latency push to clients.
Message Queue
Decouple producers/consumers; buffer bursts; enable retries (SQS/RabbitMQ).
Wide-Column Store
Sparse rows over many columns; time-series friendly (Cassandra, HBase, Bigtable).
Load Balancer
Distribute requests across healthy backends (L4 or L7).
Pub/Sub
Fan-out events to many subscribers; topic-based (Kafka, SNS, Redis pub/sub).
Sharding
Partition data across DB instances by key (hash, range, or geography).
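As a small illustration of the hash option, the sketch below maps a key to a shard with a stable digest. `shard_for` is a hypothetical helper; SHA-256 is used only because Python's built-in `hash()` for strings varies between processes.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key (e.g. a conversation_id) to a shard index in [0, num_shards)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Take the first 8 bytes as an unsigned integer, then reduce modulo the shard count.
    return int.from_bytes(digest[:8], "big") % num_shards
```

Plain modulo hashing remaps most keys when `num_shards` changes, which is why consistent hashing is a common alternative when shards are added or removed often.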
Object Storage (Blob)
Cheap durable storage for large immutable blobs (S3, GCS).