Design a Chat App (WhatsApp/Messenger)
Design 1:1 and group messaging with delivery receipts, online presence, and offline message storage.
Required building blocks
WebSockets
Message Queue
Wide-Column Store
Load Balancer
Pub/Sub
Nice to have
Sharding
Object Storage (Blob)
Canonical answer
One persistent WebSocket per client for real-time delivery; a message queue buffers messages for offline recipients; messages persist in a wide-column store partitioned by conversation_id and clustered by ts; media attachments go to object storage.
Capacity estimation
- 1B registered users, 500M DAU; avg 40 messages/user/day → 20B messages/day ≈ 230K writes/sec on average, ~600K/sec at peak.
- Persistent WS: 500M DAU, ~20% concurrent → 100M open sockets across ~50K gateway nodes (≈2K conns/node).
- Message storage: 20B × ~200 B ≈ 4 TB/day, ~1.5 PB/year in wide-column store.
- Media: 10% of messages carry an attachment averaging 500 KB → 2B × 500 KB ≈ 1 PB/day to object storage; CDN egress dominates bandwidth.
- Presence: 100M heartbeats every 30s → ~3.3M ops/sec into Redis presence cluster.
- Group fan-out: avg group size 20 → effective write amplification 1.05x (most msgs are 1:1).
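The figures above can be reproduced with a few lines of arithmetic. This is a rough check, not a sizing tool: the inputs are copied from the estimates, decimal units are assumed (1 TB = 10^12 bytes), and the variable names are illustrative.

```python
# Back-of-envelope check of the capacity estimates.
# Inputs come from the list above; decimal units are assumed (1 TB = 1e12 B).
SECONDS_PER_DAY = 86_400

dau = 500_000_000
msgs_per_user_per_day = 40
msgs_per_day = dau * msgs_per_user_per_day            # 20B messages/day
avg_writes_per_sec = msgs_per_day / SECONDS_PER_DAY   # ~231K/sec on average

concurrent_sockets = dau * 20 // 100                  # 20% of DAU online at once
gateway_nodes = 50_000
conns_per_node = concurrent_sockets // gateway_nodes  # connections per gateway

msg_bytes_per_day = msgs_per_day * 200                # ~200 B per message
msg_bytes_per_year = msg_bytes_per_day * 365

media_msgs_per_day = msgs_per_day * 10 // 100         # 10% carry media
media_bytes_per_day = media_msgs_per_day * 500_000    # ~500 KB each

heartbeats_per_sec = concurrent_sockets / 30          # one heartbeat per 30 s

if __name__ == "__main__":
    print(f"messages/day:       {msgs_per_day:,}")
    print(f"avg writes/sec:     {avg_writes_per_sec:,.0f}")
    print(f"conns per node:     {conns_per_node:,}")
    print(f"message TB/day:     {msg_bytes_per_day / 1e12:.1f}")
    print(f"message PB/year:    {msg_bytes_per_year / 1e15:.2f}")
    print(f"media PB/day:       {media_bytes_per_day / 1e15:.1f}")
    print(f"heartbeats/sec:     {heartbeats_per_sec:,.0f}")
```

Changing any single input (for example the concurrency ratio or the heartbeat interval) shows quickly which assumption drives each number.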
Architecture
Mobile / Desktop
  │ (WSS)
  ▼
Edge LB / TLS terminator
  │
  ▼
WS Gateway Cluster ◀──→ Redis Presence (user_id → node_id)
  │
  ▼
Chat Service ──→ Kafka (partitioned by conversation_id)
  │                │
  │                ▼
  │              Persist Worker ─→ Cassandra (messages)
  │                │
  │                └─→ Push Worker ─→ APNs / FCM (offline)
  ▼
Media Service ─→ S3 / Object Store ─→ CDN
API
- WS /connect (auth header) — bidirectional message + receipt frames
- POST /messages { conversation_id, body, client_msg_id } → { message_id, ts }
- GET /conversations/:id/messages?before=ts → { messages[] }
- POST /media (multipart) → { media_id, url } (then attach to message)
- POST /receipts { message_id, status: delivered|read } → 204
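One plausible reading of the API is that client_msg_id lets the server deduplicate retried sends, and that receipt status only moves forward (sent, then delivered, then read). The source does not state either rule, so both are assumptions here. The sketch below uses an in-memory dict in place of a real store, and the names `MessageApi`, `SendResult`, and `apply_receipt` are hypothetical.

```python
import time
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SendResult:
    message_id: str
    ts: float


@dataclass
class MessageApi:
    # (sender_id, client_msg_id) -> result of the first accepted send.
    # Assumption: client_msg_id is unique per sender, not globally.
    _seen: dict = field(default_factory=dict)

    def post_message(self, sender_id, conversation_id, body, client_msg_id):
        """Handle POST /messages; a retry with the same client_msg_id
        returns the original result instead of creating a duplicate."""
        key = (sender_id, client_msg_id)
        if key in self._seen:
            return self._seen[key]
        result = SendResult(message_id=str(uuid.uuid4()), ts=time.time())
        # A real service would enqueue (conversation_id, body) here.
        self._seen[key] = result
        return result


# Assumed ordering of receipt states; later states never regress.
STATUS_ORDER = {"sent": 0, "delivered": 1, "read": 2}


def apply_receipt(current, incoming):
    """Return the new status for POST /receipts, ignoring stale receipts."""
    if STATUS_ORDER[incoming] > STATUS_ORDER[current]:
        return incoming
    return current
```

With this shape, a client that times out on a send can safely retry with the same client_msg_id, and a late "delivered" receipt cannot overwrite a "read" status.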
Data model
messages (wide-column, partition = conversation_id, cluster = ts):
conversation_id (PK) : uuid
ts : timeuuid
sender_id : uuid
body : string
media_id : uuid?
status : enum(sent|delivered|read)
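To illustrate why this layout suits GET /conversations/:id/messages?before=ts, here is a small in-memory model of the table: each conversation_id maps to one partition whose rows stay sorted by ts, so a history page is a contiguous slice. This is a sketch only. `MessageStore` is a hypothetical name, plain integers stand in for timeuuid values, and a real wide-column store performs the equivalent range scan on disk.

```python
import bisect
from collections import defaultdict


class MessageStore:
    """Toy model of a table partitioned by conversation_id, clustered by ts."""

    def __init__(self):
        # conversation_id -> list of (ts, sender_id, body), kept sorted by ts
        self._partitions = defaultdict(list)

    def append(self, conversation_id, ts, sender_id, body):
        # insort keeps the partition ordered even if writes arrive out of order
        bisect.insort(self._partitions[conversation_id], (ts, sender_id, body))

    def page_before(self, conversation_id, before_ts, limit=50):
        """Return up to `limit` rows with ts < before_ts, newest first."""
        rows = self._partitions.get(conversation_id, [])
        # (before_ts,) sorts before any row whose ts equals before_ts,
        # so `end` is the index of the first row with ts >= before_ts.
        end = bisect.bisect_left(rows, (before_ts,))
        start = max(0, end - limit)
        return list(reversed(rows[start:end]))
```

A client scrolling back through history would pass the oldest ts it already holds as `before_ts` on each request.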
conversations (KV):
conversation_id (PK) : uuid
type : enum(direct|group)
member_ids : list<uuid>
last_msg_ts : timestamp
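A delivery path could combine member_ids with the presence map from the architecture (user_id → node_id): members with a live presence entry are grouped by gateway node, and the rest are handed to the push worker. The sketch below is one possible shape under those assumptions. `Presence` and `plan_fanout` are hypothetical names, a dict stands in for Redis, and expiry is checked at lookup time rather than by Redis itself.

```python
import time


class Presence:
    """In-memory stand-in for the presence map with a TTL per entry."""

    def __init__(self, ttl=60.0, clock=time.monotonic):
        self._ttl = ttl
        self._clock = clock
        self._entries = {}  # user_id -> (node_id, last_seen)

    def heartbeat(self, user_id, node_id):
        # Each heartbeat refreshes the entry, like resetting a key's TTL.
        self._entries[user_id] = (node_id, self._clock())

    def lookup(self, user_id):
        """Return the gateway node for an online user, or None if offline."""
        entry = self._entries.get(user_id)
        if entry is None:
            return None
        node_id, last_seen = entry
        if self._clock() - last_seen > self._ttl:
            del self._entries[user_id]  # entry expired
            return None
        return node_id


def plan_fanout(member_ids, sender_id, presence):
    """Split recipients into per-node online batches and an offline list."""
    online = {}   # node_id -> [user_id, ...] to push over WebSocket
    offline = []  # user_ids to notify via APNs / FCM
    for user_id in member_ids:
        if user_id == sender_id:
            continue  # the sender already has the message
        node_id = presence.lookup(user_id)
        if node_id is None:
            offline.append(user_id)
        else:
            online.setdefault(node_id, []).append(user_id)
    return online, offline
```

Grouping by node means each gateway receives one batch per message rather than one request per recipient, which matters most for large groups.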
presence (Redis, TTL 60s):
user_id : string
node_id : string
last_seen : timestamp
Concept blurbs
WebSockets
Persistent bidirectional connection for low-latency push to clients.
Message Queue
Decouple producers/consumers; buffer bursts; enable retries (SQS/RabbitMQ).
Wide-Column Store
Sparse rows over many columns; time-series friendly (Cassandra, HBase, Bigtable).
Load Balancer
Distribute requests across healthy backends (L4 or L7).
Pub/Sub
Fan-out events to many subscribers; topic-based (Kafka, SNS, Redis pub/sub).
Sharding
Partition data across DB instances by key (hash, range, or geography).
Object Storage (Blob)
Cheap durable storage for large immutable blobs (S3, GCS).