Design a Notification System
Send push, email, and SMS notifications to millions with per-user preferences, deduplication, and retries.
Required building blocks
Message Queue
Pub/Sub
Rate Limiting
Key-Value Store
Circuit Breaker
Nice to have
Bulkhead
API Gateway
Canonical answer
Pub/sub fan-out by channel; per-channel worker pools (bulkhead); circuit breaker on external providers (APNs/FCM/Twilio); dedup key in KV.
Capacity estimation
- 200M users, ~10 notifications/user/day → 2B notifications/day ≈ 23K send/sec, ~60K at peak.
- Channel mix ~70% push, 20% email, 10% SMS → 16K push/sec, 4.6K email/sec, 2.3K SMS/sec.
- Per-provider quotas: APNs caps connections, Twilio has per-number rps → bulkhead worker pools + token-bucket per provider.
- Dedup window: keep 24h of (user_id, notif_key) hashes → ~2B entries × 24 B ≈ 50 GB in Redis cluster.
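- Retry queue: assume 2% transient failure × up to 5 retries → at most ~10% extra sends, closer to ~2% when most retries succeed on the first attempt; budget ~5% queue amplification headroom.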
- User prefs store: 200M users × ~200 B prefs ≈ 40 GB in KV, sharded by user_id.
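The estimates above can be reproduced with a few lines of arithmetic. This is only a back-of-envelope check: the variable names are illustrative, and the storage figures count raw payload bytes (real Redis per-key overhead would be higher).

```python
# Back-of-envelope check of the capacity figures above.
USERS = 200_000_000
NOTIFS_PER_USER_PER_DAY = 10
SECONDS_PER_DAY = 86_400

per_day = USERS * NOTIFS_PER_USER_PER_DAY      # 2,000,000,000 notifications/day
avg_per_sec = per_day / SECONDS_PER_DAY        # roughly 23K sends/sec on average

# Channel mix from the estimate: 70% push, 20% email, 10% SMS.
mix = {"push": 0.70, "email": 0.20, "sms": 0.10}
per_channel = {ch: avg_per_sec * share for ch, share in mix.items()}

# Dedup set: one 24-byte entry per notification kept for 24h.
dedup_gb = per_day * 24 / 1e9

# Prefs store: ~200 bytes per user.
prefs_gb = USERS * 200 / 1e9

# Retries: 2% of sends fail. If every failed send used all 5 retries,
# that is the upper bound on extra traffic; the ~5% headroom above
# assumes most retries succeed early.
worst_case_retry_overhead = 0.02 * 5

if __name__ == "__main__":
    print(f"avg sends/sec: {avg_per_sec:,.0f}")
    for ch, rate in per_channel.items():
        print(f"  {ch}: {rate:,.0f}/sec")
    print(f"dedup: {dedup_gb:.0f} GB, prefs: {prefs_gb:.0f} GB")
    print(f"retry overhead upper bound: {worst_case_retry_overhead:.0%}")
```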
Architecture
Producers (services) ─→ API Gateway
                             │
                             ▼
                   Notification Service
                             │
                             ▼
                 Kafka (topic-per-channel)
               ┌─────────────┼─────────────┐
               ▼             ▼             ▼
         Push Workers  Email Workers  SMS Workers
          (bulkhead)    (bulkhead)    (bulkhead)
               │             │             │
               ▼             ▼             ▼
           APNs/FCM ◀─circuit breaker─→ SES / Twilio
                             │
                             ▼
                 Delivery Log (Cassandra) → Analytics
                             ▲
         Prefs / Dedup (Redis) ◀── workers check before send
API
- POST /notify { user_id, template_id, vars, channels[], dedup_key? } → { notif_id, status: queued }
- GET /notifications/:id → { channel_attempts[], final_status }
- PUT /users/:id/preferences { channels: { push: true, email: false }, quiet_hours } → 204
- POST /devices/register { user_id, platform, token } → 204
- Webhook: POST /providers/:name/callback (delivery receipts) → 200
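A minimal sketch of how the channels[] in a /notify request might be filtered against the stored preferences before anything is queued. Several things here are assumptions, not part of the API above: the names QuietHours, UserPrefs, and allowed_channels are hypothetical, channels missing from the prefs map are treated as opted out, quiet hours suppress every channel, and the caller is expected to convert "now" into the user's tz first.

```python
from dataclasses import dataclass, field
from datetime import time
from typing import Optional


@dataclass
class QuietHours:
    start: time  # local time, e.g. 22:00
    end: time    # local time, e.g. 07:00; the window may wrap past midnight


@dataclass
class UserPrefs:
    channels: dict = field(default_factory=dict)  # e.g. {"push": True, "email": False}
    quiet_hours: Optional[QuietHours] = None


def in_quiet_hours(local_now: time, qh: QuietHours) -> bool:
    """True if local_now falls in [start, end), handling windows that wrap midnight."""
    if qh.start <= qh.end:
        return qh.start <= local_now < qh.end
    return local_now >= qh.start or local_now < qh.end


def allowed_channels(requested: list, prefs: UserPrefs, local_now: time) -> list:
    """Channels from the request that the user's preferences permit right now."""
    # Assumption: quiet hours suppress all channels; a real policy might defer instead.
    if prefs.quiet_hours is not None and in_quiet_hours(local_now, prefs.quiet_hours):
        return []
    # Assumption: a channel absent from the prefs map counts as opted out.
    return [ch for ch in requested if prefs.channels.get(ch, False)]
```

With prefs of { push: true, email: false } and quiet hours 22:00 to 07:00, a request for push, email, and sms at noon would keep only push, and the same request at 23:30 would be suppressed entirely.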
Data model
notifications (Cassandra, partition key = user_id, clustering = created_at DESC, notif_id):
  notif_id    : uuid
  user_id     : uuid
  template_id : string
  channel     : enum(push|email|sms)
  status      : enum(queued|sent|delivered|failed|suppressed)
  attempts    : int
  dedup_key   : string?
  created_at  : timestamp
user_prefs (KV, key = user_id):
  channels     : map<channel, bool>
  quiet_hours  : { start, end, tz }
  topics_muted : list<string>
dedup (Redis, key = "dd:{user_id}:{dedup_key}", TTL 24h):
  value : notif_id
Concept blurbs
Message Queue
Decouple producers/consumers; buffer bursts; enable retries (SQS/RabbitMQ).
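To illustrate the retry side, here is a rough sketch of a delay queue with exponential backoff and a dead-letter list. It runs in memory with simulated time; the 1-second base delay and the names backoff_delay and drain are assumptions, and the cap of 5 retries is taken from the capacity estimate. A real deployment would lean on the queue's own delay or redelivery features.

```python
import heapq
from typing import Callable

MAX_RETRIES = 5       # retry budget from the capacity estimate
BASE_DELAY_S = 1.0    # assumption: first retry after 1s, doubling each time


def backoff_delay(attempt: int) -> float:
    """Delay before retry number `attempt` (1-based): 1s, 2s, 4s, ..."""
    return BASE_DELAY_S * 2 ** (attempt - 1)


def drain(messages: list, send: Callable[[str], bool]) -> tuple:
    """Process messages with retries; returns (delivered, dead_lettered)."""
    # Heap entries are (ready_at, seq, attempt, message); seq breaks ties stably.
    heap = [(0.0, i, 0, m) for i, m in enumerate(messages)]
    heapq.heapify(heap)
    seq = len(messages)
    delivered, dead = [], []
    while heap:
        ready_at, _, attempt, msg = heapq.heappop(heap)
        # A real worker would wait until ready_at; here time is only simulated.
        if send(msg):
            delivered.append(msg)
        elif attempt < MAX_RETRIES:
            nxt = attempt + 1
            heapq.heappush(heap, (ready_at + backoff_delay(nxt), seq, nxt, msg))
            seq += 1
        else:
            dead.append(msg)  # retries exhausted: park for inspection
    return delivered, dead
```

Messages that keep failing end up in the dead-letter list after one initial attempt plus five retries, so a persistent provider error cannot loop forever.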
Pub/Sub
Fan-out events to many subscribers; topic-based (Kafka, SNS, Redis pub/sub).
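A small in-memory sketch of topic-per-channel fan-out, in the spirit of the Kafka layout in the architecture. InMemoryPubSub, fan_out, and the "notif.<channel>" topic naming are hypothetical stand-ins; the real broker, topic names, and message schema are not specified here.

```python
from collections import defaultdict
from typing import Callable


class InMemoryPubSub:
    """Toy broker: each topic has a list of subscriber callbacks."""

    def __init__(self) -> None:
        self._subs = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self._subs[topic].append(handler)

    def publish(self, topic: str, message: dict) -> int:
        """Deliver to every subscriber of the topic; returns how many received it."""
        handlers = self._subs.get(topic, [])
        for handler in handlers:
            handler(message)
        return len(handlers)


def fan_out(bus: InMemoryPubSub, request: dict) -> None:
    """Publish one message per requested channel to that channel's topic."""
    for channel in request["channels"]:
        bus.publish(
            f"notif.{channel}",  # assumed topic naming scheme
            {
                "user_id": request["user_id"],
                "template_id": request["template_id"],
                "channel": channel,
            },
        )
```

Each worker pool subscribes only to its own topic, so a request for push and email produces exactly one message on each of those two topics and nothing on the SMS topic.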
Rate Limiting
Throttle requests per user/IP with token bucket, leaky bucket, or sliding window.
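As one concrete instance, a token bucket can be sketched as below. The clock is passed in explicitly to keep the example deterministic; the class name and parameters are illustrative, and a production limiter would typically read a monotonic clock and keep state somewhere shared.

```python
class TokenBucket:
    """Token bucket: refills at `rate` tokens/sec up to `capacity`."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full so an initial burst is allowed
        self.last = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        """Refill for the elapsed time, then spend `cost` tokens if available."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

For per-provider limits, one bucket per provider (or per sender number) would sit in front of the send call, with rate and capacity set from that provider's quota.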
Key-Value Store
O(1) get/put by key; massively scalable (DynamoDB, Redis, Cassandra).
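The dedup check in this design is a set-if-absent with a TTL on the "dd:{user_id}:{dedup_key}" key. The sketch below uses an in-memory stand-in so it runs anywhere; TtlStore and should_send are hypothetical names. Against Redis the same step would be a single SET with the NX and EX options.

```python
DEDUP_TTL_S = 24 * 3600  # 24h dedup window from the data model


class TtlStore:
    """In-memory stand-in for a KV store with set-if-absent and expiry."""

    def __init__(self) -> None:
        self._data = {}  # key -> (value, expires_at)

    def set_if_absent(self, key: str, value: str, ttl_s: float, now: float) -> bool:
        """Store the value only if the key is missing or expired; True if stored."""
        entry = self._data.get(key)
        if entry is not None and entry[1] > now:
            return False
        self._data[key] = (value, now + ttl_s)
        return True


def dedup_storage_key(user_id: str, dedup_key: str) -> str:
    return f"dd:{user_id}:{dedup_key}"


def should_send(store: TtlStore, user_id: str, dedup_key: str, notif_id: str, now: float) -> bool:
    """True the first time a (user_id, dedup_key) pair is seen within the window."""
    return store.set_if_absent(dedup_storage_key(user_id, dedup_key), notif_id, DEDUP_TTL_S, now)
```

Doing the check and the write as one atomic operation matters: a separate get followed by a put would let two workers both decide to send the same notification.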
Circuit Breaker
Stop calling failing dependencies; fail fast and recover gracefully.
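A minimal circuit breaker sketch with the usual closed, open, and half-open states. The thresholds, the class name, and the explicit `now` argument (used to keep the example deterministic) are all assumptions; libraries differ in how they count failures and probe for recovery.

```python
from typing import Callable, Optional


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout_s: float = 30.0) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.failures = 0
        self.opened_at: Optional[float] = None

    def state(self, now: float) -> str:
        if self.opened_at is None:
            return "closed"
        if now - self.opened_at >= self.reset_timeout_s:
            return "half_open"  # timeout elapsed: allow one trial call
        return "open"

    def call(self, fn: Callable[[], object], now: float) -> object:
        if self.state(now) == "open":
            raise CircuitOpenError("provider circuit is open; failing fast")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            # Open on reaching the threshold, or re-open if a half-open trial fails.
            if self.failures >= self.failure_threshold or self.opened_at is not None:
                self.opened_at = now
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

In this design there would be one breaker per external provider, so an APNs outage trips only the APNs breaker while email and SMS keep flowing.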
Bulkhead
Isolate resource pools so one failing dependency can't sink the whole service.
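One simple way to sketch a bulkhead is a bounded semaphore per channel, so each channel has its own concurrency budget. The Bulkhead class, the reject-when-full policy, and the pool sizes are assumptions for illustration; separate thread pools or consumer groups per channel achieve the same isolation.

```python
import threading
from contextlib import contextmanager


class BulkheadFullError(Exception):
    """Raised when a channel's concurrency budget is exhausted."""


class Bulkhead:
    def __init__(self, name: str, max_concurrent: int) -> None:
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    @contextmanager
    def slot(self):
        """Hold one slot for the duration of a send; reject instead of waiting."""
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(self.name)
        try:
            yield
        finally:
            self._slots.release()


# Assumed pool sizes; real values would follow each provider's quotas.
bulkheads = {
    "push": Bulkhead("push", max_concurrent=2),
    "email": Bulkhead("email", max_concurrent=1),
    "sms": Bulkhead("sms", max_concurrent=1),
}
```

A worker would wrap each provider call in `with bulkheads[channel].slot():`. If the email provider hangs and its slots fill up, email sends are rejected or requeued while push and SMS slots stay available.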
API Gateway
Single entry point: auth, rate limit, routing, transformation.