Design a Notification System

Send push, email, and SMS notifications to millions of users, with per-user preferences, deduplication, and retries.

Required building blocks
Message Queue
Pub/Sub
Rate Limiting
Key-Value Store
Circuit Breaker
Nice to have
Bulkhead
API Gateway
Canonical answer

Pub/sub fan-out by channel; per-channel worker pools (bulkhead); circuit breaker on external providers (APNs/FCM/SES/Twilio); dedup key in KV.

Capacity estimation
  • 200M users, ~10 notifications/user/day → 2B notifications/day ≈ 23K sends/sec, ~60K at peak.
  • Channel mix ~70% push, 20% email, 10% SMS → 16K push/sec, 4.6K email/sec, 2.3K SMS/sec.
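  • Per-provider quotas: APNs caps connections, Twilio rate-limits each sending number → one bulkhead worker pool plus one token bucket per provider.
  • Dedup window: keep 24h of (user_id, dedup_key) hashes → ~2B entries × 24 B ≈ 48 GB of raw payload; budget more in a Redis cluster, since per-key overhead comes on top.
  • Retry queue: assume 2% transient failure and up to 5 retries → at most 2% × 5 = 10% extra sends if every retry is used, ~2% if the first retry usually succeeds; budget ~5–10% queue headroom.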
  • User prefs store: 200M users × ~200 B prefs ≈ 40 GB in KV, sharded by user_id.
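The arithmetic behind these estimates can be reproduced in a few lines. This is only a back-of-envelope sketch: the inputs are taken from the list above, and the 2.6× peak factor is an assumption inferred from the "~60K at peak" figure.

```python
# Back-of-envelope check of the capacity estimates.
# PEAK_FACTOR is an assumption inferred from "~60K at peak"; everything else
# comes straight from the list above.
USERS = 200_000_000
NOTIFS_PER_USER_PER_DAY = 10
SECONDS_PER_DAY = 86_400
PEAK_FACTOR = 2.6

per_day = USERS * NOTIFS_PER_USER_PER_DAY      # 2B notifications/day
avg_per_sec = per_day / SECONDS_PER_DAY        # roughly 23K sends/sec
peak_per_sec = avg_per_sec * PEAK_FACTOR       # roughly 60K sends/sec

channel_mix = {"push": 0.70, "email": 0.20, "sms": 0.10}
per_channel = {ch: avg_per_sec * share for ch, share in channel_mix.items()}

dedup_bytes = per_day * 24                     # 24 B per dedup hash, 24h window
prefs_bytes = USERS * 200                      # ~200 B of prefs per user

print(f"avg {avg_per_sec:,.0f}/s, peak {peak_per_sec:,.0f}/s")
for channel, rate in per_channel.items():
    print(f"{channel}: {rate:,.0f}/s")
print(f"dedup {dedup_bytes / 1e9:.0f} GB, prefs {prefs_bytes / 1e9:.0f} GB")
```

The storage figures are raw payload sizes; real stores add per-key overhead, so treat them as lower bounds.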
Architecture
Producers (services) ─→ API Gateway
                          │
                          ▼
                  Notification Service
                          │
                          ▼
                   Kafka (topic-per-channel)
            ┌─────────────┼─────────────┐
            ▼             ▼             ▼
        Push Workers  Email Workers  SMS Workers
        (bulkhead)    (bulkhead)     (bulkhead)
            │             │             │
            ▼             ▼             ▼
        APNs/FCM ◀─circuit breaker─→ SES / Twilio
            │
            ▼
        Delivery Log (Cassandra) → Analytics
                 ▲
   Prefs / Dedup (Redis) ◀── workers check before send
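The "workers check before send" step in the diagram could look roughly like the sketch below, which covers only the preference side (channel opt-in, muted topics, quiet hours). The names `QuietHours`, `UserPrefs`, and `suppression_reason` are hypothetical, the caller is assumed to have already converted the current time into the user's time zone, and treating a missing channel entry as disabled is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import time


@dataclass
class QuietHours:
    start: time
    end: time

    def contains(self, local_now: time) -> bool:
        """True if local_now falls in [start, end), including windows that wrap midnight."""
        if self.start <= self.end:
            # Same-day window, e.g. 13:00-15:00.
            return self.start <= local_now < self.end
        # Overnight window, e.g. 22:00-07:00.
        return local_now >= self.start or local_now < self.end


@dataclass
class UserPrefs:
    channels: dict[str, bool] = field(default_factory=dict)
    quiet_hours: QuietHours | None = None
    topics_muted: list[str] = field(default_factory=list)


def suppression_reason(
    prefs: UserPrefs, channel: str, topic: str, local_now: time
) -> str | None:
    """Return why a send should be suppressed, or None if it may proceed."""
    # Assumption: a channel that is absent from the map counts as opted out.
    if not prefs.channels.get(channel, False):
        return "channel_disabled"
    if topic in prefs.topics_muted:
        return "topic_muted"
    if prefs.quiet_hours is not None and prefs.quiet_hours.contains(local_now):
        return "quiet_hours"
    return None
```

A worker would record a non-None result as status `suppressed` in the delivery log rather than calling the provider. Whether quiet hours should drop a notification or defer it until the window ends is a product decision this sketch leaves open.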
API
  • POST /notify { user_id, template_id, vars, channels[], dedup_key? } → { notif_id, status: queued }
  • GET /notifications/:id → { channel_attempts[], final_status }
  • PUT /users/:id/preferences { channels: { push: true, email: false }, quiet_hours } → 204
  • POST /devices/register { user_id, platform, token } → 204
  • Webhook: POST /providers/:name/callback (delivery receipts) → 200
Data model
notifications (Cassandra, partition = user_id, cluster = created_at DESC, notif_id):
  notif_id        : uuid (unique per notification)
  user_id         : uuid
  template_id     : string
  channel         : enum(push|email|sms)
  status          : enum(queued|sent|delivered|failed|suppressed)
  attempts        : int
  dedup_key       : string?
  created_at      : timestamp

user_prefs (KV, key = user_id):
  channels        : map<channel, bool>
  quiet_hours     : { start, end, tz }
  topics_muted    : list<string>

dedup (Redis, key = "dd:{user_id}:{dedup_key}", TTL 24h):
  value           : notif_id
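The dedup entry is essentially a "set if absent, with expiry" operation. Below is a minimal in-memory sketch of that behaviour, assuming the key format shown above; `DedupStore` and `claim` are hypothetical names, and the injectable clock exists only to make the TTL testable.

```python
import time
from typing import Callable


class DedupStore:
    """In-memory stand-in for a Redis-style "set if not exists, with TTL" check."""

    def __init__(
        self,
        ttl_seconds: float = 24 * 3600,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        # key -> (notif_id, expires_at)
        self._entries: dict[str, tuple[str, float]] = {}

    def claim(self, user_id: str, dedup_key: str, notif_id: str) -> bool:
        """Return True if this is the first send for the key within the TTL window."""
        key = f"dd:{user_id}:{dedup_key}"
        now = self._clock()
        existing = self._entries.get(key)
        if existing is not None and existing[1] > now:
            return False  # duplicate inside the window: suppress
        self._entries[key] = (notif_id, now + self._ttl)
        return True
```

Against a real Redis, the same check is a single atomic command; with redis-py it would be along the lines of `r.set(key, notif_id, nx=True, ex=86400)`, which returns a falsy value when the key already exists. Doing the check and the write in one command matters, because two workers racing on separate GET and SET calls could both decide to send.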
Concept blurbs
Message Queue
Decouple producers/consumers; buffer bursts; enable retries (SQS/RabbitMQ).
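As an illustration of the retry part, here is a minimal sketch of retrying a send with capped exponential backoff and full jitter. `TransientError` and `send_with_retries` are hypothetical names, and the defaults (5 retries, 1 s base, 60 s cap) are assumptions rather than values from any provider.

```python
import random
import time
from typing import Callable


class TransientError(Exception):
    """Raised by a sender for failures worth retrying (timeouts, 5xx, throttling)."""


def send_with_retries(
    send: Callable[[str], str],
    message: str,
    max_retries: int = 5,
    base: float = 1.0,
    cap: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> str:
    """Call send(message); on TransientError retry up to max_retries times."""
    attempt = 0
    while True:
        try:
            return send(message)
        except TransientError:
            if attempt >= max_retries:
                # Retries exhausted: let the caller park the message (e.g. dead-letter queue).
                raise
            # Full jitter: a random delay in [0, min(cap, base * 2**attempt)).
            sleep(rng() * min(cap, base * 2 ** attempt))
            attempt += 1
```

In a queue-backed worker the delay would more likely be expressed by re-enqueueing the message with a visibility or delivery delay instead of sleeping in the worker thread; the backoff schedule is the same either way.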
Pub/Sub
Fan-out events to many subscribers; topic-based (Kafka, SNS, Redis pub/sub).
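To make the fan-out concrete, the sketch below uses a tiny in-process broker as a stand-in for a real one. `Broker`, `fan_out`, and the `notif.<channel>` topic names are hypothetical; the request fields mirror the notify request used in this design.

```python
from collections import defaultdict
from typing import Callable

Handler = Callable[[dict], None]


class Broker:
    """Tiny in-process topic broker standing in for Kafka or SNS."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> int:
        """Deliver event to every subscriber of topic; return how many received it."""
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(event)
        return len(handlers)


def fan_out(broker: Broker, request: dict) -> None:
    """Publish one event per requested channel to that channel's topic."""
    for channel in request["channels"]:
        event = {key: request[key] for key in ("user_id", "template_id", "vars")}
        event["channel"] = channel
        broker.publish(f"notif.{channel}", event)
```

Because each channel has its own topic, a slow SMS consumer only builds lag on the SMS topic and does not hold back push or email.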
Rate Limiting
Throttle requests per user/IP with token bucket, leaky bucket, or sliding window.
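A token bucket is the variant this design leans on for per-provider quotas. The sketch below is a minimal single-threaded version; the class name and the injectable clock are illustrative choices, and a shared limiter across many workers would need the state in a central store.

```python
import time
from typing import Callable


class TokenBucket:
    """Allows bursts up to `capacity` and a sustained `rate` of tokens per second."""

    def __init__(
        self,
        rate: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity      # start full so an initial burst is allowed
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Spend `cost` tokens if available; return False when the caller should wait."""
        now = self._clock()
        # Refill for the elapsed time, never beyond capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

A worker would call `allow()` before each provider request and, on False, either wait briefly or leave the message on the queue for a later attempt.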
Key-Value Store
Average O(1) get/put by key; scales horizontally by sharding on the key (DynamoDB, Redis, Cassandra).
Circuit Breaker
Stop calling a failing dependency once errors cross a threshold; fail fast while the circuit is open, then probe with a trial call before resuming traffic.
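One common way to implement this is a small state machine with closed, open, and half-open states. The sketch below is a minimal, non-thread-safe version; the class names and the default threshold and timeout are assumptions, not values from any provider SDK.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(Exception):
    """Raised instead of calling the dependency while the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None   # None means the circuit is closed

    def call(self, fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_timeout:
                raise CircuitOpenError("circuit is open; failing fast")
            # Timeout elapsed: half-open, so this call goes through as a probe.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self._threshold:
                # Trip the breaker, or re-open it after a failed probe.
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Each external provider would get its own breaker instance, so an APNs outage trips only the APNs circuit. A production version would also need locking and would usually count only errors that indicate provider trouble, not caller mistakes such as invalid tokens.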
Bulkhead
Isolate resource pools so one failing dependency can't sink the whole service.
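A lightweight way to sketch this is one concurrency limit per channel that rejects work when its own slots are taken. `Bulkhead` and `BulkheadFullError` are hypothetical names, and the slot counts in the example are arbitrary.

```python
import threading
from typing import Any, Callable


class BulkheadFullError(Exception):
    """Raised when a pool has no free slot; the caller should retry later."""


class Bulkhead:
    """Caps concurrent calls for one dependency so it cannot starve the others."""

    def __init__(self, max_concurrent: int) -> None:
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        # Non-blocking acquire: reject immediately rather than queue behind a slow provider.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("no free slot in this pool")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()


# One independent pool per channel; the sizes here are illustrative only.
bulkheads = {"push": Bulkhead(200), "email": Bulkhead(50), "sms": Bulkhead(20)}
```

If the SMS provider hangs, at most 20 calls are stuck waiting on it in this example, and push and email keep their own slots. Separate thread pools or separate consumer groups per channel achieve the same isolation at a coarser grain.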
API Gateway
Single entry point: auth, rate limit, routing, transformation.