Design Dropbox / File Sync

Multi-device file sync with conflict resolution, deduplication, and resumable uploads.

Required building blocks
  • Object Storage (Blob)
  • Relational DB (SQL)
  • Message Queue
  • Change Data Capture (CDC)
Nice to have
  • WebSockets
  • Circuit Breaker
Canonical answer

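Split each file into ~4 MB blocks and deduplicate them by SHA-256 hash. Keep file metadata in SQL. Notify other devices of changes over WebSockets, falling back to long-polling. Make uploads resumable with chunk-level acks, so an interrupted upload resends only the unacknowledged chunks.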
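
The chunk-and-dedup step can be sketched as follows. This is a minimal illustration, not a production chunker: it assumes fixed-size chunks, holds the whole file in memory, and uses a plain dict as a stand-in for the block store. The names `iter_chunks` and `store_file` are hypothetical.

```python
import hashlib
from typing import Iterator

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB, matching the ~4 MB block size in the notes


def iter_chunks(data: bytes, chunk_size: int = CHUNK_SIZE) -> Iterator[tuple[str, bytes]]:
    """Yield (sha256_hex, chunk) pairs for fixed-size chunks of data."""
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        yield hashlib.sha256(chunk).hexdigest(), chunk


def store_file(data: bytes, block_store: dict[str, bytes],
               chunk_size: int = CHUNK_SIZE) -> list[str]:
    """Store only unseen chunks and return the ordered block list (the manifest)."""
    manifest = []
    for digest, chunk in iter_chunks(data, chunk_size):
        if digest not in block_store:  # dedup: identical content is stored once
            block_store[digest] = chunk
        manifest.append(digest)
    return manifest
```

A file is then fully described by its ordered list of hashes, which is what the commit call sends as `block_list`.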

Capacity estimation
  • 500M users, 100M DAU; avg 10 file changes/day → 1B changes/day ≈ 12K writes/sec.
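  • Chunk size 4 MB; typical file 1 MB-100 MB. Assuming ~5 changed chunks per change → ~5B chunks/day; ~30% are deduplicated → ~3.5B unique chunks/day.
  • Storage growth: 3.5B × 4 MB ≈ 14 PB/day raw (an upper bound that treats every chunk as a full 4 MB); with EC + cross-region replica (~1.8× overhead) ≈ 25 PB/day to object store.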
  • Metadata DB: ~100B files lifetime × 500 B ≈ 50 TB in sharded SQL (by user_id).
  • Sync notification: 100M DAU × ~20% online → 20M WS connections across ~10K gateway nodes (≈ 2K connections per node).
  • CDN download path saves ~60% of egress for shared/popular files.
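
The arithmetic behind these estimates can be checked with a short script. All inputs come from the bullets above, except `chunks_per_change = 5`, which is an assumption implied by the ~5B chunks/day figure, and the decimal units (1 MB = 10^6 bytes), which are chosen here for simplicity.

```python
SECONDS_PER_DAY = 86_400
MB, TB, PB = 10**6, 10**12, 10**15  # decimal units (assumption)

dau = 100_000_000
changes_per_day = dau * 10                           # 10 file changes per user per day
writes_per_sec = changes_per_day / SECONDS_PER_DAY   # roughly 12K writes/sec

chunks_per_change = 5                                # assumption implied by ~5B chunks/day
chunks_per_day = changes_per_day * chunks_per_change
unique_chunks_per_day = chunks_per_day * 70 // 100   # ~30% removed by dedup

raw_pb_per_day = unique_chunks_per_day * 4 * MB / PB  # upper bound: every chunk is 4 MB
replication_overhead = 25 / raw_pb_per_day            # factor implied by the 25 PB/day figure

metadata_tb = 100_000_000_000 * 500 / TB              # 100B files at 500 B each
ws_connections = dau * 20 // 100                      # ~20% of DAU online
conns_per_gateway = ws_connections / 10_000

print(f"writes/sec        ~ {writes_per_sec:,.0f}")
print(f"unique chunks/day = {unique_chunks_per_day:,}")
print(f"raw growth        = {raw_pb_per_day:.1f} PB/day")
print(f"overhead factor   ~ {replication_overhead:.2f}x")
print(f"metadata          = {metadata_tb:.0f} TB")
print(f"conns per gateway = {conns_per_gateway:,.0f}")
```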
Architecture
Desktop / Mobile Client
        │
        ├─→ Chunker (4MB blocks, SHA-256)
        │
        ▼
   Block Service ──→ Object Storage (S3, dedup by hash)
        │
        ▼
   Metadata Service ──→ SQL (sharded by user_id)
        │
        ▼
   Change Log (CDC / Kafka)
        │
        ▼
   Notification Service ──WS──→ Other devices
        │
        ▼
   Conflict Resolver (vector clocks / "rename loser")
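
One way the Conflict Resolver step could work is sketched below. The notes name only the two techniques; the policy here is an assumption: an incoming edit that strictly follows the server's version is accepted, a concurrent edit is kept as a renamed "conflicted copy" (the "rename loser" rule, treating the later arrival as the loser), and a stale or duplicate edit is rejected. The functions `compare`, `conflicted_copy_path`, and `resolve` are hypothetical names.

```python
from pathlib import PurePosixPath

VectorClock = dict[str, int]  # device_id -> count of edits seen from that device


def compare(a: VectorClock, b: VectorClock) -> str:
    """Return 'after', 'before', 'equal', or 'concurrent' for a relative to b."""
    devices = a.keys() | b.keys()
    a_ahead = any(a.get(d, 0) > b.get(d, 0) for d in devices)
    b_ahead = any(b.get(d, 0) > a.get(d, 0) for d in devices)
    if a_ahead and b_ahead:
        return "concurrent"
    if a_ahead:
        return "after"
    if b_ahead:
        return "before"
    return "equal"


def conflicted_copy_path(path: str, device: str) -> str:
    """Insert a conflict marker before the file extension."""
    p = PurePosixPath(path)
    return str(p.with_name(f"{p.stem} (conflicted copy, {device}){p.suffix}"))


def resolve(path: str, server_clock: VectorClock,
            incoming_clock: VectorClock, device: str) -> tuple[str, str]:
    """Decide what to do with an incoming edit; returns (action, path_to_write)."""
    order = compare(incoming_clock, server_clock)
    if order == "after":
        return ("accept", path)  # incoming edit saw the server's version
    if order == "concurrent":
        return ("rename", conflicted_copy_path(path, device))  # keep both versions
    return ("reject", path)  # stale or duplicate edit
```

Renaming rather than merging keeps the resolver content-agnostic: no edit is silently lost, and the user decides which copy to keep.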
API
  • POST /blocks { sha256, size } → { upload_url } | { already_exists: true }
  • PUT /blocks/:sha256 (binary) → 200
  • POST /files/commit { path, version, block_list[sha256...] } → { file_id, rev }
  • GET /files/:id?rev=… → { manifest, block_urls[] }
  • GET /changes?cursor=… → { events[], next_cursor }
  • WS /notify — pushes { file_id, rev, actor } on change
Data model
files (SQL, sharded by user_id):
  file_id (PK)    : uuid
  user_id         : uuid
  path            : string
  current_rev     : int
  size            : bigint
  modified_at     : timestamp
  deleted         : bool

file_revisions (SQL):
  file_id (FK)    : uuid
  rev             : int
  block_list      : json array of sha256
  author_device   : uuid
  ts              : timestamp

blocks (KV / object store):
  sha256 (PK)     : string
  size            : int
  refcount        : int       (for GC)
  s3_key          : string
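
A commit against this model can be made safe under concurrent writers with a compare-and-set on `current_rev`. The sketch below uses SQLite purely for illustration. The composite primary key on `file_revisions`, the text-typed ids and timestamps, and the names `open_db` and `commit_revision` are assumptions, not part of the model above.

```python
import json
import sqlite3


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with the files and file_revisions tables."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE files (
            file_id     TEXT PRIMARY KEY,
            user_id     TEXT NOT NULL,
            path        TEXT NOT NULL,
            current_rev INTEGER NOT NULL,
            size        INTEGER NOT NULL,
            modified_at TEXT NOT NULL,
            deleted     INTEGER NOT NULL DEFAULT 0
        );
        CREATE TABLE file_revisions (
            file_id       TEXT NOT NULL REFERENCES files(file_id),
            rev           INTEGER NOT NULL,
            block_list    TEXT NOT NULL,   -- JSON array of sha256 hashes
            author_device TEXT NOT NULL,
            ts            TEXT NOT NULL,
            PRIMARY KEY (file_id, rev)     -- assumption: one row per revision
        );
    """)
    return db


def commit_revision(db, file_id, base_rev, block_list, size, device, ts):
    """Advance current_rev only if it still equals base_rev.

    Returns the new rev, or None when another device committed first.
    """
    with db:  # both statements run in one transaction
        cur = db.execute(
            "UPDATE files SET current_rev = current_rev + 1, size = ?, modified_at = ? "
            "WHERE file_id = ? AND current_rev = ?",
            (size, ts, file_id, base_rev),
        )
        if cur.rowcount != 1:
            return None  # stale base_rev: hand off to conflict resolution
        new_rev = base_rev + 1
        db.execute(
            "INSERT INTO file_revisions VALUES (?, ?, ?, ?, ?)",
            (file_id, new_rev, json.dumps(block_list), device, ts),
        )
        return new_rev
```

A `None` result tells the caller its base revision is out of date, which is the point where the conflict resolver takes over.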
Concept blurbs
Object Storage (Blob)
Cheap durable storage for large immutable blobs (S3, GCS).
Relational DB (SQL)
ACID, joins, secondary indexes; vertical scale + read replicas.
Message Queue
Decouple producers/consumers; buffer bursts; enable retries (SQS/RabbitMQ).
Change Data Capture (CDC)
Stream DB row changes (Debezium → Kafka) to downstream indexes/caches.
WebSockets
Persistent bidirectional connection for low-latency push to clients.
Circuit Breaker
Stop calling failing dependencies; fail fast and recover gracefully.
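
As an illustration of the circuit breaker blurb, here is a minimal sketch. The thresholds, the single-trial half-open behavior, and the names `CircuitBreaker` and `CircuitOpenError` are assumptions. A production breaker would also need thread safety and metrics.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout   # seconds to wait before a trial call
        self.clock = clock                   # injectable so tests can control time
        self.failures = 0
        self.opened_at = None                # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open: failing fast")
            # Timeout elapsed: half-open, so let this one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            # Open on reaching the threshold, or re-open after a failed trial call.
            if self.failures >= self.failure_threshold or self.opened_at is not None:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

In this design a client would wrap calls to the Block Service or Metadata Service, so that a struggling dependency is not flooded with retries while it recovers.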