Techniques
Production idioms — the smart tricks you weave into algorithms and systems. Exponential backoff, idempotency keys, reservoir sampling, Snowflake IDs, leases, and more.
Grouped by category · 27 entries
Reliability
6Exponential Backoff + Jitter
Grow retry intervals exponentially and add randomization so clients don't synchronize their retries.
Calling any unreliable downstream (network, third-party API, flaky DB). Without jitter, a thundering-herd of clients all retry at the same moment after an outage.
AWS SDK clients retry with full jitter: sleep = random(0, base × 2^attempt). Both Google and Stripe SDKs follow the same pattern.
def retry(fn, max_attempts=5, base_ms=100, cap_ms=10_000):
for attempt in range(max_attempts):
try:
return fn()
except TransientError:
if attempt == max_attempts - 1:
raise
backoff = min(cap_ms, base_ms * 2 ** attempt)
jitter = random.uniform(0, backoff)
time.sleep(jitter / 1000)Idempotency Keys
Caller sends a unique key with each request; server dedupes on it so retries don't double-charge.
Any non-idempotent mutation that might be retried: payments, signups, order creation. Network errors mid-flight are normal; you don't know whether the server committed.
Stripe's `Idempotency-Key` header. Server stores key → response for 24h; second request with the same key returns the cached response instead of re-running the work.
Hedged Requests
If the first request hasn't returned by some threshold (p95), fire a second one and take whichever finishes first.
Tail-latency-sensitive reads where the long tail is wasted (e.g., a single slow replica). Costs ~2x reads on the tail but slashes p99 latency.
Google's BigTable + Spanner read clients hedge to a second replica after ~95th percentile of recent latency. Used internally by search to keep query latency consistent.
Deadline / Retry Budget
Bound the total time spent retrying, not the number of attempts. Each retry costs from the same budget.
Any call inside a larger request that itself has a timeout. Counting attempts is the wrong unit: 5 retries with backoff can blow past your caller's 200ms SLA.
gRPC's context deadlines. Each downstream RPC inherits a deadline; retries must finish before it expires. Envoy and Istio implement the same with a `x-envoy-upstream-rq-timeout-ms` header.
Circuit Breaker
Track failures from a dependency; once they exceed a threshold, open the circuit and fail fast for a cool-down period.
Calling a dependency that can degrade your own health if you keep retrying. Without a breaker, your threads pile up on a slow call and the whole service stops responding.
Netflix Hystrix popularized the pattern (now Resilience4j in the JVM world). Envoy / Istio have it built into the service mesh. Pattern: closed → open after N failures → half-open after timeout → closed if probe succeeds.
state = "CLOSED"; failures = 0
def call(fn):
if state == "OPEN" and now() < open_until: raise Tripped()
if state == "OPEN": state = "HALF_OPEN"
try:
result = fn()
failures = 0; state = "CLOSED"
return result
except:
failures += 1
if failures >= THRESHOLD:
state = "OPEN"; open_until = now() + COOLDOWN
raiseGraceful Degradation
When a non-critical dependency fails, return a useful partial response instead of erroring the whole request.
Composite responses where some fields are nice-to-have. Recommendations down → show the page without them. Search facets down → return results, drop facets.
Amazon's product page shows the product even if reviews, recommendations, or 'also bought' services time out. Each component has a fallback (empty list, cached snapshot, or skeleton).
Approximation
4Bloom Filter
Probabilistic set membership in tiny memory. No false negatives; tunable false-positive rate.
When 'might be there' is good enough and you want to short-circuit an expensive lookup. Reading a row that doesn't exist is wasted I/O; a bloom filter says 'definitely not, skip it.'
Cassandra puts a bloom filter in front of every SSTable to skip reads that would miss. Chrome historically used one for the malicious-URL list (millions of URLs, ~2MB).
HyperLogLog
Estimate cardinality (unique count) of a massive set using a few KB of memory and a small error margin.
Counting unique visitors, distinct query terms, or unique IPs across billions of events. Exact counts require O(n) memory; HLL gets you within ~1% in O(log log n).
Redis's `PFADD` / `PFCOUNT` commands. Used by ad networks for daily unique reach, by BigQuery for `APPROX_COUNT_DISTINCT`.
Count-Min Sketch
Estimate frequency of items in a stream using sublinear memory; never underestimates.
Heavy-hitter detection in streams: 'which user is making the most requests right now?' Real exact counters would explode memory at internet scale.
Used in Cloudflare's DDoS detection, in Twitter's heavy-hitter analytics, and as a building block for top-K queries on streams.
Reservoir Sampling
Sample k items uniformly at random from a stream of unknown (and unboundedly large) length.
When the dataset is too large or unbounded to hold in memory, and you need a representative random subsample. Logs sampling, A/B testing of streaming events, telemetry.
Distributed tracers (Jaeger, OpenTelemetry) use reservoir sampling to keep a representative slice of traces under a fixed memory budget.
def reservoir(stream, k):
sample = []
for i, item in enumerate(stream):
if i < k:
sample.append(item)
else:
j = random.randint(0, i)
if j < k:
sample[j] = item
return sampleIdentifiers & Ordering
5Snowflake IDs
64-bit unique IDs: [timestamp][machine-id][sequence]. Time-ordered, distributed, no central coordinator.
Need globally unique IDs that are also roughly sortable by creation time, across many shards/machines. Beats UUIDs for index locality (better B-tree fanout).
Twitter coined Snowflake in 2010 for tweet IDs. Discord uses a variant for messages. Sonyflake (Sony) and Instagram's ID gen are spiritual descendants.
// 41 bits timestamp | 10 bits machine | 12 bits sequence
long id = ((now - EPOCH) << 22) | (machineId << 12) | (sequence++ & 0xFFF);UUID v7
Timestamp-prefixed UUID — sortable, monotonic, still 128-bit and globally unique.
When you want UUID's uniqueness guarantees but also need DB index locality. UUID v4 randomness wrecks B-tree caches; v7 keeps recent writes hot.
Postgres 16 partitioning works far better with UUID v7 keys than v4 because adjacent inserts land on the same page. Many ORMs (Prisma, Drizzle) recommend v7 for new tables.
Vector Clocks
Each node maintains a counter; events carry a vector of counters that captures partial causal ordering.
Detecting concurrent updates in a distributed store. Two versions with incomparable vectors are concurrent (a conflict); one strictly dominates the other if its vector is ≥ in every position.
Riak and DynamoDB use vector clocks (Dynamo paper). When you `GET` a key and see two versions, you reconcile — same primitive amazon.com uses for shopping cart merge.
Lamport Timestamps
Single counter per node, advanced to max(local, received) + 1 on every message. Gives total ordering of causally related events.
When you need 'happens-before' ordering across distributed events but don't need to detect concurrency (just to break ties consistently).
Foundational primitive behind Paxos, Raft, and many CRDTs. Cassandra uses Lamport-style timestamps for last-write-wins reconciliation.
Monotonic Counter Service
A dedicated service hands out strictly increasing 64-bit IDs. Simpler than Snowflake when you don't need offline generation.
URL shortener short codes, order numbers, anywhere you need dense sequential IDs (not random). Trade-off: single point of contention unless you shard it.
Bitly's hash-vs-counter hybrid. Internally, many systems pre-allocate ranges (e.g., 'this server reserves IDs 1M-2M') so the central service is hit infrequently.
Throughput & Flow
6Backpressure
When a downstream component is slow, signal upstream to slow down (don't just buffer and OOM).
Anywhere a producer can outrun a consumer: streams, queues, network sockets, UI event loops. Without backpressure, slowness becomes outages.
Node.js streams expose `.write()` returning false; Reactive Streams (RxJava, Project Reactor) make backpressure first-class via `request(n)`. Kafka consumers use offset commits as natural backpressure.
Batching
Coalesce many small operations into one large one to amortize fixed per-request overhead.
RPC overhead dominating real work. Network round trips, DB inserts, S3 puts, log writes. Trade latency (wait for batch to fill) for throughput.
Kafka producer's `linger.ms` waits up to N ms to fill a batch. DynamoDB's BatchWriteItem can do 25 puts per call. GraphQL's DataLoader pattern is request-scoped batching.
Debounce
Wait until a flurry of events stops for N ms, then act once. Drops everything in between.
User input that you only want to act on once they pause — search-as-you-type firing a query, window resize firing a re-layout.
Search input firing the API call 300ms after the last keystroke. React libraries use `useDebouncedValue` for this. Different from throttle (throttle still fires periodically).
function debounce(fn, ms) {
let t;
return (...args) => {
clearTimeout(t);
t = setTimeout(() => fn(...args), ms);
};
}Throttle
Emit at most once per N ms regardless of how often it's called. Unlike debounce, never goes silent.
Scroll handlers, mouse-move analytics, anything that fires constantly where you want a steady downsampled signal rather than 'wait for stop.'
Scroll-based animations throttled to 60fps (16ms). Analytics 'on page heartbeat' fired at most every 5s. Rate-limiting outbound API calls.
Token Bucket
Bucket refills at rate R; each request takes one token. Allows controlled bursts up to bucket size.
Rate limiting where occasional bursts are fine. Most general-purpose throttle: e.g., 100 req/min average, allow 30 in a burst.
AWS API Gateway, Stripe, Cloudflare all use token bucket variants. Linux's tc traffic shaper uses it for QoS.
class TokenBucket:
def __init__(self, capacity, refill_per_sec):
self.capacity = capacity
self.tokens = capacity
self.refill = refill_per_sec
self.last = time.time()
def allow(self):
now = time.time()
self.tokens = min(self.capacity,
self.tokens + (now - self.last) * self.refill)
self.last = now
if self.tokens >= 1:
self.tokens -= 1
return True
return FalseLeaky Bucket
Requests enter a queue, processed at fixed rate. Smooths out bursts entirely — no spike ever reaches downstream.
When the downstream cannot handle bursts at all (legacy DB, third-party with strict QPS). Token bucket allows bursts; leaky bucket guarantees a smooth rate.
Outbound webhook delivery: enqueue, drain at N per second so the receiver isn't slammed. Many billing systems use it for outbound API calls to providers.
Coordination
6Leases + TTL
Grant time-bounded exclusive ownership. Holder must renew before expiry; if it dies, the lease auto-releases.
Leader election, distributed locks, work assignment. Avoids 'zombie owner' problems where a crashed node holds a lock forever.
Kubernetes leader election (controllers all try to grab a lease in a ConfigMap, renew every few seconds). Chubby and etcd are built around lease primitives.
Quorum Reads/Writes
Write to W replicas, read from R replicas, with W + R > N. Guarantees any read sees at least one of the writes.
Tunable consistency in a replicated store. Lower W and R = faster + cheaper but eventual; higher = stronger consistency but slower and less available.
Cassandra and DynamoDB expose read/write consistency per request. Spanner and CockroachDB use Paxos/Raft (a flavor of quorum) for every write.
Read Repair
On a quorum read, if replicas disagree, write the freshest version back to the stale ones in the background.
Eventually-consistent stores where you tolerate some stale replicas but want them to converge without a separate sweep process.
Cassandra and Dynamo do this on every read above CL.ONE. Combined with anti-entropy, replicas converge even without explicit reconciliation.
Anti-Entropy / Merkle Sync
Background process that compares replicas using a Merkle tree of data and ships only the differing ranges.
Keeping replicas in sync when read-repair alone won't reach cold data. Catch-up after a node was offline for a long time.
Cassandra's nodetool repair builds Merkle trees per range. DynamoDB does similar. Riak and Bitcoin/Ethereum P2P sync use the same Merkle-tree-difference trick.
Heartbeats
Periodic 'I'm alive' signal. Absence of N consecutive heartbeats means the peer is presumed dead.
Failure detection in any distributed system: cluster membership, load balancers, service mesh. Foundation for failover.
Kubernetes kubelet → control plane heartbeats. ZooKeeper ephemeral nodes are heartbeat-tied. Akka clusters, Cassandra gossip, Redis Sentinel — all heartbeat-based.
Optimistic Concurrency / CAS
Read a value with its version, send update conditional on version unchanged. Server rejects if someone else wrote in the meantime.
Multiple writers to the same row, but contention is rare. Avoids the throughput hit of pessimistic locks; conflicts surface as 'retry the operation.'
DynamoDB's `ConditionExpression`, S3's `If-Match` ETag, HTTP PUT with `If-Match`, Git's push (rejects non-fast-forward). Postgres MVCC + SELECT FOR UPDATE NOWAIT is a hybrid.