Techniques

Techniques

Production idioms — the smart tricks you weave into algorithms and systems. Exponential backoff, idempotency keys, reservoir sampling, Snowflake IDs, leases, and more.

Grouped by category · 27 entries

Reliability

6

Exponential Backoff + Jitter

Grow retry intervals exponentially and add randomization so clients don't synchronize their retries.

When to use

Calling any unreliable downstream (network, third-party API, flaky DB). Without jitter, a thundering-herd of clients all retry at the same moment after an outage.

In the wild

AWS SDK clients retry with full jitter: sleep = random(0, base × 2^attempt). Both Google and Stripe SDKs follow the same pattern.

Sketchpython
def retry(fn, max_attempts=5, base_ms=100, cap_ms=10_000):
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise
            backoff = min(cap_ms, base_ms * 2 ** attempt)
            jitter = random.uniform(0, backoff)
            time.sleep(jitter / 1000)

Idempotency Keys

Caller sends a unique key with each request; server dedupes on it so retries don't double-charge.

When to use

Any non-idempotent mutation that might be retried: payments, signups, order creation. Network errors mid-flight are normal; you don't know whether the server committed.

In the wild

Stripe's `Idempotency-Key` header. Server stores key → response for 24h; second request with the same key returns the cached response instead of re-running the work.

Hedged Requests

If the first request hasn't returned by some threshold (p95), fire a second one and take whichever finishes first.

When to use

Tail-latency-sensitive reads where the long tail is wasted (e.g., a single slow replica). Costs ~2x reads on the tail but slashes p99 latency.

In the wild

Google's BigTable + Spanner read clients hedge to a second replica after ~95th percentile of recent latency. Used internally by search to keep query latency consistent.

Deadline / Retry Budget

Bound the total time spent retrying, not the number of attempts. Each retry costs from the same budget.

When to use

Any call inside a larger request that itself has a timeout. Counting attempts is the wrong unit: 5 retries with backoff can blow past your caller's 200ms SLA.

In the wild

gRPC's context deadlines. Each downstream RPC inherits a deadline; retries must finish before it expires. Envoy and Istio implement the same with a `x-envoy-upstream-rq-timeout-ms` header.

Circuit Breaker

Track failures from a dependency; once they exceed a threshold, open the circuit and fail fast for a cool-down period.

When to use

Calling a dependency that can degrade your own health if you keep retrying. Without a breaker, your threads pile up on a slow call and the whole service stops responding.

In the wild

Netflix Hystrix popularized the pattern (now Resilience4j in the JVM world). Envoy / Istio have it built into the service mesh. Pattern: closed → open after N failures → half-open after timeout → closed if probe succeeds.

Sketchpython
state = "CLOSED"; failures = 0
def call(fn):
    if state == "OPEN" and now() < open_until: raise Tripped()
    if state == "OPEN": state = "HALF_OPEN"
    try:
        result = fn()
        failures = 0; state = "CLOSED"
        return result
    except:
        failures += 1
        if failures >= THRESHOLD:
            state = "OPEN"; open_until = now() + COOLDOWN
        raise

Graceful Degradation

When a non-critical dependency fails, return a useful partial response instead of erroring the whole request.

When to use

Composite responses where some fields are nice-to-have. Recommendations down → show the page without them. Search facets down → return results, drop facets.

In the wild

Amazon's product page shows the product even if reviews, recommendations, or 'also bought' services time out. Each component has a fallback (empty list, cached snapshot, or skeleton).

Approximation

4

Probabilistic set membership in tiny memory. No false negatives; tunable false-positive rate.

When to use

When 'might be there' is good enough and you want to short-circuit an expensive lookup. Reading a row that doesn't exist is wasted I/O; a bloom filter says 'definitely not, skip it.'

In the wild

Cassandra puts a bloom filter in front of every SSTable to skip reads that would miss. Chrome historically used one for the malicious-URL list (millions of URLs, ~2MB).

HyperLogLog

Estimate cardinality (unique count) of a massive set using a few KB of memory and a small error margin.

When to use

Counting unique visitors, distinct query terms, or unique IPs across billions of events. Exact counts require O(n) memory; HLL gets you within ~1% in O(log log n).

In the wild

Redis's `PFADD` / `PFCOUNT` commands. Used by ad networks for daily unique reach, by BigQuery for `APPROX_COUNT_DISTINCT`.

Count-Min Sketch

Estimate frequency of items in a stream using sublinear memory; never underestimates.

When to use

Heavy-hitter detection in streams: 'which user is making the most requests right now?' Real exact counters would explode memory at internet scale.

In the wild

Used in Cloudflare's DDoS detection, in Twitter's heavy-hitter analytics, and as a building block for top-K queries on streams.

Reservoir Sampling

Sample k items uniformly at random from a stream of unknown (and unboundedly large) length.

When to use

When the dataset is too large or unbounded to hold in memory, and you need a representative random subsample. Logs sampling, A/B testing of streaming events, telemetry.

In the wild

Distributed tracers (Jaeger, OpenTelemetry) use reservoir sampling to keep a representative slice of traces under a fixed memory budget.

Sketchpython
def reservoir(stream, k):
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = random.randint(0, i)
            if j < k:
                sample[j] = item
    return sample

Identifiers & Ordering

5

Snowflake IDs

64-bit unique IDs: [timestamp][machine-id][sequence]. Time-ordered, distributed, no central coordinator.

When to use

Need globally unique IDs that are also roughly sortable by creation time, across many shards/machines. Beats UUIDs for index locality (better B-tree fanout).

In the wild

Twitter coined Snowflake in 2010 for tweet IDs. Discord uses a variant for messages. Sonyflake (Sony) and Instagram's ID gen are spiritual descendants.

Sketchjava
// 41 bits timestamp | 10 bits machine | 12 bits sequence
long id = ((now - EPOCH) << 22) | (machineId << 12) | (sequence++ & 0xFFF);

UUID v7

Timestamp-prefixed UUID — sortable, monotonic, still 128-bit and globally unique.

When to use

When you want UUID's uniqueness guarantees but also need DB index locality. UUID v4 randomness wrecks B-tree caches; v7 keeps recent writes hot.

In the wild

Postgres 16 partitioning works far better with UUID v7 keys than v4 because adjacent inserts land on the same page. Many ORMs (Prisma, Drizzle) recommend v7 for new tables.

Vector Clocks

Each node maintains a counter; events carry a vector of counters that captures partial causal ordering.

When to use

Detecting concurrent updates in a distributed store. Two versions with incomparable vectors are concurrent (a conflict); one strictly dominates the other if its vector is ≥ in every position.

In the wild

Riak and DynamoDB use vector clocks (Dynamo paper). When you `GET` a key and see two versions, you reconcile — same primitive amazon.com uses for shopping cart merge.

Lamport Timestamps

Single counter per node, advanced to max(local, received) + 1 on every message. Gives total ordering of causally related events.

When to use

When you need 'happens-before' ordering across distributed events but don't need to detect concurrency (just to break ties consistently).

In the wild

Foundational primitive behind Paxos, Raft, and many CRDTs. Cassandra uses Lamport-style timestamps for last-write-wins reconciliation.

Monotonic Counter Service

A dedicated service hands out strictly increasing 64-bit IDs. Simpler than Snowflake when you don't need offline generation.

When to use

URL shortener short codes, order numbers, anywhere you need dense sequential IDs (not random). Trade-off: single point of contention unless you shard it.

In the wild

Bitly's hash-vs-counter hybrid. Internally, many systems pre-allocate ranges (e.g., 'this server reserves IDs 1M-2M') so the central service is hit infrequently.

Throughput & Flow

6

Backpressure

When a downstream component is slow, signal upstream to slow down (don't just buffer and OOM).

When to use

Anywhere a producer can outrun a consumer: streams, queues, network sockets, UI event loops. Without backpressure, slowness becomes outages.

In the wild

Node.js streams expose `.write()` returning false; Reactive Streams (RxJava, Project Reactor) make backpressure first-class via `request(n)`. Kafka consumers use offset commits as natural backpressure.

Batching

Coalesce many small operations into one large one to amortize fixed per-request overhead.

When to use

RPC overhead dominating real work. Network round trips, DB inserts, S3 puts, log writes. Trade latency (wait for batch to fill) for throughput.

In the wild

Kafka producer's `linger.ms` waits up to N ms to fill a batch. DynamoDB's BatchWriteItem can do 25 puts per call. GraphQL's DataLoader pattern is request-scoped batching.

Debounce

Wait until a flurry of events stops for N ms, then act once. Drops everything in between.

When to use

User input that you only want to act on once they pause — search-as-you-type firing a query, window resize firing a re-layout.

In the wild

Search input firing the API call 300ms after the last keystroke. React libraries use `useDebouncedValue` for this. Different from throttle (throttle still fires periodically).

Sketchjavascript
function debounce(fn, ms) {
  let t;
  return (...args) => {
    clearTimeout(t);
    t = setTimeout(() => fn(...args), ms);
  };
}

Throttle

Emit at most once per N ms regardless of how often it's called. Unlike debounce, never goes silent.

When to use

Scroll handlers, mouse-move analytics, anything that fires constantly where you want a steady downsampled signal rather than 'wait for stop.'

In the wild

Scroll-based animations throttled to 60fps (16ms). Analytics 'on page heartbeat' fired at most every 5s. Rate-limiting outbound API calls.

Bucket refills at rate R; each request takes one token. Allows controlled bursts up to bucket size.

When to use

Rate limiting where occasional bursts are fine. Most general-purpose throttle: e.g., 100 req/min average, allow 30 in a burst.

In the wild

AWS API Gateway, Stripe, Cloudflare all use token bucket variants. Linux's tc traffic shaper uses it for QoS.

Sketchpython
class TokenBucket:
    def __init__(self, capacity, refill_per_sec):
        self.capacity = capacity
        self.tokens = capacity
        self.refill = refill_per_sec
        self.last = time.time()

    def allow(self):
        now = time.time()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

Requests enter a queue, processed at fixed rate. Smooths out bursts entirely — no spike ever reaches downstream.

When to use

When the downstream cannot handle bursts at all (legacy DB, third-party with strict QPS). Token bucket allows bursts; leaky bucket guarantees a smooth rate.

In the wild

Outbound webhook delivery: enqueue, drain at N per second so the receiver isn't slammed. Many billing systems use it for outbound API calls to providers.

Coordination

6

Grant time-bounded exclusive ownership. Holder must renew before expiry; if it dies, the lease auto-releases.

When to use

Leader election, distributed locks, work assignment. Avoids 'zombie owner' problems where a crashed node holds a lock forever.

In the wild

Kubernetes leader election (controllers all try to grab a lease in a ConfigMap, renew every few seconds). Chubby and etcd are built around lease primitives.

Quorum Reads/Writes

Write to W replicas, read from R replicas, with W + R > N. Guarantees any read sees at least one of the writes.

When to use

Tunable consistency in a replicated store. Lower W and R = faster + cheaper but eventual; higher = stronger consistency but slower and less available.

In the wild

Cassandra and DynamoDB expose read/write consistency per request. Spanner and CockroachDB use Paxos/Raft (a flavor of quorum) for every write.

Read Repair

On a quorum read, if replicas disagree, write the freshest version back to the stale ones in the background.

When to use

Eventually-consistent stores where you tolerate some stale replicas but want them to converge without a separate sweep process.

In the wild

Cassandra and Dynamo do this on every read above CL.ONE. Combined with anti-entropy, replicas converge even without explicit reconciliation.

Anti-Entropy / Merkle Sync

Background process that compares replicas using a Merkle tree of data and ships only the differing ranges.

When to use

Keeping replicas in sync when read-repair alone won't reach cold data. Catch-up after a node was offline for a long time.

In the wild

Cassandra's nodetool repair builds Merkle trees per range. DynamoDB does similar. Riak and Bitcoin/Ethereum P2P sync use the same Merkle-tree-difference trick.

Heartbeats

Periodic 'I'm alive' signal. Absence of N consecutive heartbeats means the peer is presumed dead.

When to use

Failure detection in any distributed system: cluster membership, load balancers, service mesh. Foundation for failover.

In the wild

Kubernetes kubelet → control plane heartbeats. ZooKeeper ephemeral nodes are heartbeat-tied. Akka clusters, Cassandra gossip, Redis Sentinel — all heartbeat-based.

Optimistic Concurrency / CAS

Read a value with its version, send update conditional on version unchanged. Server rejects if someone else wrote in the meantime.

When to use

Multiple writers to the same row, but contention is rare. Avoids the throughput hit of pessimistic locks; conflicts surface as 'retry the operation.'

In the wild

DynamoDB's `ConditionExpression`, S3's `If-Match` ETag, HTTP PUT with `If-Match`, Git's push (rejects non-fast-forward). Postgres MVCC + SELECT FOR UPDATE NOWAIT is a hybrid.