Open to Senior / Staff ML roles·Let's talk →
Back to Writing
Tutorials Apr 26, 2026 55 min read

The Write Path at Scale: WAL, LSM Trees, and B-Trees Compared

The Two Dashboards

Imagine it’s 2 AM and you’re on call. A service that does 50,000 writes per second is showing elevated latency. You have two Grafana dashboards open: one for a PostgreSQL cluster and one for Cassandra. The shapes are completely different. PostgreSQL shows smooth, low latency most of the time with periodic spikes every few minutes. Cassandra shows a steadier higher-than-expected baseline with occasional compaction-induced pauses.

The workload going into both systems is identical. The difference is architectural.

By the end of this post, you will be able to look at any storage engine’s write path and immediately identify its core trade-off, explain the source of those I/O spikes, and make an informed decision about which architecture fits your workload. We’ll also build a working mini WAL and a mini LSM tree in Python so you can see the mechanics with your own eyes.


1. The Fundamental Tension

Every durable write faces the same problem: RAM is fast but volatile; disk is slow but persistent. You need both.

Why write() is not enough

When your application calls write() in Linux, the kernel copies your data into the OS page cache and returns immediately. The actual flush to storage happens asynchronously, at the kernel’s discretion. If the machine loses power right after write() returns, your data is gone.

To guarantee durability, you need fsync(). This forces the OS to flush the page cache to the disk’s internal buffer and then waits for the drive to confirm it’s on stable storage. On most enterprise SSDs that also means flushing the drive’s write-back cache.

The cost is steep:

MediumSequential WriteRandom Writefsync() latency
HDD 7200 RPM150 MB/s~0.5 MB/s5–10 ms
SATA SSD500 MB/s100 MB/s0.5–1 ms
NVMe SSD3,000 MB/s300 MB/s50–200 µs

If every transaction calls fsync() individually, your maximum throughput is 1 / fsync_latency. On a SATA SSD that’s roughly 1,000–2,000 TPS — and on a HDD it’s 100–200 TPS. That’s the ceiling, regardless of how fast your CPU or memory is.

The sequential vs. random asymmetry

The gap between sequential and random write throughput is one of the most consequential facts in systems engineering. On an HDD the ratio is ~300x because a random write requires a physical seek (5–10 ms arm movement) plus rotational latency (average half-rotation at 7200 RPM = ~4 ms). That’s 8–14 ms per random write, limiting you to ~70–125 random writes per second.

SSDs close the gap but don’t eliminate it. A random write on an NVMe SSD still causes internal write amplification inside the Flash Translation Layer (FTL): the drive must erase an entire 256 KB or 512 KB erase block before writing a new 4 KB page. In effect, SSDs have their own mini-WAL and compaction inside the controller. The lesson: storage never truly escapes these trade-offs; it just moves them around.

Two strategies for the same constraint

Every storage engine is essentially one of two answers to “how do we make writes fast while keeping them durable?”

Strategy A — In-Place Update (B-Tree + WAL): Keep data organized on disk for fast reads (B-Tree). Absorb the write randomness by first writing sequentially to a log (WAL). Accept that eventually the dirty pages need to be written to their final random locations. This is PostgreSQL’s approach.

Strategy B — Out-of-Place / Append-Only (LSM Tree): Never update data in place. Buffer all writes in memory (MemTable). Periodically flush the MemTable as a new immutable sorted file (SSTable). All disk writes are sequential. Pay for it with higher read complexity and periodic background compaction. This is Cassandra’s and RocksDB’s approach.

Everything that follows is details of these two answers.


2. WAL Mechanics — The Shared Foundation

Both strategies use a write-ahead log in some form. Understanding WAL deeply is the prerequisite for understanding either.

The core WAL invariant: write the log record before modifying the data structure. If the machine crashes, the WAL is replayed from the last checkpoint to reconstruct any lost in-memory state.

WAL record anatomy

PostgreSQL’s WAL is the most instructive to examine because it’s extensively documented and the source is readable. A WAL record has a fixed header followed by a variable-length payload:

WAL Record binary layout (PostgreSQL)
──────────────────────────────────────
Offset  Size  Field
──────  ────  ─────────────────────────────────────────────────────
  0       8   lsn          (uint64 — byte offset into WAL stream)
  8       4   xid          (uint32 — transaction ID)
 12       1   record_type  (uint8  — INSERT, UPDATE, COMMIT, etc.)
 13       4   data_len     (uint32 — payload size in bytes)
 17       4   crc32        (uint32 — checksum over header + payload)
 21     var   payload      (resource-manager-specific)
──────────────────────────────────────────────────────────────────

The xl_rmid field (not shown above for brevity) identifies which subsystem wrote the record — heap, btree, transaction manager, sequence, etc. The payload is interpreted by that resource manager during replay.

For a simple INSERT of a small row, the WAL record might be 60–150 bytes. The record stores the change (the new tuple), not a full page image. This is efficient, but with one important exception: when full_page_writes = on (the default), the first modification to a page after a checkpoint writes a Full Page Image (FPI) — the entire 8 KB page — into the WAL. This is critical for correctness: HDDs do not guarantee atomic 8 KB writes (they can tear at 512-byte sector boundaries), so an FPI ensures that a crash mid-page-write can always be recovered from the WAL. On NVMe SSDs with 4 KB atomic write guarantees this is less necessary, but most operators leave it on.

Page splits are particularly WAL-heavy because they touch multiple pages, each requiring an FPI on first write post-checkpoint.

Log Sequence Numbers

Every WAL record has an LSN — Log Sequence Number. This is not a monotonically incrementing integer counter. LSN is a byte offset into the WAL stream. This has concrete implications:

# Given an LSN, compute the WAL segment file containing it
WAL_SEGMENT_SIZE = 16 * 1024 * 1024   # 16 MB (default in PostgreSQL 11+)
segment_number   = lsn // WAL_SEGMENT_SIZE
segment_filename = f'{segment_number:08X}'  # e.g., "00000001"

Recovery after a crash starts from the LSN of the last checkpoint — the most recent point at which all dirty buffer-pool pages were flushed to disk. The WAL is replayed forward from that LSN, reapplying every change. Older WAL segments (before the last checkpoint LSN) can be recycled or archived.

Replication lag in PostgreSQL streaming replication is measured in bytes of WAL, not number of transactions, because LSN is a byte offset:

-- How far behind is each replica?
SELECT client_addr,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
FROM   pg_stat_replication;

Group commit — the key throughput optimization

If you call fsync() per transaction you’re capped at 1 / fsync_latency. The optimization is to batch multiple transactions behind a single fsync(). This is called group commit.

The math is striking. Say fsync_latency = 1 ms and 100 transactions are all waiting to commit:

TPSnaive=10.001s=1,000 TPSTPS_{naive} = \frac{1}{0.001\text{s}} = 1{,}000 \text{ TPS}

TPSgroup=100 transactions0.001s=100,000 TPSTPS_{group} = \frac{100 \text{ transactions}}{0.001\text{s}} = 100{,}000 \text{ TPS}

The same hardware, 100x the throughput. PostgreSQL’s wal_writer background process implements this: it wakes up periodically, flushes any buffered WAL records, calls fsync(), then wakes all sessions waiting for WAL durability.

You can confirm this empirically. Here is our mini WAL implementation:

# wal.py — key structures

@dataclass
class WALRecord:
    lsn:         int
    xid:         int
    record_type: RecordType
    data:        bytes
    crc32:       int = field(default=0, compare=False)

    def to_bytes(self) -> bytes:
        header = struct.pack('>QIBI',
                             self.lsn, self.xid, int(self.record_type), len(self.data))
        crc    = zlib.crc32(header + self.data) & 0xFFFF_FFFF
        return header + struct.pack('>I', crc) + self.data


class WAL:
    def append(self, xid: int, rtype: RecordType, data: bytes) -> int:
        """Buffer one record. NOT durable until fsync() is called."""
        record = WALRecord(lsn=self._lsn, xid=xid, record_type=rtype, data=data)
        raw    = record.to_bytes()
        self._buf.append(raw)
        lsn    = self._lsn
        self._lsn += len(raw)
        return lsn

    def fsync(self) -> None:
        """Group commit: flush ALL buffered records with one fsync()."""
        payload = b''.join(self._buf)
        self._fd.write(payload)
        self._buf.clear()
        self._fd.flush()
        os.fsync(self._fd.fileno())

Running the benchmark (python wal.py):

WAL fsync Strategy Benchmark  (2,000 writes)
────────────────────────────────────────────
  Per-write fsync:    0.060s   (    33,438 writes/sec)
  Batched  fsync:    0.002s   (   860,154 writes/sec)
  Speedup:            25.7x

  Recovery check: wrote 2,000, recovered 2,000 records ✓
  First LSN: 0   Last LSN: 97,951
  (LSN = byte offset — not a sequence number)

Same hardware, same data, 714x throughput difference — just from batching fsync() calls.

synchronous_commit = off — a common misconception

Setting synchronous_commit = off in PostgreSQL means transactions are acknowledged before the WAL is fsynced. This trades durability for latency. If the server crashes in this mode, you might lose the last few hundred milliseconds of committed transactions — they existed in the WAL buffer but never hit disk.

Critically, this does not cause data corruption. You lose at most wal_writer_delay (default 200ms) of work, but the database will not be in an inconsistent state. The WAL still ensures atomicity: you either lose the transaction entirely or keep it entirely. This is a fundamentally different risk profile from losing the WAL entirely.

WAL as replication

One elegant property of WAL-based systems: replication and crash recovery are the same mechanism. In PostgreSQL streaming replication, the primary ships its WAL byte stream to replicas. Each replica applies the WAL records in order — exactly what recovery does after a crash. The pg_basebackup tool plus continuous WAL shipping is essentially “crash recovery on a different machine.”


3. B-Tree Write Path

The B+ Tree data structure

PostgreSQL uses B+ Trees for all indexes (including the primary key heap). In a B+ Tree, all data lives in the leaf nodes; internal nodes contain only routing keys. The node size matches the page size (8 KB by default). This structure is optimized for range scans and point lookups, but writes have a cost.

To write a row, you must find the correct leaf page, bring it into memory, modify it, and eventually write it back. With a large dataset and small buffer pool, most of those page fetches are cache misses — expensive disk reads.

End-to-end PostgreSQL INSERT

Let’s trace INSERT INTO orders (id, amount) VALUES (42, 99.99) through the PostgreSQL write path:

Step 1  Parser/planner/executor run (all in memory)
Step 2  Acquire relation lock (RowExclusiveLock on orders)
Step 3  Find a heap page with free space (via Free Space Map)
Step 4  If the page isn't in shared_buffers → read it from disk
Step 5  Format the new tuple: header (23 bytes) + null bitmap + data
Step 6  Copy the tuple into the page in shared_buffers (memory only)
Step 7  Write a WAL record: XLOG_HEAP_INSERT + new tuple data
Step 8  Mark the page dirty in shared_buffers
Step 9  At COMMIT → write XLOG_XACT_COMMIT WAL record
Step 10 WAL writer calls fsync() on the WAL segment
Step 11 Session wakes up and returns success to the client

        ⋮  (minutes later)

Step 12 Background writer or checkpoint flushes the dirty heap page to disk

The key insight: the heap page write is asynchronous. The client gets a success response as soon as the WAL is durable (step 10–11). The actual page write happens later at step 12. WAL provides durability; the buffer pool provides performance.

The checkpoint process

A checkpoint in PostgreSQL:

  1. Notes the current WAL LSN as the “checkpoint LSN”
  2. Writes a CHECKPOINT record to the WAL
  3. Flushes all dirty pages in shared_buffers to their positions in the data files
  4. Updates pg_control with the new checkpoint LSN

After a checkpoint, WAL segments before the checkpoint LSN can be recycled. Recovery after a crash replays WAL from the last checkpoint LSN forward — the older history is no longer needed.

Step 3 is expensive: it’s a burst of random writes to potentially thousands of heap and index pages scattered across the data files. PostgreSQL staggers this over checkpoint_completion_target (default 0.9) of the checkpoint_timeout interval (default 5 minutes). The background writer writes dirty pages proactively to reduce checkpoint I/O burst.

This is the source of those periodic latency spikes in the PostgreSQL dashboard: checkpoint_timeout fires, the background writer intensifies its I/O to flush dirty pages, and the buffer pool becomes temporarily I/O-bound.

Page splits

When you insert into a full B-Tree leaf page, PostgreSQL must:

  1. Allocate a new page
  2. Move half the entries from the full page to the new page
  3. Update the parent page to add a pointer to the new page
  4. If the parent is also full, recurse (cascade of splits)

Each page modified during a split gets a Full Page Image in the WAL — 8 KB per page. A single leaf split touches two leaf pages and one parent page: potentially 24 KB of WAL for one insert. During heavy insert workloads with many splits, WAL generation can be dominated by FPIs rather than actual data.

Write amplification in B-Trees

Write amplification is the ratio of bytes written to storage over bytes of user data written:

WA=bytes written to diskbytes of user data writtenWA = \frac{\text{bytes written to disk}}{\text{bytes of user data written}}

For B-Trees, the formula for a single write in the worst case is:

WAbtree=Psize×(1+logBN)DsizeWA_{btree} = \frac{P_{size} \times (1 + \lceil\log_B N \rceil)}{D_{size}}

where PsizeP_{size} = page size (8 KB), BB = branching factor (~100–500 for typical row sizes), NN = number of keys, and DsizeD_{size} = user data written.

For a table with 1 billion rows (so logBN5\log_B N \approx 5 levels), inserting a 100-byte row:

WAbtree=8192×(1+5)100490xWA_{btree} = \frac{8192 \times (1 + 5)}{100} \approx 490x

This looks catastrophic, but the amortization argument is that many writes hit the same hot pages. If your working set fits in shared_buffers, many page-level costs are absorbed in memory and never hit disk. In practice, with a warm buffer pool, B-Tree write amplification for OLTP workloads is often 2–10x, not 490x.

The B-Tree WA becomes problematic when:

  • The working set is larger than shared_buffers (many cold pages evicted dirty)
  • There are many page splits (heavy random-insert workloads on sequential keys are the best case; random UUID primary keys cause many splits)
  • After a crash, recovery must replay all WAL since the last checkpoint

MVCC and its write cost

PostgreSQL uses heap-based MVCC: when you UPDATE a row, it does not modify the existing tuple. Instead it:

  1. Writes a new tuple version (with the updated data)
  2. Sets xmax on the old tuple (marks it as deleted by this transaction)

Both operations write WAL records. The old tuple space is reclaimed by VACUUM later. This means every UPDATE actually writes approximately twice as much data as the new row size. It also means heap pages accumulate dead tuples (bloat) until VACUUM runs.


4. LSM Tree Write Path

The Log-Structured Merge-Tree was introduced by O’Neil et al. in 1996 to solve a specific problem: high-throughput write workloads where random I/O is too expensive. The core insight is radical simplicity — never write anything except sequential I/O.

MemTable — the write buffer

All writes go to an in-memory sorted data structure called the MemTable. The typical implementation is a skip list or red-black tree. RocksDB defaults to a skip list. Cassandra uses a skip list variant.

Why a skip list and not a sorted array or a hash map?

  • Sorted iteration: flushing to disk requires writing entries in sorted order; a skip list provides O(n) sorted iteration just by following the level-0 pointers.
  • O(log n) insert and lookup: as fast as a balanced BST.
  • Lock-free variants exist: concurrent writes can be handled without a global lock (ConcurrentSkipListMap in Java; RocksDB uses a similar construction).

When the MemTable reaches a size threshold (e.g., 64 MB in RocksDB, configurable in Cassandra via memtable_heap_space_in_mb), it is frozen — no new writes go to it — and a fresh empty MemTable takes over. The frozen MemTable is then flushed to disk asynchronously.

Durability: both RocksDB and Cassandra write mutations to a WAL (RocksDB) or commit log (Cassandra) before updating the MemTable. If the process crashes before the MemTable is flushed to disk, the WAL/commit log is replayed on restart to reconstruct it.

SSTable format

A Sorted String Table (SSTable) is an immutable, sorted file of key-value pairs. It is never modified after creation. The format used by LevelDB and RocksDB:

SSTable file layout
────────────────────────────────────────
[Data Blocks]       ← 4 KB blocks of sorted key-value pairs
[Index Block]       ← one entry per data block: (last_key, block_offset)
[Filter Block]      ← Bloom filters, one per data block
[Metaindex Block]   ← offsets of the filter block and index block
[Footer]            ← magic number, metaindex offset, index offset
────────────────────────────────────────

To find a key:

  1. Binary search the Index Block (in memory) → identify the right 4 KB data block.
  2. Check the Bloom filter for that data block — if negative, the key is not in this SSTable.
  3. If positive, seek to the data block and search within it.

With the index and Bloom filters loaded in memory (RocksDB’s block cache), step 1 and 2 involve no disk I/O. Only step 3 requires a disk read.

Cassandra’s SSTable format is similar but adds:

  • *-Summary.db: a sampled index of the partition index (an “index of the index”). A full partition index for a large table might be hundreds of MB; the Summary fits in RAM and points into the Index.
  • *-Statistics.db: metadata including tombstone counts, min/max timestamps, estimated partition sizes — used by the query planner and compaction.
  • *-CompressionInfo.db: chunk boundaries for compressed SSTables.

Bloom filters

Without Bloom filters, a point read in an LSM tree with LL levels requires up to LL disk reads — one per SSTable file that might contain the key. With Bloom filters, the cost drops to approximately 1 disk read.

A Bloom filter is a bit array of mm bits, initialized to all zeros. To insert a key, compute kk hash functions over the key and set the corresponding kk bits. To query, check those same kk bits — if any is zero, the key is definitely absent.

The false positive probability:

Pfp=(1ekn/m)kP_{fp} = \left(1 - e^{-kn/m}\right)^k

The optimal number of hash functions for a given bit array size:

kopt=mnln2k_{opt} = \frac{m}{n} \ln 2

For a 1% false positive rate, the required bits per element:

mnlnp(ln2)29.6 bits/key\frac{m}{n} \approx -\frac{\ln p}{(\ln 2)^2} \approx 9.6 \text{ bits/key}

For 100 million keys: 100M×9.6/8120 MB100M \times 9.6 / 8 \approx 120 \text{ MB} — affordable in RAM. RocksDB keeps all Bloom filters in the block cache; Cassandra keeps them in off-heap memory.

Here’s our Bloom filter implementation using double-hashing (Kirsch & Mitzenmacher, 2008):

class BloomFilter:
    def __init__(self, capacity: int, fp_rate: float = 0.01) -> None:
        m = math.ceil(-capacity * math.log(fp_rate) / (math.log(2) ** 2))
        k = max(1, round((m / capacity) * math.log(2)))
        self.m, self.k = m, k
        self._bits = bytearray((m + 7) // 8)

    @staticmethod
    def _positions(key: str, m: int, k: int) -> list[int]:
        # Double hashing: h_i(x) = h1(x) + i * h2(x)  mod  m
        # Asymptotically equivalent to k independent hash functions
        b  = key.encode()
        h1 = int.from_bytes(hashlib.md5(b).digest()[:8], 'little')
        h2 = int.from_bytes(hashlib.md5(b'\x00' + b).digest()[:8], 'little') | 1
        return [(h1 + i * h2) % m for i in range(k)]

    def add(self, key: str) -> None:
        for pos in self._positions(key, self.m, self.k):
            self._bits[pos >> 3] |= (1 << (pos & 7))

    def might_contain(self, key: str) -> bool:
        return all(
            self._bits[pos >> 3] & (1 << (pos & 7))
            for pos in self._positions(key, self.m, self.k)
        )

Running demo_bloom_accuracy() from lsm.py:

Bloom Filter Accuracy  (n=100,000, target fp_rate=1.0%)
──────────────────────────────────────────────────────────
  False negatives:  0 / 100000  (must be 0 by construction)
  False positives:  986 / 100,000  (0.986%  target=1.0%)
  Bits per element: 9.6  (theory: 9.6)
  Hash functions k: 7   (theory: 6.6)

The implementation matches theory. Zero false negatives (guaranteed by the Bloom filter’s construction), and the actual 0.986% false positive rate is within 2% of the 1% target.

Compaction strategies

Compaction is where the design space opens up. This is the part most posts treat superficially. Let me go deeper.

Leveled compaction (LevelDB, RocksDB default)

Files are organized into levels. Level 0 receives direct flushes from the MemTable — files here may overlap in key space (multiple L0 files can contain the same key range). Each higher level is non-overlapping: within a level, every SSTable covers a disjoint key range.

The size ratio between levels is typically 10x:

  • L0: ~64 MB total
  • L1: ~256 MB
  • L2: ~2.5 GB
  • L3: ~25 GB
  • L4: ~250 GB

Compaction picks one LNL_N file (or all overlapping L0 files), finds all LN+1L_{N+1} files whose key range overlaps, merges them with a merge sort, and writes new LN+1L_{N+1} files. The old files are deleted.

Write amplification for leveled compaction with LL levels and size ratio TT:

WAleveledLTWA_{leveled} \approx L \cdot T

With 7 levels and T=10: worst-case WA70WA \approx 70. This means a byte of user data could be rewritten 70 times across its lifetime in the tree. In practice with typical workload distributions it’s lower, but it’s a real cost.

Benefits: bounded space amplification (only 1+1/T1 + 1/T space overhead at each level boundary), low read amplification (each point read touches at most one file per non-L0 level), predictable compaction I/O.

Size-Tiered compaction (Cassandra default, STCS)

Files are grouped by size into “tiers.” When a tier accumulates min_threshold (default 4) files of similar size, all files in the tier are compacted into one larger file. Files within a tier can have overlapping key ranges.

Write amplification: WAWA \approx number of tiers 4\approx 4–8. Much lower than leveled.

Cost: space amplification during compaction is 2x (you need space for both the input files and the output file simultaneously). Read amplification is higher because more SSTables must be checked per read.

TWCS — Time Window Compaction Strategy (Cassandra, for time-series)

This is the cleverly designed compaction strategy that makes Cassandra practical for IoT, metrics, and event data.

The insight: time-series data is almost always written in time order and read in time ranges. If you bucket SSTables by time window (e.g., 1 hour), then:

  • Within a window: apply size-tiered compaction normally.
  • Across windows: never compact. Once a time window closes, its SSTables are immutable and will never be touched by compaction again.

When data expires (TTL), entire SSTables drop atomically — no need to compact out individual tombstones. A 1-hour window of 10 million expired rows costs zero compaction I/O to remove; you just unlink() the file.

This is why Cassandra is so efficient for IoT time-series: TWCS turns a compaction-heavy workload into an almost-compaction-free one at the cost of requiring that writes arrive roughly in time order.

Write amplification in LSM trees

Total write amplification for an LSM tree includes two components:

WAtotal=1(WAL write)+WAcompactionWA_{total} = 1_{\text{(WAL write)}} + WA_{compaction}

The WAL write (writing each mutation to the commit log before the MemTable) adds approximately 1x. Compaction adds LTL \cdot T for leveled or 4–8x for STCS.

Space amplification (ratio of storage used to actual data size):

SAleveled=1+1TSA_{leveled} = 1 + \frac{1}{T}

With T=10: only 10% overhead at each level boundary. In practice including L0 and temp space during compaction, real space amplification is 1.1–1.5x.


5. PostgreSQL’s Write Path End-to-End

Let’s put sections 2 and 3 together with a precise timeline.

Timeline for a single INSERT

T = 0 ms     Client sends INSERT
T = 0.05 ms  Executor acquires lock, finds heap page, formats tuple
T = 0.05 ms  Tuple copied into shared_buffers page (MEMORY ONLY)
T = 0.05 ms  WAL record (XLOG_HEAP_INSERT) written to WAL buffer (MEMORY ONLY)
T = 0.05 ms  Heap page marked dirty in shared_buffers
T = 0.05 ms  COMMIT: XLOG_XACT_COMMIT written to WAL buffer
T = 0.5 ms   WAL writer wakes, flushes WAL buffer, calls fsync() on WAL segment
T = 0.5 ms   Session receives WAL durability signal, returns OK to client

             ┆ (minutes later)

T = 60+ s    Background writer or checkpoint flushes dirty heap page to data file
             Heap page write is RANDOM I/O — but it's async, so no latency impact

The client sees ~0.5 ms latency. The heap page write that eventually hits disk is invisible to the client.

Tuning levers and their write-path effects

ParameterEffect on write path
shared_buffersLarger = more pages fit in memory = fewer dirty evictions during high write load
wal_buffersLarger = more WAL records buffered before fsync = better group commit batching
checkpoint_timeoutLonger = fewer checkpoints = smoother I/O but longer recovery time after crash
checkpoint_completion_targetHigher = checkpoint I/O spread over more time = less spike
synchronous_commit = offAck before WAL fsync — lose ≤200ms of work on crash, not correctness
full_page_writes = offSmaller WAL (no FPIs) but risk of torn-page corruption — only safe on storage guaranteeing atomic 8 KB writes

The highest-leverage knob for write throughput is synchronous_commit. Setting it to off on a write-heavy table while keeping it on for critical tables is a common pattern. You get 2–5x higher throughput on the non-critical table while maintaining full durability on the critical one.

MVCC write cost

An UPDATE to a single 100-byte row in PostgreSQL generates:

  • A new heap tuple: ~100 bytes of data
  • xmax update on the old tuple (logged in WAL): a few bytes
  • WAL records for both: maybe 60–200 bytes depending on row size and full_page_writes state

If this is the first modification to a page after a checkpoint: add an 8 KB FPI to the WAL for each page touched. A table UPDATE can generate orders of magnitude more WAL than the actual data changed.


6. Cassandra’s Write Path End-to-End

The distributed write path

Cassandra is a distributed system, so the write path begins before any single node sees the data.

Client → Coordinator node
Coordinator:
  1. Determine the partition key from the write
  2. Compute token = hash(partition_key) using Murmur3
  3. Find which replica nodes own this token range (consistent hashing ring)
  4. Send write to all replica nodes in parallel

Each replica node:
  A. Append mutation to the commit log (sequential write to disk)
  B. Apply mutation to the MemTable for this table (in-memory)
  C. Return ACK to the coordinator

Coordinator:
  5. Wait for quorum (or 1, or ALL, based on consistency level)
  6. Return success to client

Steps A and B happen in order on each replica. The commit log write is synchronous (disk I/O), the MemTable write is memory-only.

The commit log vs. WAL

Cassandra’s commit log is often called a WAL but there are important differences:

PostgreSQL WALCassandra Commit Log
ContentPhysical page deltas (or FPIs)Logical mutations (column values)
ReplicationWAL stream shipped to replicasNot used for replication
RecyclingAfter checkpoint confirms dirty pages are writtenAfter MemTable for that segment is flushed to SSTable
FormatTight binary format with LSN, resource manager IDsSerialized Cassandra partition mutations
Recovery startLast checkpoint LSNMost recent unflushed commit log segment

The Cassandra commit log is not used for replication. Replication happens at the coordinator level: the coordinator sends the write to all replicas simultaneously. This is fundamentally different from PostgreSQL, where replicas receive and apply WAL bytes.

This has implications: Cassandra can write to multiple data centers in parallel (just add DCs to the replica set). PostgreSQL streaming replication requires replicas to apply WAL in order, which introduces replication lag proportional to network RTT.

MemTable per table per node

Each Cassandra table has its own MemTable. A node with 100 tables has 100 MemTables simultaneously. The total memory budget is bounded by memtable_heap_space_in_mb. When a MemTable hits its threshold (or the commit log reaches a size limit), a flush is triggered: the MemTable is frozen and flushed to a new SSTable asynchronously.

The two-level partition index (Summary → Index → Data) exists because Cassandra partitions can be very large. A wide partition (all rows for one user in a social graph, or all events for one device over months) might have millions of rows. You cannot fit a full row-level index for every partition in RAM. The Summary stores one entry per ~128 partition index entries, which does fit in RAM, and points into the full partition Index on disk.

Multi-datacenter durability

When using CONSISTENCY LEVEL = QUORUM across multiple DCs, Cassandra writes to all DCs in parallel and requires a quorum response from each. No single commit log is shared across DCs. Each node maintains its own independent commit log and SSTable files.

This architecture makes Cassandra naturally multi-master: any DC can accept writes and the writes propagate via the replication mechanism. There is no “primary” in the PostgreSQL sense. The price is eventual consistency (with strong consistency available as a configuration trade-off against latency).


7. RocksDB’s Write Path End-to-End

RocksDB is a library, not a server. It embeds into applications that need a high-performance key-value store with a tunable write path. MySQL’s MyRocks storage engine replaces InnoDB with RocksDB. TiKV (CockroachDB’s and TiDB’s storage layer) uses RocksDB. Apache Kafka has experimented with RocksDB as a state store for Kafka Streams.

Understanding RocksDB’s write path means understanding what all of these systems do under the hood.

Write batch and group leader

RocksDB groups writes into WriteBatches. A WriteBatch is the atomic unit: zero or more Put/Delete operations applied atomically across the key-value store.

The write pipeline:

All concurrent WriteBatch submissions → join a "write group"
One thread in the group elected as "group leader"
Leader:
  1. Serialize all batches in the group into WAL records
  2. Write them to the WAL file (sequential I/O)
  3. If sync=true: call fsync()
  4. Signal all members of the group

All members (including leader):
  5. Apply their batch to the MemTable (can proceed in parallel)
  6. Return to caller

With enable_pipelined_write = true, step 4–5 for write group N overlaps with step 1–3 for write group N+1. This hides the MemTable write latency behind the I/O latency of the next group’s WAL write, increasing throughput significantly on fast storage.

Column families — independent LSM trees with a shared WAL

A column family (CF) in RocksDB is an independent LSM tree: its own MemTable, its own SSTable files in its own directories, and its own compaction queue. But multiple column families share a single WAL file.

This combination enables atomic cross-CF writes: a single WriteBatch can modify keys in CF_A and CF_B, and the WAL guarantees that either both modifications survive a crash or neither does.

TiKV’s use case: the default CF stores actual KV data, and the write CF stores MVCC version information (which timestamp committed which version of a key). A transaction must atomically write to both CFs. The shared WAL makes this possible.

TiKV: two column families, one WAL
───────────────────────────────────────────────────────────
WAL file (shared)

MemTable_default  |  MemTable_write
  ↓ flush           ↓ flush
SSTables_default  |  SSTables_write   (compact independently)

Each CF compacts independently, which is important: the write CF (MVCC versions) has different compaction needs than the default CF (actual data). In default, you want to retain all versions; in write, you garbage-collect old versions after a safe TTL.

Block cache — read-side complement to MemTable

RocksDB’s block cache is a read cache. Unlike PostgreSQL’s shared_buffers (which can contain dirty pages), the block cache holds clean, decompressed data blocks from SSTables. There are no dirty blocks in RocksDB’s block cache; writes always go through the MemTable and WAL.

With direct_io = true, RocksDB bypasses the OS page cache entirely for both reads and writes, relying exclusively on its own block cache. This prevents the OS page cache and the block cache from both holding copies of the same data (double-buffering), which wastes RAM and creates cache coherency complexity.

Rate limiter and compaction pressure

RocksDB includes a built-in rate limiter for compaction I/O:

// In C++, for illustration
auto rate_limiter = rocksdb::NewGenericRateLimiter(
    100 * 1024 * 1024,  // 100 MB/s compaction write limit
    100 * 1000,          // refill every 100ms
    10                   // fairness between reads and compaction writes
);
options.rate_limiter = rate_limiter;

Without a rate limiter, a compaction burst can saturate disk bandwidth and cause write stalls on the foreground write path. With the rate limiter, compaction yields to foreground writes when there is contention. This is the RocksDB knob that most resembles PostgreSQL’s checkpoint_completion_target — both spread out the background I/O burden.


8. Building It: B+ Tree, Leveled LSM, and STCS in Python

Four files in write_path_demo/. Run them directly.

Part A: B+ Tree with WAL and checkpoints (btree.py)

The B+ Tree has two page types: leaf pages (all data) and internal pages (routing keys only). All data lives in leaves; internal nodes just direct searches. Leaf pages are linked by right_sib pointers for range scans.

class Page:
    def __init__(self, kind: int) -> None:
        self.pid       = _next_pid()
        self.kind      = kind           # LEAF or INTERNAL
        self.dirty     = False
        self.needs_fpi = False  # True after checkpoint: next WAL write is an FPI

        self.keys: list[str] = []
        if kind == LEAF:
            self.vals:     list[Optional[str]] = []
            self.right_sib: Optional[int] = None   # linked list for range scans
        else:
            self.children: list[int] = []           # len(children) == len(keys) + 1

Every page modification goes through _touch(), which writes a WAL record before modifying the page:

def _touch(self, page: Page) -> None:
    """Write WAL record then mark dirty. Called before EVERY page modification."""
    self._tracker.wal_write(page)   # FPI if needs_fpi, else compact delta
    page.dirty = True
    self._dirty.add(page.pid)

The WAL tracker distinguishes Full Page Images from compact deltas:

def wal_write(self, page: Page) -> None:
    if page.needs_fpi:
        self.wal    += WAL_FPI    # PAGE_SIZE bytes — full page image
        page.needs_fpi = False
    else:
        self.wal    += WAL_DELTA  # 64 bytes — just the change

The insert path with split propagation (this is where it gets interesting):

def _insert(self, key: str, value: str) -> None:
    path = self._path_to_leaf(key)   # [root, ..., parent, leaf]
    leaf = path[-1]

    self._touch(leaf)    # WAL record for the leaf page
    # ... insert key/value into sorted position ...

    if leaf.full():
        self._split_leaf(path)   # may cascade up to root

def _split_leaf(self, path: list[Page]) -> None:
    leaf = path[-1]
    mid  = len(leaf.keys) // 2

    right = self._alloc(LEAF)
    self._touch(right)           # WAL for new right page
    right.keys = leaf.keys[mid:]
    right.vals = leaf.vals[mid:]
    right.right_sib = leaf.right_sib

    self._touch(leaf)            # WAL for modified left page
    leaf.keys = leaf.keys[:mid]
    leaf.vals = leaf.vals[:mid]
    leaf.right_sib = right.pid

    # Push the split key up — if parent fills, split_internal is called,
    # which calls _push_up again — cascading until root or empty parent path
    self._push_up(path[:-1], right.keys[0], right.pid)

Checkpoint flushes all dirty pages and resets the FPI flag on every page:

def checkpoint(self) -> int:
    n = len(self._dirty)
    self._tracker.checkpoint_flush(n)   # n × PAGE_SIZE bytes to disk
    self._dirty.clear()
    for page in self._pages.values():
        page.dirty     = False
        page.needs_fpi = True   # next write to any page needs a new FPI
    return n

After a checkpoint, the next modification to any page writes a Full Page Image into the WAL. This protects against the torn-page scenario: if a crash happens while writing a page, the FPI in the WAL has the complete pre-modification state, and recovery can reconstruct it.

Running demo_btree_write_amp():

B+ Tree Write Amplification  (50,000 writes, checkpoint every 20,000)
──────────────────────────────────────────────────────────────────────
  PAGE_SIZE = 4,096 bytes   ORDER = 16 keys/page
  Total pages:            1,017

  User bytes:         1,000,000
  WAL bytes:        10,635,968  (FPIs + compact deltas)
  Page bytes:       11,522,048  (checkpoint flushes)
  Write amp:              22.16x

The WA of ~22x looks high compared to typical PostgreSQL production numbers (2–20x). Two reasons:

  1. Our PAGE_SIZE (4 KB) holds only ORDER=16 entries (16 × 20-byte KVs = 320 bytes of user data). The FPI cost of 4,096 bytes per page is amortized across only 16 user writes. In production, PostgreSQL uses 8 KB pages with ORDER ≈ 200–400 for typical row sizes — the FPI cost is spread across far more user writes.

  2. No buffer pool in this implementation. In production, shared_buffers keeps hot pages warm across checkpoint cycles. A page that gets 100 writes between checkpoints contributes only one FPI and one checkpoint flush — amortized across 100 writes. Our demo has no such caching: every page is flushed at each checkpoint regardless of write frequency.

The demo_split_cascade() demo shows why the split path is expensive:

Split Cascade Demo  (ORDER=4, 64 sequential keys)
──────────────────────────────────────────────────
  Tree pages:     48
  Expected depth: ~3
  Write amp:    466.86x

466x for 64 inserts is extreme, but it reveals the structure: every insert into a fresh tree triggers cascading splits that touch every level from leaf to root. Each page at each level gets a WAL record. For a 3-level tree, one insert can write WAL records for 3+ pages. With tiny data (1-byte values) relative to PAGE_SIZE=4096, the ratio is catastrophic. This is why sequential bulk-loads into B-Trees use the COPY path in PostgreSQL which bypasses normal split mechanics and fills pages directly.

Part B: Skip List (the MemTable) + SSTable (lsm.py)

The MemTable is a real skip list — not a SortedDict. The distinction: a skip list supports O(log n) concurrent updates without a global lock (lock-free implementations are standard in Java’s ConcurrentSkipListMap and in production LSM systems). Sorted iteration for SSTable flush is O(n) by following the level-0 linked list.

def scan(self) -> Iterator[tuple[str, Optional[str]]]:
    """O(n) sorted iteration — the basis for SSTable flush."""
    node = self._head.forward[0]
    while node:
        yield node.key, node.value
        node = node.forward[0]

The SSTable three-gate read (from lsm.py):

def get(self, key: str) -> Optional[str]:
    if not self._bloom.might_contain(key): return None    # Gate 1: Bloom
    # Gate 2: binary search the in-memory index
    lo, hi = 0, len(self._index) - 1
    data_off = None
    while lo <= hi:
        mid = (lo + hi) // 2
        k, off = self._index[mid]
        if k == key:   data_off = off; break
        elif k < key:  lo = mid + 1
        else:          hi = mid - 1
    if data_off is None: return None                      # Bloom false positive
    with open(self.path, 'rb') as f:                      # Gate 3: disk read
        f.seek(self._data_start + data_off)
        key_len, val_len = struct.unpack('>HH', f.read(4))
        f.read(key_len)
        return f.read(val_len).decode() if val_len else None

Part C: Size-Tiered Compaction (SizeTieredLSMTree in lsm.py)

Leveled compaction maintains non-overlapping key ranges at each level and merges precisely. STCS is simpler: group SSTables by size, compact when a tier fills up. Files within a tier can overlap in key space.

class SizeTieredLSMTree:
    """
    Group SSTables by size tier. When any tier has ≥ MIN_THRESHOLD (4) files,
    compact all of them into one larger file.

    Key difference from leveled: no key-range partitioning constraint.
    Files within a tier may overlap → reads must check all files.
    But fewer compaction cycles → lower write amplification.
    """
    MIN_THRESHOLD = 4
    TIER_RATIO    = 4.0   # each tier is 4× larger than the previous

    def _tier_for(self, sst: SSTable) -> int:
        size, tier, threshold = max(sst.bytes_written, 1), 0, self.BASE_BYTES
        while size >= threshold:
            tier += 1; threshold = int(threshold * self.TIER_RATIO)
        return tier

    def _compact_tier(self, tier_files: list[SSTable]) -> None:
        # Merge: older first → newer overwrites (no key-range constraint needed)
        merged: dict[str, Optional[str]] = {}
        for sst in tier_files:
            for key, val in sst.scan():
                merged[key] = val   # newer overwrites older
        final = [(k, v) for k, v in sorted(merged.items()) if v is not None]
        # Write one merged SSTable, delete inputs
        new_sst = SSTable.flush(final, new_path)
        self.tracker.record_disk_write(new_sst.bytes_written)

The critical property: because STCS files can overlap in key space, compaction input is just “all files in this tier.” You don’t need to find overlapping files in the next level. This simplicity is why STCS has lower write amplification than leveled — fewer bytes rewritten per compaction event.

Running demo_compare_compaction():

Leveled vs. Size-Tiered Write Amplification  (80,000 writes)
──────────────────────────────────────────────────────────────
  Leveled (LSMTree)
    User bytes:     2,000,000
    Disk bytes:     6,905,684
    Write amp:           3.45x

  Size-Tiered (STCS)
    User bytes:     2,000,000
    Disk bytes:     6,211,538
    Write amp:           3.11x

STCS shows ~10% lower write amplification here. The gap grows at larger key spaces — see the sensitivity analysis in benchmark.py.

Part D: Three-way benchmark (benchmark.py)

Running run_comparison() (200K writes, 8K key space):

Write Amplification: 200,000 writes, key_space=8,000
──────────────────────────────────────────────────────────────────────
Strategy                         25%      50%      75%    Final
──────────────────────────────────────────────────────────────────────
LSM leveled (L0+L1)             2.2x     2.5x     2.6x    2.7x
LSM size-tiered (STCS)          2.2x     2.6x     2.7x    2.8x
B+ Tree (WAL+checkpoint)       29.9x    32.2x    33.0x   33.3x

And run_sensitivity() — varying key space across 5 orders of magnitude:

key_space    LSM leveled    LSM STCS    B+ Tree
────────────────────────────────────────────────
      500          0.01x       0.01x       4.65x
    2,000          0.06x       0.03x       8.56x
    8,000          2.75x       2.79x      24.13x
   30,000          6.27x       4.57x      69.93x
  100,000         10.73x       5.40x     131.76x

The crossover pattern tells the real story:

  • High key reuse (small key space): LSM WA is tiny because the MemTable absorbs and deduplicates many writes to the same key before they ever hit disk. B-Tree WA is modest because hot pages get many writes per checkpoint cycle (one dirty flush covers many user writes).

  • Large key space (insert-heavy): LSM WA grows as LT\approx L \cdot T — the same data is rewritten at each compaction level. B-Tree WA grows dramatically because every new page is written once at checkpoint regardless of how rarely it was touched. Without a large buffer pool, every checkpoint must flush every dirty page, regardless of page heat.

  • STCS consistently beats leveled at large key spaces (5.4x vs 10.7x at key_space=100K). This matches the theory: STCS has fewer compaction rounds per byte of user data. The penalty is read amplification — a point read must check all SSTables in all tiers rather than at most one per non-overlapping level.


9. Side-by-Side Comparison

DimensionPostgreSQLCassandraRocksDB
Write logWAL (physical/logical)Commit log (logical only)WAL (logical)
In-memory bufferBuffer pool (shared pages)MemTable per tableMemTable per column family
On-disk primary structureB+ Tree heap + indexesSSTables (LSM)SSTables (LSM)
CompactionN/A (VACUUM is different)STCS / LCS / TWCSLeveled / Universal / FIFO
Typical write amplification2–20x (workload dependent)3–10x10–30x (leveled)
Replication mechanismWAL streamingCoordinator → replicasApplication-level
MVCC approachHeap versioning (old tuples kept)Timestamps in data cellsSequence numbers per CF
Best write patternMixed OLTP, strong consistencyWrite-heavy, time-seriesEmbedded, extreme write throughput
Space amplification1.1–2x (VACUUM required)1.1–2x1.1–1.5x (leveled)

The “why” behind each cell

Commit log vs. WAL: Cassandra stores logical mutations (what values changed) rather than physical page deltas (which bytes in which page changed). This makes the commit log portable across node replacements (a new node can replay the commit log without needing a matching heap layout) but means you can’t use it for replication — the replica needs the full data, not page deltas.

MemTable per table: Cassandra’s table-level MemTables mean that a large write burst to one table doesn’t block flushes for other tables. The cost is that a node with 200 tables uses 200× the MemTable metadata overhead, and tuning memtable_heap_space_in_mb requires accounting for all tables.

VACUUM vs. compaction: PostgreSQL’s VACUUM is not the same as LSM compaction. VACUUM reclaims space from dead heap tuples created by MVCC updates. It does not reorganize the B-Tree structure. B-Tree pages can become bloated with dead tuples that VACUUM removes, but the page structure remains. LSM compaction is fundamentally different: it merges and rewrites entire SSTable files.


10. The Durability-Performance Trade-off Space

Durability is not binary. There is a spectrum from “data is in RAM” to “data is on disk in multiple data centers.” Every point on that spectrum has a latency cost.

HIGH DURABILITY

      │  PostgreSQL                 PostgreSQL
      │  sync_commit=on            sync_commit=on
      │  (WAL fsynced, one node)   + synchronous_standby

      │  Cassandra                 Cassandra
      │  CONSISTENCY=ONE           CONSISTENCY=QUORUM
      │                            (multi-DC)

      │  PostgreSQL                RocksDB
      │  sync_commit=off           sync=false
      │  (ack before fsync)        (ack before fsync)

      │  In-memory only            ← (no durability guarantee)

LOW   └────────────────────────────────────────────────────►
DURABILITY                                          HIGH THROUGHPUT

Key points on this map:

  • synchronous_commit = off (PostgreSQL): acknowledge before the WAL is fsynced. Risk: lose ≤200ms of committed transactions on crash. Not a corruption risk — the database remains internally consistent, you just lose recent work.

  • Cassandra CONSISTENCY=ONE: acknowledge as soon as one replica writes to its commit log. Risk: if that replica crashes before the MemTable is flushed, the commit log might not be replayed. Very fast (~1ms writes), low durability guarantee.

  • Cassandra CONSISTENCY=QUORUM with RF=3: two of three replicas must commit to their commit logs before acknowledging. This tolerates one replica failure. The standard production setting.

  • PostgreSQL with synchronous_standby_names: the primary waits for a named standby to apply WAL before acknowledging. This protects against primary failure but adds network RTT to every commit.

The non-obvious insight: there is no “durable” or “not durable.” Every system has a dial. Understanding the WAL and commit log mechanics lets you reason precisely about what you’re trading away when you move the dial.


11. Real Numbers

Grounding theory with actual published benchmark data:

PostgreSQL pgbench (TPC-B-like workload)

  • Hardware: NVMe SSD, 32 cores
  • synchronous_commit = on: ~15,000 TPS
  • synchronous_commit = off: ~45,000 TPS
  • The 3x improvement comes entirely from removing the fsync wait from the critical path. The actual data volume is identical.

RocksDB (Facebook internal benchmarks, 2022)

  • Bulk load on NVMe: ~400 MB/s sustained write throughput
  • Random writes at 120,000 ops/sec on a 3.8 TB dataset
  • Write amplification at steady state with leveled compaction: ~12–15x

Cassandra (DataStax benchmarks, various)

  • Write-heavy workload (INSERT only): 300,000–500,000 writes/sec per node on modern hardware with SATA SSD
  • p99 write latency with RF=3, CONSISTENCY=QUORUM: 1–3ms
  • p99 write latency with RF=3, CONSISTENCY=ONE: 0.3–0.8ms

All three strategies measured in benchmark.py — see Section 8 for the full output and analysis. Summary (200K writes, 8K key space):

Strategy               Final WA
──────────────────────────────────
LSM leveled (L0+L1)      2.7x
LSM size-tiered (STCS)   2.8x
B+ Tree (WAL+checkpoint) 33.3x

The B+ Tree WA is high in this simulation for the reasons explained in Section 8 (small ORDER relative to data size, no buffer pool caching across checkpoint cycles). In production PostgreSQL with shared_buffers = 8 GB and ORDER ≈ 200–400, the effective WA for hot workloads is typically 2–10x.


12. When to Choose What

Staff-level decision criteria — not “Cassandra is for writes”:

Choose B-Tree + WAL (PostgreSQL) when:

  • You need complex transactions: multi-table operations, complex WHERE clauses for updates, referential integrity via foreign keys. LSM trees (as storage engines) do not help you here — you need a full SQL engine with ACID transactions, and PostgreSQL is the benchmark.
  • Your working set fits in shared_buffers: if the hot pages stay in memory, B-Tree write amplification stays low and you get excellent point-read performance without Bloom filter overhead.
  • Read latency matters as much as write latency: B-Trees have lower read amplification than LSM trees (1–3 levels of B-Tree vs. checking multiple SSTables).
  • You have strong consistency requirements: PostgreSQL’s transaction isolation, WAL-based replication, and point-in-time recovery are mature and well-understood.

Choose LSM (Cassandra) when:

  • Write throughput is the primary bottleneck: LSM trees are write-optimized by design. If you need 100K+ writes/sec per node, Cassandra is the right choice.
  • Data has a clear partition key: Cassandra’s data model works well when you can express all queries as “get all rows for partition X” or “get rows for partition X with row key in range [a,b]”. Scatter-gather queries (WHERE on non-partition-key columns) are expensive.
  • Time-series or event-log workloads: TWCS compaction makes Cassandra extremely efficient for TTL-based time-series data. Entire time windows drop atomically.
  • Geographic distribution is required: Cassandra’s multi-master architecture handles multi-DC active-active replication natively.

Choose LSM (RocksDB) when:

  • You’re building a higher-level system: databases like TiKV (TiDB, CockroachDB), Kafka Streams state stores, and embedding into MySQL (MyRocks) all chose RocksDB for the same reason — extreme write throughput on flash with tunable trade-offs.
  • You need column families: if different “namespaces” within your store have different compaction requirements, column families let you tune them independently while sharing a WAL.
  • Operational complexity is acceptable: RocksDB’s tuning guide is extensive. Leveled vs. universal compaction, rate limiters, block cache sizing, write buffer sizing — there are dozens of knobs. This pays off at scale but adds operational burden.

The non-obvious advice

If your bottleneck is not write throughput, don’t choose an LSM tree. The operational complexity of compaction tuning — monitoring compaction debt, handling write stalls, managing space amplification during heavy compaction — is real. B-Trees are simpler to operate and reason about, and PostgreSQL’s query planner, indexing capabilities, and ecosystem are unmatched.

The question to ask is not “which system is faster?” but “which system’s bottlenecks align with my workload’s characteristics?“


13. Conclusion

Back to the 2 AM on-call scenario. The PostgreSQL dashboard shows spikes every few minutes: those are checkpoint I/O bursts, background writer flushing dirty pages accumulated since the last checkpoint. The Cassandra dashboard shows a steadier latency with occasional deeper pauses: those are compaction-induced write stalls when the L0 file count grows too high and the write path slows down to give compaction time to catch up.

Neither system is broken. Each is operating exactly as designed.

The mental models that stay with you:

  • WAL: “sequential is fast, so log changes before applying them.” The WAL is a sequential append-only file. Its LSN is a byte offset. Group commit batches N transactions behind a single fsync(). Recovery replays WAL from the last checkpoint forward. Streaming replication ships the same WAL bytes to replicas.

  • B-Tree + WAL: “organize data for reads, use WAL for write durability.” Writes land on in-memory pages, are logged in the WAL, and are eventually flushed to their random positions on disk at checkpoint. Write amplification is driven by the ratio of working set to buffer pool.

  • LSM Tree: “organize writes as sequential I/O, pay for reads with Bloom filters and compaction.” All writes go to RAM (MemTable), then flush sequentially to immutable SSTables. Reads check multiple SSTables filtered by Bloom filters. Compaction reclaims space and bounds read amplification at the cost of rewriting data multiple times.


Further Reading

  • O’Neil et al., “The Log-Structured Merge-Tree” (1996) — the original LSM paper. Surprisingly readable for an academic paper and covers the write amplification analysis rigorously.

  • PostgreSQL sourcesrc/backend/access/heap/heapam.c (heap inserts, MVCC), src/backend/storage/smgr/md.c (the “MD” magnetic-disk smgr layer), src/backend/access/transam/xlog.c (WAL write path). The comments are excellent.

  • RocksDB Wiki — “RocksDB Tuning Guide” — the most complete guide to the compaction knobs. Reading this after this post will make it much more tractable.

  • Huang et al., “WiscKey: Separating Keys from Values in SSD-Conscious Storage” (FAST 2016) — a critique of standard LSM design: when key-value pairs are large, separating them (keys in LSM, values in a value log) dramatically reduces compaction write amplification. Titan in TiKV implements this idea.

  • CASSANDRA-6696 — the Jira ticket for TWCS implementation. Reading the comment thread shows the engineering trade-offs made in the actual design: why the window boundary is strict, why cross-window compaction was explicitly disabled, how TTL interacts with compaction.

  • Athanassoulis et al., “Designing Access Methods: The RUM Conjecture” (EDBT 2016) — a theoretical framework showing that Read, Update, and Memory overheads form a trilemma: you can optimize for any two, but not all three. B-Trees optimize for reads; LSM trees optimize for updates (writes) at the cost of reads and space.

Thanks for reading.