Why Modern Databases Rely on LSM Trees

When you think of databases, B-Trees often come to mind. But for modern, write-heavy workloads, B-Trees quickly run into limitations. That's where the Log-Structured Merge (LSM) engine comes in: a storage design that powers systems like RocksDB, Cassandra, HBase, and more.

While each database has its own implementation details, the core mechanics of LSM engines remain the same. Understanding these not only demystifies how your database works under the hood but also helps you tune it for your workload.

Let's unpack the internals step by step.


Core components of an LSM engine

An LSM engine is composed of a few core components that work together to balance speed and reliability:

Write-ahead log (WAL)

Every write first lands in the WAL, which is a sequential log on disk. This guarantees durability in case of crashes or restarts.
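As a rough sketch of the idea (not any particular engine's on-disk format), a WAL can be as simple as an append-only file that is fsync'd before the write is acknowledged:

```python
import json
import os

class WriteAheadLog:
    """Append-only log: every write is recorded and fsync'd before
    the engine acknowledges it, so a crash cannot lose it."""

    def __init__(self, path):
        self.path = path
        self.f = open(path, "a", encoding="utf-8")

    def append(self, key, value):
        # One JSON record per line keeps crash recovery trivial.
        self.f.write(json.dumps({"k": key, "v": value}) + "\n")
        self.f.flush()
        os.fsync(self.f.fileno())  # durable before the write is acknowledged

    def replay(self):
        # On restart, replay the log to rebuild the lost memtable.
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                rec = json.loads(line)
                yield rec["k"], rec["v"]
```

Real engines use a compact binary record format with checksums, but the contract is the same: nothing is acknowledged until it is on the log.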

Memtable

An in-memory data structure (often a skip list) that holds recent writes. Because it's in memory, inserts, updates, and deletes are fast. When the memtable fills up, it's frozen and queued for flushing to disk.
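A toy memtable illustrating the idea, with a plain dict standing in for the skip list a real engine would use to keep keys sorted during inserts:

```python
TOMBSTONE = None  # sketch only: None marks a deleted key

class Memtable:
    """In-memory write buffer for recent writes."""

    def __init__(self, max_entries=1024):
        self.entries = {}
        self.max_entries = max_entries

    def put(self, key, value):
        self.entries[key] = value

    def delete(self, key):
        # Deletes are writes too: record a tombstone that compaction
        # will later reconcile with older versions of the key.
        self.entries[key] = TOMBSTONE

    def is_full(self):
        return len(self.entries) >= self.max_entries

    def flush(self):
        # Freeze and emit entries in sorted key order: exactly the
        # shape a new SST file needs.
        return sorted(self.entries.items())
```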

SST files

Immutable, sorted files on disk known as Sorted String Tables (SSTs). Each file covers a key range and stores data in sorted order. SSTs generally include sub-structures like:

- Data blocks that hold the sorted key-value pairs
- An index block that maps key ranges to data block locations
- A Bloom filter that lets reads quickly rule out files that cannot contain a key

Block cache

Frequently accessed SST blocks are cached in memory (commonly with an LRU cache) to speed up reads.
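A minimal LRU sketch of the idea; `block_id` and `load_block` are illustrative stand-ins for an engine's (file, offset) cache key and disk-read callback:

```python
from collections import OrderedDict

class BlockCache:
    """Tiny LRU block cache: recently used SST blocks stay in memory."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, block_id, load_block):
        if block_id in self.cache:
            self.hits += 1
            self.cache.move_to_end(block_id)   # mark most recently used
            return self.cache[block_id]
        self.misses += 1
        block = load_block(block_id)           # fall back to a disk read
        self.cache[block_id] = block
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)     # evict least recently used
        return block
```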


The write path

Writes are where LSM engines shine. Here's what happens step by step:

1. The write is appended to the WAL, making it durable.
2. The same write is inserted into the memtable.
3. When the memtable reaches its size limit, it is frozen, and a fresh memtable takes over for new writes.
4. A background job flushes the frozen memtable to disk as a new SST file, after which its WAL entries can be discarded.
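This flow can be sketched end to end; `engine` here is just a dict standing in for real engine state (a WAL list, a memtable dict, a flush threshold, and a list of SSTs), not any actual API:

```python
def write(engine, key, value):
    """Toy write path: WAL first, then memtable, then maybe a flush."""
    engine["wal"].append((key, value))       # durability first
    engine["memtable"][key] = value          # fast in-memory insert
    if len(engine["memtable"]) >= engine["memtable_limit"]:
        # Freeze and flush: sorted entries become an immutable SST,
        # and the WAL entries covered by the flush can be discarded.
        engine["ssts"].append(sorted(engine["memtable"].items()))
        engine["memtable"] = {}
        engine["wal"].clear()
```

Note that every step is either an append or an in-memory operation: there are no random disk writes, which is exactly why this design is fast for write-heavy workloads.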


Compaction: cleaning up the mess

Compaction is the LSM engine's form of garbage collection. It:

- Merges overlapping SST files into fewer, larger ones
- Keeps only the newest version of each key, discarding overwritten values
- Purges deleted entries once their tombstones are no longer needed
- Bounds the number of files a read has to consult

While compaction itself is a huge topic, here are the two most common strategies:

Tiered compaction

Files of similar size are merged repeatedly into bigger ones. This reduces write overhead but may leave more files to check during reads. HBase uses this kind of compaction strategy.

Leveled compaction

Data is organized into levels (L0, L1, L2, …), with each level having a fixed size. Within a level, each key range appears only once. This speeds up reads but increases write amplification. RocksDB uses this compaction strategy by default.

Compaction is crucial, but it comes at a cost: it consumes CPU, I/O, and temporary disk space. It is also the source of the well-known write amplification problem, where a single logical write ends up being rewritten to disk many times over its lifetime.
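The merge at the heart of either strategy can be sketched as follows. This toy version merges just two sorted runs and uses None as a tombstone marker; real compactions do a streaming k-way merge over many files:

```python
TOMBSTONE = None  # sketch only: None marks a deleted key

def compact(older, newer):
    """Merge two sorted runs of (key, value) pairs into one.
    On duplicate keys the newer run wins; tombstones (and the
    older values they shadow) are dropped from the output."""
    merged = dict(older)
    merged.update(newer)  # newer versions override older ones
    return sorted((k, v) for k, v in merged.items() if v is not TOMBSTONE)
```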


The read path

Reads are more involved than writes because the data for a key can live in multiple places. A lookup proceeds in order:

1. Check the active memtable.
2. Check any frozen memtables still waiting to be flushed.
3. Search SST files from newest to oldest, using Bloom filters and the block cache to skip files and disk reads where possible. The first match wins.

Because a read may involve consulting several components, it is generally slower than a write. This becomes even more noticeable during range scans, where multiple overlapping SST files may need to be read to cover the full requested range.
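A toy sketch of the lookup order, assuming engine state is a dict holding a memtable and a list of SSTs with the newest last (not a real API):

```python
def get(engine, key):
    """Toy read path: memtable first, then SSTs from newest to oldest."""
    if key in engine["memtable"]:
        return engine["memtable"][key]
    for sst in reversed(engine["ssts"]):  # newest SST first
        # SSTs are sorted, so a real engine would binary-search them
        # (after a Bloom-filter check); a linear scan keeps this short.
        for k, v in sst:
            if k == key:
                return v
    return None  # key not found anywhere
```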


Ending notes

LSM storage engines are among the most widely used designs in modern databases. Although implementations differ, the core principles remain the same: optimize for writes, manage data with compaction, and balance the trade-off between speed and efficiency.

Grasping these fundamentals not only reveals what's happening under the hood but also equips you to tune and optimize any LSM-based database for your own workload.

This article is just a starting point. The world of LSM engines is far richer than what can be covered here, with countless variations, optimizations, and trade-offs that shape how different systems perform. I encourage you to keep exploring.

Thank you for reading.

I regularly read a wide range of technical books and share my learnings here, whether it's database internals, system design, or just clever engineering ideas. Feel free to explore the rest of the blogs if that sounds interesting to you.


Happy coding!