Why Modern Databases Rely on LSM Trees

When you think of databases, B-Trees often come to mind. But for modern, write-heavy workloads, B-Trees quickly run into limitations. That's where the Log-Structured Merge (LSM) engine comes in: a storage design that powers systems like RocksDB, Cassandra, HBase, and more.

While each database has its own implementation details, the core mechanics of LSM engines remain the same. Understanding these not only demystifies how your database works under the hood but also helps you tune it for your workload.

Let's unpack the internals step by step.


Core components of an LSM engine

An LSM engine is composed of a few core components that work together to balance speed and reliability:

Write-ahead log (WAL)

Every write first lands in the WAL, which is a sequential log on disk. This guarantees durability in case of crashes or restarts.
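As a rough sketch of the idea (not any particular engine's on-disk format), a WAL can be as simple as an append-only file that is fsync'd before the write is acknowledged:

```python
import json
import os

class WriteAheadLog:
    """Append-only log: every write is recorded and fsync'd before
    the engine acknowledges it, so a crash cannot lose it."""

    def __init__(self, path):
        self.path = path
        self.f = open(path, "a", encoding="utf-8")

    def append(self, key, value):
        # One JSON record per line keeps crash recovery trivial.
        self.f.write(json.dumps({"k": key, "v": value}) + "\n")
        self.f.flush()
        os.fsync(self.f.fileno())  # durable before the write is acknowledged

    def replay(self):
        # On restart, replay the log to rebuild the lost memtable.
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                rec = json.loads(line)
                yield rec["k"], rec["v"]
```

Real engines use a compact binary record format with checksums, but the contract is the same: nothing is acknowledged until it is on the log.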

Memtable

An in-memory data structure (often a skip list) that holds recent writes. Because it's in memory, inserts, updates, and deletes are fast. When the memtable fills up, it's frozen and queued for flushing to disk.
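A toy memtable illustrating the idea, with a plain dict standing in for the skip list a real engine would use to keep keys sorted during inserts:

```python
TOMBSTONE = None  # sketch only: None marks a deleted key

class Memtable:
    """In-memory write buffer for recent writes."""

    def __init__(self, max_entries=1024):
        self.entries = {}
        self.max_entries = max_entries

    def put(self, key, value):
        self.entries[key] = value

    def delete(self, key):
        # Deletes are writes too: record a tombstone that compaction
        # will later reconcile with older versions of the key.
        self.entries[key] = TOMBSTONE

    def is_full(self):
        return len(self.entries) >= self.max_entries

    def flush(self):
        # Freeze and emit entries in sorted key order: exactly the
        # shape a new SST file needs.
        return sorted(self.entries.items())
```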

SST files

Immutable, sorted files on disk known as Sorted String Tables (SSTs). Each file covers a key range and stores data in sorted order. SSTs generally include sub-structures like:

- Data blocks that hold the sorted key-value pairs
- An index block that maps key ranges to data block locations
- A Bloom filter that lets reads quickly rule out files that cannot contain a key

Block cache

Frequently accessed SST blocks are cached in memory (commonly with an LRU cache) to speed up reads.
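A minimal LRU sketch of the idea; `block_id` and `load_block` are illustrative stand-ins for an engine's (file, offset) cache key and disk-read callback:

```python
from collections import OrderedDict

class BlockCache:
    """Tiny LRU block cache: recently used SST blocks stay in memory."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, block_id, load_block):
        if block_id in self.cache:
            self.hits += 1
            self.cache.move_to_end(block_id)   # mark most recently used
            return self.cache[block_id]
        self.misses += 1
        block = load_block(block_id)           # fall back to a disk read
        self.cache[block_id] = block
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)     # evict least recently used
        return block
```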


The write path

Writes are where LSM engines shine. Here's what happens step by step:

1. The write is appended to the WAL, making it durable.
2. The same write is inserted into the memtable.
3. When the memtable reaches its size limit, it is frozen, and a fresh memtable takes over for new writes.
4. A background job flushes the frozen memtable to disk as a new SST file, after which its WAL entries can be discarded.
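This flow can be sketched end to end; `engine` here is just a dict standing in for real engine state (a WAL list, a memtable dict, a flush threshold, and a list of SSTs), not any actual API:

```python
def write(engine, key, value):
    """Toy write path: WAL first, then memtable, then maybe a flush."""
    engine["wal"].append((key, value))       # durability first
    engine["memtable"][key] = value          # fast in-memory insert
    if len(engine["memtable"]) >= engine["memtable_limit"]:
        # Freeze and flush: sorted entries become an immutable SST,
        # and the WAL entries covered by the flush can be discarded.
        engine["ssts"].append(sorted(engine["memtable"].items()))
        engine["memtable"] = {}
        engine["wal"].clear()
```

Note that every step is either an append or an in-memory operation: there are no random disk writes, which is exactly why this design is fast for write-heavy workloads.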


Compaction: cleaning up the mess

Compaction is the LSM engine's form of garbage collection. It:

- Merges overlapping SST files into fewer, larger ones
- Keeps only the newest version of each key, discarding overwritten values
- Purges deleted entries once their tombstones are no longer needed
- Bounds the number of files a read has to consult

While compaction itself is a huge topic, here are the two most common strategies:

Tiered compaction

Files of similar size are merged repeatedly into bigger ones. This reduces write overhead but may leave more files to check during reads. HBase uses this kind of compaction strategy.

Leveled compaction

Data is organized into levels (L0, L1, L2, …), with each level having a fixed size. Within a level, each key range appears only once. This speeds up reads but increases write amplification. RocksDB uses this compaction strategy by default.

Compaction is crucial, but it comes at a cost: it consumes CPU, I/O, and temporary disk space. It is also the source of the well-known write amplification problem, where a single logical write ends up being rewritten to disk many times over its lifetime.
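The merge at the heart of either strategy can be sketched as follows. This toy version merges just two sorted runs and uses None as a tombstone marker; real compactions do a streaming k-way merge over many files:

```python
TOMBSTONE = None  # sketch only: None marks a deleted key

def compact(older, newer):
    """Merge two sorted runs of (key, value) pairs into one.
    On duplicate keys the newer run wins; tombstones (and the
    older values they shadow) are dropped from the output."""
    merged = dict(older)
    merged.update(newer)  # newer versions override older ones
    return sorted((k, v) for k, v in merged.items() if v is not TOMBSTONE)
```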


The read path

Reads are more involved than writes because the data for a key can live in multiple places. A lookup proceeds in order:

1. Check the active memtable.
2. Check any frozen memtables still waiting to be flushed.
3. Search SST files from newest to oldest, using Bloom filters and the block cache to skip files and disk reads where possible. The first match wins.

Because a read may involve consulting several components, it is generally slower than a write. This becomes even more noticeable during range scans, where multiple overlapping SST files may need to be read to cover the full requested range.
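A toy sketch of the lookup order, assuming engine state is a dict holding a memtable and a list of SSTs with the newest last (not a real API):

```python
def get(engine, key):
    """Toy read path: memtable first, then SSTs from newest to oldest."""
    if key in engine["memtable"]:
        return engine["memtable"][key]
    for sst in reversed(engine["ssts"]):  # newest SST first
        # SSTs are sorted, so a real engine would binary-search them
        # (after a Bloom-filter check); a linear scan keeps this short.
        for k, v in sst:
            if k == key:
                return v
    return None  # key not found anywhere
```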


Ending notes

LSM storage engines are among the most widely used designs in modern databases. Although implementations differ, the core principles remain the same: optimize for writes, manage data with compaction, and balance the trade-off between speed and efficiency.

Grasping these fundamentals not only reveals what's happening under the hood but also equips you to tune and optimize any LSM-based database for your own workload.

This article is just a starting point. The world of LSM engines is far richer than what can be covered here, with countless variations, optimizations, and trade-offs that shape how different systems perform. I encourage you to keep exploring.

Thank you for reading.

I regularly read a wide range of technical books and share my learnings here, whether it's database internals, system design, or just clever engineering ideas. Feel free to explore the rest of the blogs if that sounds interesting to you.


Happy coding!