RoaringBitmap: A Bitmap on Steroids

data structures

6 min read

RoaringBitmap is a compressed Bitmap format designed to efficiently represent sets of integers. It strikes an excellent balance between memory efficiency and performance and enables fast operations like union, intersection, and iteration.

Thanks to its effectiveness, RoaringBitmap has become the de facto Bitmap implementation in systems such as Apache Lucene, Apache Druid, ClickHouse, InfluxDB, and many others.

In this article, we'll explore the internal encoding strategies used in RoaringBitmap, how it compresses data, why these strategies are effective, and an example to see it in practice.

Note that this article is based on the implementation of a RoaringBitmap used in Apache Lucene. While other systems may introduce slight variations in how it is implemented or optimized, the core principles remain consistent across most implementations.

Also, please consider reading this article on the internals of a Bitmap to understand this article better.


How does this work?

Instead of using a single large Bitmap for the entire 32-bit integer space, RoaringBitmap splits the space into smaller chunks called blocks or containers. Each block handles a fixed range of values and is compressed independently, depending on how many values it contains.

Each 32-bit integer is split into:

This means each block manages a range of 2¹⁶ = 65536 possible values. How a block is stored depends on how many values are present in it.

RoaringBitmap uses three encoding strategies:

Sparse blocks (≤ 4096 set values)

If a block contains up to 4096 values, it is considered sparse. In this case, the values are stored directly in a sorted short[] (since 16 bits are enough to represent any value within the block, and a short takes only 16 bits).

This format is compact and fast for small sets.

Dense blocks (≥ 61440 set values)

If a block contains 61440 or more set values, it is considered dense. Since the total range is 65536, this means that fewer than 4096 (the difference between 65536 and 61440) values are not set.

Instead of storing all the set values (which would take more space), RoaringBitmap stores the complement: the list of unset values, again using a short[]. When reading the block, the logic is inverted to return the correct set of values.

This approach saves space when a block is nearly full.

Medium density block (4097–61439 set values)

If a block contains a moderate number of values, which is more than 4096 but fewer than 61440, RoaringBitmap uses a Bitmap representation. This is a 65536-bit (8KB) fixed-size bitset, where each bit directly indicates the presence or absence of a value.

This strikes a balance between memory use and access speed.


Why these thresholds?

You might wonder: why exactly 4096 is better suited for an array than a Bitmap? Is it just a magic number?

No, this is not a magic number. This is chosen simply because above this number of documents in a block, a Bitmap becomes more memory-efficient than an array. Let's see the maths behind it.

Threshold for Sparse

A short[] stores each offset using 2 bytes.

So, if a block contains ≤ 4096 values, the array is smaller or equal in size compared to a Bitmap, making it more efficient.

Threshold for Dense

If a block has ≥ 61440 values set, then ≤ 4096 values are unset.

Since we're storing fewer items, it's more efficient to invert the logic and represent the block as a sparse list of absent values.


Let's see an example

In the example above, we first find the block for each number and its corresponding index in each block.

Once that is defined, we group them together and then choose the right encoding strategy based on the cardinality of the block.

In this example, all the blocks would be categorised as sparse and will use a short[].


Closing notes

RoaringBitmap is a powerful and practical data structure that offers a highly efficient way to represent large sets of integers. Its clever use of multiple encoding strategies makes it suitable for a wide range of use cases where both performance and memory usage are critical.

If you're interested in digging deeper into how RoaringBitmap works or seeing it in action across different platforms, here are some great resources to explore:

https://roaringbitmap.org/
https://www.elastic.co/blog/frame-of-reference-and-roaring-bitmaps
https://jvns.ca/blog/2016/01/23/fast-integer-sets-with-roaring-bitmaps/
https://github.com/RoaringBitmap/RoaringBitmap

Hope you learned something new today. If you like geeking out over clever engineering, check out my other blogs here.

Thank you for reading. Happy coding!