RAVENS: Decompressing with only 4kB of memory

When trying to port BSDiff to our platform, we encountered two main challenges. The first was performing the update in place, which resulted in us solving the WbR conflict problem. The second was a bit more intrinsic to the original approach: the idea of compressing part of the payload. Due to our highly stringent memory requirements (the target I had in mind was 16kB of RAM), we had to find a compression algorithm enabling decompression with very little memory, and very little access to the already decompressed data (we wanted to stream the decompressed data into the RAM buffer, not decompress the full payload).

In this article, I’ll describe how we proceeded, and why and how we built a compression algorithm designed to be decompressed in a 4kB rolling buffer.
First though, a small disclaimer: I'm not at all specialized in compression algorithms or information theory, and thus had to keep this relatively simple. We were also fairly time constrained.

Existing solutions

Fundamentally, data compression is a means to exchange space for computation. Systems often rely on building a representation of patterns, and referring to it as a way to describe the actual stream. Hopefully, the patterns are large enough for the references to be substantially smaller than the patterns they replace, thus resulting in a good compression ratio.
BSDiff uses the BZ2 algorithm to perform its compression. After some research, we realized that the memory footprint of decompression algorithms isn't a common concern. We finally ran into the following benchmarks, disclosing the memory usage of multiple algorithms including BZ2. First, the unit used made us feel queasy, as we're trying to fit in a handful of kilobytes, not even having megabytes of memory (all types combined!). The data showed that bzip2 was unusable, even at very low compression levels. GZip is a bit of an oddball: it has a reputation for excellent memory efficiency, but we couldn't find a clear benchmark to make sure we wouldn't waste our time making our changes.
We therefore had a quick look at some esoteric compression algorithms, such as Apple's LZFSE (which requires 47kB of memory) and LZFX (which only requires 8kB of memory!). After discovering LZFX, we stopped our search for alternative compression algorithms and decided to dive deeper into the ways we could halve the memory requirement of the decompression routine.
This was because we were looking for an algorithm optimized for the memory we had available. Smaller footprints were possible (for instance, heatshrink advertises 50 bytes!) but they make compromises we didn't necessarily want. Algorithms are optimized for their use cases, and we liked what we were seeing with LZFX.

LZFX

LZFX is a compression algorithm designed as a reboot of the venerable libLZF (first released in 2000), sharing the same format.
The way LZFX works is fairly simple. The compressed file is a chain of small byte-aligned tokens encoding various patterns. Those patterns either provide new data to append to the decompressed stream, or a relative offset and a length, used to repeat a pattern we already decompressed (basically, we re-read that many bytes starting N bytes before our writing head, where N is the specified offset). The tokens are the following:

  • 0b000LLLLL: In one byte, five bits (L) are used to encode a length. This pattern is used to encode a sequence of uncompressed new data to insert (the uncompressed bytes follow);
  • 0bLLLooooo 0boooooooo: Over two bytes, 3 bits are used to encode a length (L) and the rest encode a 13-bit offset (o, maximum 8192). This is used as an efficient way to code a small pattern reuse. It is important to note that LLL can't be 111 (i.e. 7), as it would collide with the last pattern;
  • 0b111ooooo 0bLLLLLLLL 0boooooooo: Over three bytes, 8 bits are used to encode the length (L + 8), on top of the 13 bits still dedicated to the offset (o).

Multiple optimizations are added on top of that, such as adding one to any value that can never be equal to 0 (the length, for instance), but this is the gist of the algorithm.
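To make the token layout concrete, here is a minimal sketch of an LZF-style decoder loop written from the description above. The exact bias values (+1 on the offset, +2 on the match length) follow libLZF's conventions as I understand them, so treat this as an illustration rather than the reference implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Minimal LZF-style decoder loop; returns the decompressed size, or 0 on error. */
size_t lzf_style_decompress(const uint8_t *in, size_t inLen,
                            uint8_t *out, size_t outCap)
{
    size_t ip = 0, op = 0;

    while (ip < inLen) {
        uint8_t ctrl = in[ip++];

        if (ctrl < 0x20) {                      /* 0b000LLLLL: literal run */
            size_t len = (size_t)ctrl + 1;
            if (ip + len > inLen || op + len > outCap) return 0;
            while (len--) out[op++] = in[ip++];
        } else {                                /* back-reference */
            size_t len = ctrl >> 5;
            size_t off = (size_t)(ctrl & 0x1F) << 8;

            if (len == 7) {                     /* 0b111ooooo 0bLLLLLLLL 0boooooooo */
                if (ip >= inLen) return 0;
                len += in[ip++];
            }
            if (ip >= inLen) return 0;          /* low offset byte (both forms) */
            off |= in[ip++];

            off += 1;                           /* the offset field can never be 0 */
            len += 2;                           /* match lengths start at 2 */

            if (off > op || op + len > outCap) return 0;
            while (len--) {                     /* re-read bytes `off` behind the write head */
                out[op] = out[op - off];
                op++;
            }
        }
    }
    return op;
}
```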

LZFX-4K

Our main concern with this format is that it allows offsets that are too large: at any point, we need to be able to look up to 8192 bytes (8kB) back. Considering we would like to fit in 4kB (assuming we can stream the decompression), all we need to do is to shave a bit off the offset. On top of that, after some benchmarks of our use case, which contains a lot of regular data (0s), we made some more profound changes to the format in order to raise the compression ratio.
LZFX-4K codes the three tokens as follows:

  • 0b0LLLLLLL: The literal token can now use 7 bits to encode the length, letting us emit it less often when inserting large chunks of new data (such as an extra segment);
  • 0b1LLLLLoo 0boooooooo: Back-references coded over two bytes are meant to be a shortcut, used as often as possible (as it's cheaper). Our approach uses 5 bits to code the length and 10 bits to code the offset. 10 bits isn't enough to code 4096 (the maximum offset), but we're assuming (and measuring) that moderately long patterns at fairly short offsets are common enough to make this approach worthwhile;
  • 0b11111Loo 0boooooooo 0booLLLLLL: Over three bytes, 7 bits are used to code the length (to which 0b11111 = 31 is added when we can be sure the length field overflowed, that is when the offset is less than or equal to the two-byte maximum offset). On top of that, the full 12 bits necessary to code 4096 are allocated to the offset (👻).

This approach surrenders 1 bit that was previously used to code the length, meaning we may have to repeat the three-byte token if very long sequences (> 160 bytes, vs. 264 in LZFX) are repeated. In exchange, we can use the smaller token more often. The compression ratio of LZFX-4K is still mediocre, shrinking files to only 0.69× their original size, but it is basically equivalent to the real LZFX (0.70×, so very slightly worse than ours). In this specific benchmark, BZ2 achieved a ratio of 0.59×.
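To illustrate the new layout, here is a small sketch of how an encoder could choose between the two back-reference forms. Only the field widths come from the description above; the bit ordering inside each byte, and the omission of the "+1" bias tricks, are assumptions of mine.

```c
#include <stddef.h>
#include <stdint.h>

/* Emit one back-reference token into dst, returning the number of bytes written.
   Matches longer than the three-byte form can express must be split by the caller. */
static size_t emit_backref(uint8_t dst[3], uint32_t offset, uint32_t length)
{
    if (length <= 30 && offset < (1u << 10)) {
        /* 0b1LLLLLoo 0boooooooo: the cheap two-byte form (a length field of
           0b11111 is reserved, as it would collide with the three-byte prefix) */
        dst[0] = (uint8_t)(0x80 | (length << 2) | ((offset >> 8) & 0x03));
        dst[1] = (uint8_t)(offset & 0xFF);
        return 2;
    }

    /* 0b11111Loo 0boooooooo 0booLLLLLL: three-byte form, 12-bit offset */
    uint32_t l = length;
    if (offset < (1u << 10))
        l -= 31;   /* the +31 bias only applies when the two-byte form could have
                      coded this offset, so using the long form implies a long length */

    dst[0] = (uint8_t)(0xF8 | (((l >> 6) & 0x01) << 2) | ((offset >> 10) & 0x03));
    dst[1] = (uint8_t)((offset >> 2) & 0xFF);
    dst[2] = (uint8_t)(((offset & 0x03) << 6) | (l & 0x3F));
    return 3;
}
```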

Streaming the decompression?

Alright. Whether we use LZFX-4K or some other decompression algorithm, let's say we have a working routine. Now what? We don't have enough memory to fully decompress the data, and we therefore need some kind of streaming that, at any point, allows access up to 4096 bytes backward. Ideally, this would fit in a 4kB buffer. Although simple streaming is easy to imagine, it runs into a bunch of practical details that can be summarized by two questions: “What if a read starts in the available buffer but lands past its end?” and the corollary “What if a decompression token provides more data than we can fit in the buffer?”.

The answer to the first question is fairly simple: all reads from the buffer go through a couple of helpers which, when we get close to the end of the buffer, call consumeByte. Whenever the last byte is consumed and a new one is necessary, the buffer is refilled using lzfx_decompress. The current reading position is tracked using a BSDiffContext structure.
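As a rough illustration, the read path could look like the sketch below. Apart from consumeByte, lzfx_decompress and BSDiffContext, every name, field and signature here is hypothetical; the real structures carry quite a bit more state.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct Lzfx4KContext Lzfx4KContext;    /* sketched a bit further below */

typedef struct {
    uint8_t        buffer[4096];   /* rolling window of decompressed data */
    size_t         readPos;        /* how far the patcher has read into the window */
    size_t         available;      /* number of valid bytes currently in the window */
    Lzfx4KContext *lzfx;           /* state of the (possibly suspended) decompression */
} BSDiffContext;

/* Assumed to refill ctx->buffer and update ctx->available. */
void lzfx_decompress(BSDiffContext *ctx);

static uint8_t consumeByte(BSDiffContext *ctx)
{
    if (ctx->readPos == ctx->available) {
        /* Every decompressed byte has been consumed: pull in the next chunk. */
        lzfx_decompress(ctx);
        ctx->readPos = 0;
    }
    return ctx->buffer[ctx->readPos++];
}
```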

The lzfx_decompress util uses a modified implementation of the LZFX-4K decompression code. The function takes the lzfx member of the BSDiffContext structure, which is a Lzfx4KContext structure. Substantially more complete, this structure carries around the general context of LZFX-4K, but also the details of the potentially interrupted token. lzfx_decompress will then call resumeCurrentSegment, trying to finish the last token (or suspend it again if it fully fills the buffer). Once the old state has been processed, we can proceed to read the next tokens, occasionally looking back in the buffer at old data.
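A plausible shape for that suspended-token state, purely as a sketch (the field names and layout are illustrative, not the actual ones):

```c
#include <stddef.h>
#include <stdint.h>

struct Lzfx4KContext {
    const uint8_t *input;          /* current position in the compressed stream */
    size_t         inputLeft;      /* compressed bytes still to read */

    /* State of a token that was interrupted by a full output buffer */
    enum { TOKEN_NONE, TOKEN_LITERAL, TOKEN_BACKREF } pendingKind;
    size_t pendingLength;          /* bytes of that token still left to produce */
    size_t pendingOffset;          /* back-reference offset, when relevant */
};
```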
The back-references are handled through two nifty utils which hide the ring-buffer mechanics from the actual decompression code, including the case in which a read starts near the end of the buffer and needs to wrap around to the start (the principle of a ring buffer).
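The wrap-around handling those utils provide boils down to something like the following sketch; the helper name and signature are illustrative.

```c
#include <stddef.h>
#include <stdint.h>

#define WINDOW_SIZE 4096u

/* Re-read `length` bytes that start `offset` bytes behind the write head,
   wrapping around the 4kB window whenever either position reaches its end. */
static size_t copyBackReference(uint8_t window[WINDOW_SIZE], size_t writePos,
                                size_t offset, size_t length)
{
    size_t readPos = (writePos + WINDOW_SIZE - offset) % WINDOW_SIZE;

    for (size_t i = 0; i < length; i++) {
        window[writePos] = window[readPos];
        writePos = (writePos + 1) % WINDOW_SIZE;
        readPos  = (readPos  + 1) % WINDOW_SIZE;
    }
    return writePos;   /* new position of the write head */
}
```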

Through careful abstraction of the reads, we could turn a piece of code highly vulnerable to buffer overruns into something we feel fairly confident in, without excessive effort.

Final words

When the work on RAVENS first started, I didn't really expect to have to customize a compression algorithm, nor to do pointer gymnastics. A nice surprise of this system is that it doesn't require any resiliency: it's lossless, and the decompression can thus be replayed freely, without any impact on context restoration.
All in all, this was a very interesting exercise and I think RAVENS is better with LZFX-4K than it would have been otherwise. Feel free to fire off a message if you think I overlooked a better alternative, or if I went a bit too fast over some notions.