RAVENS: How a small problem turned big, or the tale of WbR conflicts

While working on the project that led to RAVENS, Write-before-Read (WbR) conflicts appeared a little bit out of nowhere and almost became a showstopper. The algorithm Hugin uses to get rid of them is by far the most complex and time intensive part of the whole project. In this article, I’ll try to describe why they are such a big deal, how we got rid of them and why it took me so long. I would also like to apologize the poor showing of my artistic skills.

The context

A delta file format is a data structure meant to detail the differences between two files, so that the first could be regenerated from the second and a “delta” file, describing the differences. One of the best delta file format that I know of is the one generated by BSDiff, made by Colin Percival a few years ago.

How BSDiff work

The BSDiff philosophy is fairly simple: find sections of old data that somewhat look like the final data. Combine those old data with corrections from the Delta Field so that the resulting data are exactly what you’re looking for. Whenever there is too little matching old data for the correction to be worth it, use data from the Extra Field, which are then simply appended to the end.

The astute reader will notice but the Delta Field, added to the Extra Field have the same size as the final file! With the metadata added on top of that, it doesn’t sound really efficient! The trick of the format if that the Delta Field is very regular (the data we’re reusing is mostly the same as the final data, so the Delta Field is full of “no change” symbols, usually 0). Being regular mean that it will compress really well! And indeed it does. The Delta Field and the Extra Field are compressed using BZ2 and thus reach higher “compression ratio” (size of delta vs size of full file) than its competition, that only use exactly matching sections of old data. BSDiff is however designed to build a new file. It takes the old file, the delta and will perform random reads in the old file and semi-random in the delta while creating a new file. In the diagram below, you can see a file being rebuilt. The Control section is read as a sequence of a repeating pattern. In this case, the three numbers we read are A, D and X:

  • A is the position of the reading head in the old file (the point from which we will start reading, the first blue bar);
  • D is the length to read from the old file (how far ahead is the second blue bar) and the Delta Field (we progress linearly in the delta file, starting from where we left off at the first green bar);
  • X is the length to read from the Extra Field. Again, the first red bar is where we stopped reading last time.

The data from the old file and the Delta Field is then read, combined and appended to the new file. After it, the Extra section is added and we read the next control pattern.

A file (bottom) being built using a BSDiff delta (left) and the old file (top)

This should give you an understanding of how BSDiff work, but you can see it is designed to work with two files (the old file and the new file) that must be independent. We’re looking for a system able to perform the update in-place, that is where the new file and the old file are one and the same. It can’t be that hard, right?

The Problem

Okay, I think you read the title of this post so you can guess where this is going: it is surprisingly hard. The main problem is that although the Delta and Extra field are read sequentially, reads from the the old file (A) are random, meaning the first read can be from the end of the old file, and the next from the start of the old file. This is a massive problem because as we’re writing the new file, we’re erasing the content that was previously at the address, meaning we must having read all we needed from an address before overwriting its content. Any instance where this doesn’t happen is called a Write-before-Read (WbR) conflict.

Now, you may ask, why not reorder the control segments so that A are sequentially sorted? This wouldn’t really solve anything because the destination address are now randomly ordered (meaning, we may try to write the data we just read to higher addresses, that we may have not yet read). Therefore, we will still have WbR conflicts.

Now, a common solution to this problem is to look closely at the problematic operations (the ones reading from, and the one overwritting) and see if we could reorder them so that every read is performed before the write. That’s often possible, but also often insufficient as there are multiple conflict patterns that reject this solution (see below).

Cyclical WbR conflict

In this case, the typical solution is to find the smallest move of data (let’s say… D), and copy D… somewhere. Those systems usually reserve some spare space for this kind of backup, or embed D in the update payload when there is no space left.

This solution has two major issues: it is inefficient (larger updates or larger flash footprint, both of which are bad) and it doesn’t work well in more complex conflict situations (imagine if every node sends and receives data from every other node!).

A key feature of Hugin is solving the WbR conflict problem without any loss of efficiency at runtime. We’ll jump in our solution after a last bit of context.

Hugin memory constraints

As mentioned before, any code trying to write to the flash memory require a RAM buffer as large as a single flash page, in order to save the untouched content when erasing the page. Hugin was only allowed the use of this buffer, meaning it’s not allowed to explicitly use any other flash or RAM buffer (the resiliency subsystem use a little bit of flash, but the quantity is fixed and Hugin ignore its existence).

Solving WbR Conflicts

For the record, Orange applied for two patents regarding aspects of our solution. The space is quite encumbered with low quality patents so I don’t really know the scope of this one (and I no longer work there). With that out of the picture, Hugin use a slightly different delta file format, although based on BSDiff. The main difference is that before the delta data, a blob of data encoding a stream of instructions guide Munin into building the flash into a preimage: a layout of the flash where every external read (A) is moved to its destination, although not combined with the delta. Having this pre-image make installing a BSDiff patch trivial, as we will later see.

Underlying idea

The underlying idea of the algorithm Hugin is using to generate its instructions is called HSwap, and involve partially swapping the content of two pages. The algorithm is at its more obvious when considering the cyclic WbR conflict displayed a bit higher. Each link is moving some data (A-D) that may be overwriting the data leaving in the next link, and may still be necessary in the departing page (some data may get duplicated). HSwap work on cycles with at least three pages (if there are only two, a swap is trivial to implement). Each iterations start with three pages, connected through two links (let’s say, 0x1000-0x3000 and the links A and B). This section will look like the following. It is important to note that HSwap require A to move at least as much data as B.

State of the cycle before HSwap

HSwap will proceed to copy the full content of 0x2000 in our small buffer memory, fully filling it. Hugin will then generate the instructions to generate the pre-image version of 0x2000, copying in A and getting everything we want to keep from the memory buffer. This will result in the following configuration.

State of the cycle after writing 0x2000 

The buffer still contains the remain of 0x2000, which are basically the discarded sections (that we can ignore) and B (that we can’t). We will then load the content of 0x1000 in the buffer, minus A. Because A is at least as large as B, we know that we can fit it in the buffer. We can then erase 0x1000 and write the content of the buffer to it, resulting in the following configuration where A’ is the part of A that is still required by 0x1000.

State of the cycle after updating 0x1000

What is interesting in this configuration is that 0x2000 no longer has any incoming, meaning that it is now excluded from the cycle, which is now one page smaller. By simply repeating the algorithm (let’s say, once more with the links C and D), we can result in configurations similar to this.

State of the cycle after two iterations of HSwap

In this state, we can trivially solve the cycle by loading 0x1000 in the buffer, writing its final form (pulling D and A’ in). After keeping only B in the buffer, we can write anything we need to save from 0x3000 in it, then write the final form of 0x3000 (pulling B from the buffer and C’).

Extending the primitive

Although this idea neatly solve cyclical conflicts, it’s highly inefficient for networks. However, we were able to easily extend it to more complex structures. The basic HSwap is implemented here and only called when the circumstances are right. The more complex, extended, implementation is available there,  but as you can see, it’s relying on the same underlying primitive.

To make the primitive more useful, we can turn A and B into virtual links, which are basically a thoughtful analysis of which data better belong to 0x1000 and 0x2000 in order to solve the full network. Those virtual links don’t need to have a strong relationship with any real link, but are basically what our algorithm think should move. Another difference is that 0x2000 doesn’t necessarily turn into its preimage form (basically, we get a very fancy swap).

The main downside of this change is that it require a fairly peculiar data structure (a way to represent the network) and a very strong virtual memory subsystem, so that Hugin can keep track of where are the data when generating the instructions.

Hugin Implementation

Hugin first look for the simplest types of moves, that is internal moves (moving data within a page). It then check for unidirectional references, that is a non-cyclical chain of copies that could have been easily processed in traditional systems by simply reordering the copies. However, in order to keep things simples, those approaches both rely on the instruction stream to create the preimage. Things get a bit weirder afterward, when networks get processed.

Data Structure, strength and limitations

The network representation is fairly simple.  We look for pages with incoming/outgoing data and detect any page either providing, or receiving data from any node (page) of the network. This process is repeated until the full network is identified. We then create a Network object, which will create an array of NetworkToken, where moves of data are aggregated if they share the same source and destination (basically, a multi-part move between two pages). An array of NetworkNode is then built where each page receiving data has an object to keep track of the data it’s containing. Each NetworkNode also keep track of its expected preimage layout. Duplicates (multiple copies of the same data leaving to different destinations) are then stripped, and each NetworkNode look through all of its outgoing links for the most interesting to process. This structure of data is fairly simple to create and maintain, allow reasonably quick access to a given node (the Network keep an array of NetworkNode sorted by BlockID, or base address of the flash page making, access possible on O(log(n))). It is however fairly slow in accessing the incoming NetworkToken (thanks to deduplication, the only real way to get the list of incoming nodes is to go through the final layout and to resolve through the virtual memory the current location of each bit of data, and then deduct the BlockID and fetch the NetworkNode).

Another notable data structure is our virtual memory. The VirtualMemory object’s job is to maintain a TranslationTable object and the CacheMemory. Specifically, letting its callers ignore the (many) fragmentation issues.

The TranslationTable is optimized for performance, after being a massive bottleneck during profiling. Its basic structure is an array of DetailedBlockMetadata (this array is usually wrapped within a DetailedBlock), each containing a source (the current location of the data), a destination (the virtual/original address of the data) and its length. In order to make access faster, each page has its own array and are accessed through a unordered_map (making accessing the sub-array O(1)). This segmentation make segmentations issues a bit trickier but thankfully, VirtualMemory (and especially DetailedBlock) hides the complexity.

CacheMemory is our internal representation of the small in-RAM buffer. It is basically a DetailedBlock with each segments with a “tag” boolean to differentiate the segments containing important data, and some dedicated utils to perform computation on the cache.

Processing an iteration

Our algorithm run as a sequence of iterations, each processing and removing a link of the network. The processing is such that the selected link is always dissolved as a way to ensure that we don’t loop forever.

The link of each iteration is selected by going through every active NetworkNode, selecting the link with the highest score (precomputed and cached by each node, as the operation is expensive). The score try to encourage bidirectional links as they usually allow to remove two links instead of one, and look for balanced links.

Once a link is selected, both of its NetworkNode are gathered, and the situation is assessed. We’ll call N1 the node from which the link is leaving, and N2 the destination.

If the selected link is large enough, we can consider the situation as a classical HSwap, where all outgoing links from N2 can be reduced to a virtually aggregated. If this is the case, a classical HSwap can be used (if we have enough spare space, we may even write the preimage layout of both nodes, by using the unallocated space to store any data some other nodes may need from either N1 or N2). As a short-hand, we’ll call writing the preimage layout (possibly using the spare space to store other stuffs) “turning the node final”.

In most circumstances however, we will require a slightly more involved algorithm. This time, we will virtually merge both nodes, allocate the most important NetworkToken to the proper nodes (allocate any the tokens going toward N2 to N2, and doing the same to N1 after removing any duplicates with N2’s data). Duplicates are then purged from the virtual, common, node (internal, with N1, with N2). What’s left is then allocated to the two nodes (first in full, with the potential of a single NetworkToken being split). This lets us remove any duplicate destination between N1 and N2 toward any third node (besides at most one, if a split is necessary).  As an added twist, we check if we may have enough spare space in the preimage of either node to turn the node final, on top of allocating the data. This would let us quick a node out of the network, which is always appreciable. After that, we start generating the instructions used to perform the swap.

Once this is done, the general context is updated, any potentially affected node sees its cached selected link being reset and we go for a new iteration.

Code generation optimizations

The main issue of this whole approach is that we’re adding a new chunk of data to the update payload, which should then be minimized. This is achieved through multiple more-or-less interesting means, but one fairly nifty optimization also manage to reduce the number of times blocks being touched!

We will need to take a step back, and look at the overlap between two iterations, from the point of view of the code generated. Let’s call N1 and N2 the source and destination of the first link, and N3 and N4 the source and destination of the second link:

  1. The relevant content of N2 is loaded to the cache;
  2. N2 is erased;
  3. N2 is rewritten, using data from both the cache and N1;
  4. N1 is partially loaded to the cache;
  5. N1 is erased;
  6. N1 is rewritten using the content of the cache (it is now empty);
  7. The relevant content of N4 is loaded to the cache;
  8. N4 is erased;
  9. N4 is rewritten, using data from both the cache and N3;
  10. N3 is partially loaded to the cache;
  11. N3 is erased;
  12. N3 is rewritten using the content of the cache (it is now empty).

Now, if we were to look at steps 5 to 8 (the end of the previous link 5-6 and the beginning of the next link 7-8), we see a pattern… that would be very convenient if N4 was the same node as N1 🤔…

Basically, if the source of the first link is the destination of the second, we could skip operations 5 to 8, only erasing the node once! This is implemented by the VirtualMemory as cachedWrites, and by partialSwapCodeGeneration which will try to change the order of the nodes if it helps us land in this configuration. The search for the link to process also try to give a bonus to links that would let us save on a write.

Although probably not as wide as the actual field of compiler optimization, the code generation module is ripe for very extensive optimisations, going from the link selection, slightly larger in-memory buffer, smarter data layout reorganization… Thankfully, because the instruction set used to communicate between Hugin and Munin is clearly defined, those optimizations can be introduced without having to do any change to Munin/the devices in the wild.

Post processing

Because most of the operations we just mentioned significantly impact the bookkeeping, they are performed during the main step of code generation.

Once the last iteration is performed (i.e. no inter-node link is left), we look for any non-final node (nodes whose content or layout isn’t the one expected by BSDiff). This is possible if the last NetworkToken bringing data to it was deleted because of duplicated data. If one is found, a dedicated util will generate a simple code that will load the bare minimum from the page to the cache, erase the page and write its final layout. Each page impacted by the network we just processed is then marked as “cleared” and we loop back, looking for some untouched node left, in the hope of finding another network.

Once the last network processed, the instructions we generated should provide the layout BSDiff expect on all relevant sections (we don’t make any guarantee on the parts of the layout to be replaced by Extra Data). We perform some encoding optimizations before proceeding to strip any irrelevant metadata and sending back to our caller an array of PublicCommands.

Validation

Hugin does its absolute best to make sure not to generate anything that may endanger the device. Although it can’t be of much help when making sure the new firmware doesn’t move the update code, it can make sure the update is valid, even if Hugin was buggy.

After generating the BSDiff and making sure running it as is provide the proper output, we generate the instructions meant to rewrite the flash, check they can be properly encoded and decoded and then finally run them in a small virtual machine. This implementation is as close as possible of Munin’s and perform the update in full (executing the instructions, then the modified BSDiff). This infrastructure was the main tool used during Hugin’s development to validate changes (and the time before was sometimes painful). Our process was to try to generate updates between various random binaries, assuming that if there was any issue, this subsystem would catch it (and to the best of our knowledge, it did).

Distribution

Once the update is generated and validated, we can proceed to generating the Update Payload. The format is fairly simple, and optimized for sequential reading:

  • First, the commands are encoded and written at the beginning of the file. A final instruction is appended in order to signal the end of the stream, and is repeated in order to pad the last byte;
  • A magic value (confirming the end of the instruction stream) and the address at which we have to start patching are then written;
  • A field of compressed data follows:

    • Starting with the number of “BSDiff segments”;
    • An sequence of segments, repeating the same pattern:

    • After the BSDiff segments, we have an array of memory ranges coupled with the expected hash, in order to validate that the update went properly. It starts with the number of memory ranges;

    • Then repeat the same pattern, with ranges inferred from the BSDiff update.

The data is compressed using a custom compression algorithm I will write a dedicated post about.

In addition to the Update Payload, we extracted the expected signature of the ranges we will read from (the Integrity Check) and write it to a dedicated file so that Odin may process it in a simpler fashion.

Conclusion

Congrats on getting to the end of this article, it was quite of a heavy one. I hope this article gave you an overview of WbR conflicts, how were they solved before, and how we did it. Moreover, I hope the description of our code generation and the way we validate our output is clearer. If I was at any point confusing, feel free to fire me a word on Twitter.