In this post, we describe a new memory model that we have added to our Macaw binary analysis framework which dramatically improves its performance when dealing with large binaries. Galois continues to invest in our binary analysis tools because they address a significant problem: many developers distribute closed source binaries that cannot be analyzed with source-level static analysis techniques. Unfortunately, Macaw’s previous memory model hamstrung its performance, impeding its ability to reason about moderately sized programs. With its new memory model, Macaw further empowers users to glean information about real world programs for which they do not possess source code.
Macaw is a binary analysis framework with two key features: binary code discovery and symbolic execution. For the purposes of this blog post, we are primarily concerned with the second. To facilitate symbolic execution, Macaw consumes machine code and lifts discovered code into an intermediate representation (IR) suitable for symbolic execution via the Crucible library. Macaw supports x86-64, PowerPC (32 and 64-bit), 32-bit ARM, and RISC-V binaries.
Lifting machine code to Crucible’s IR requires Macaw to capture the semantics of the underlying computing substrate that a given binary executes on, including architecture-specific and operating system-specific details. For example, each architecture has a different method for invoking a system call, and each operating system possesses different sets of system calls and semantics for those calls. One such detail that Macaw must capture is a memory model. A memory model details the layout of memory, as well as the semantics for accessing that memory, including access control rules.
On loading a binary, Macaw maps the binary’s static memory into a single symbolic SMT array. An SMT array is an infinitely long indexed stream of data managed by an SMT solver. Given that an SMT solver is managing the array, the individual cells can contain symbolic values.
To initialize static memory, Macaw first creates a new SMT array to serve as a mapping from memory addresses to bytes. This new array contains an infinite sequence of symbolic bytes. For each address in the binary, Macaw generates an assertion to the SMT solver that the byte in the SMT array at that address matches the corresponding byte in the binary. To improve performance, Macaw generates a single conjunction of these assertions to send to the SMT solver rather than performing back-and-forth communication with the solver for every single byte. We refer to this as the eager memory model because all of the SMT array initialization happens before simulation begins.
During simulation Macaw may read and update the SMT array by querying the SMT solver. Importantly, the indices Macaw uses in these queries may themselves be symbolic, allowing Macaw to support symbolic pointers.
To illustrate how the memory model works, consider the following simple program that modifies a global variable based on random data:
On loading the binary containing this program, Macaw initializes an SMT array containing all memory from the program. The diagram below shows Macaw generating assertions that memory from the binary matches corresponding memory in the infinite SMT array. For simplicity, we have split the binary into 1024 byte chunks with memory starting at address 0x0 and ending at address 0x2000.
The next diagram demonstrates how Macaw interacts with the SMT solver during simulation of update_glob to reason about memory. To simplify understanding of interactions with the SMT solver, we introduce two functions: get and set. get takes an address and returns the byte at that address. set takes an address and a value and sets the byte at that address to the supplied value. The bytes set accepts and get returns may be concrete or symbolic.
While the above approach works well for small binaries, performance suffers as binaries grow in size. The main issue is that initializing large chunks of read-only memory into an SMT array is very slow, requiring a huge number of updates to the array (one per byte). The optimization to limit interactions with the SMT solver by forming a conjunction of SMT array specifications helps, but is insufficient as binaries approach megabytes or more in size. This problem does not only affect single large binaries; Macaw must also load read-only memory from any shared libraries, so even small binaries that dynamically link with a few small shared libraries quickly overwhelm Macaw’s ability to initialize static memory.
We solved this problem by developing an alternative, lazy memory model. In contrast to the eager memory model, which writes the entire contents of global memory to an SMT array before simulation begins, the lazy memory model improves performance by deferring much of the work to simulation time.
The single largest performance bottleneck in the eager memory model is the sheer number of updates to the SMT array that happens before simulation begins. We obtain a substantial speedup in the lazy memory model by minimizing SMT array updates as much as possible. This is made possible by observing that although a binary’s global address space might be very large, a typical execution path through the binary’s machine code will only access a fraction of the address space. As such, we can get away with only populating the parts of the SMT array needed to support the addresses that are actually accessed in practice.
How do we determine which parts of the address space are actually used? One way to do this is via static analysis of the machine code. This approach could potentially be expensive, however. Perhaps not as expensive as the eager memory model, but ideally, we would prevent users from having to wait a long time before simulation begins.
Instead, we defer this task until simulation time. At the start of simulation, we do not write anything to the SMT array, and we keep track of which regions of memory have been visited during simulation. Whenever Macaw is about to simulate an address in a memory region that has not yet been visited, we first update the portion of the SMT array corresponding to that region and proceed with simulation.
Consider the update_glob program from earlier. Let’s suppose that the glob variable lives at the address 0x0250. Right after writing the value of glob, the SMT array backing memory will look something like this:
We have updated the part of the SMT array corresponding to glob’s value, but nothing else. If we assume that the rest of update_glob does not need to access any other parts of global memory, then this means that this is the only SMT array update that is required.¹ This results in huge performance gains, as we do not have to wastefully populate parts of the SMT array that we will never use.
We can further reduce the number of SMT array updates needed by observing that not all memory accesses need to be backed by SMT. Whenever mutable memory is involved, we have little choice but to involve SMT arrays, as we need a way to track the (possibly symbolic) changes to memory over time. This is not the case for immutable memory, however, as the contents of this sort of memory cannot change. Therefore, reading from immutable memory does not require involving SMT arrays at all.
We avoid a substantial number of SMT array updates in the lazy memory model by adding a special case for reads from immutable sections of the global address space, such as the .rodata section (read-only data) of an ELF binary. Whenever such a read is encountered, we instead return the raw bytes from the original section in the binary, which is stored in Macaw as a separate data structure. Storing the global address space on the Macaw side proves much more performant than storing the same information in an SMT array.
One caveat to this special case is that reads must be fully concrete. That is, if Macaw attempts a symbolic read, where the range of addresses to read from involves symbolic variables, then it is difficult to determine if the address range fully lies within an immutable section of global memory. As a result, symbolic reads must fall back to a slow path: Macaw will populate all parts of the SMT array that could be read from (if they haven’t already been populated), and then Macaw will perform the read by muxing over the different parts of the array. Thankfully, symbolic reads are typically much rarer in a Macaw setting than concrete ones.
In order to check if a memory read refers to global memory or not, we need a reliable metric for determining if the pointer being read from is actually backed by global memory. In Macaw, every pointer has a region number that tracks which memory allocation it belongs to, and global memory is backed by a single allocation. Therefore, checking if a pointer is backed by global memory is tantamount to checking if its region number matches the region number of the global memory pointer.
Checking if the region numbers match is not entirely straightforward in a symbolic simulation setting, however, as one of the region numbers could be a symbolic term. Our experiments uncovered a common pattern in which one side of the equality check is a concrete number, but the other side is a somewhat complicated symbolic expression. In order to check if these are equal, we must consult an SMT solver, and repeatedly consulting an SMT solver for every single memory read is a recipe for poor performance.
While a pointer might have a symbolic region number, it is often the case that the path conditions on the symbolic expression forces it to be equal to a single concrete value. For instance, imagine if a pointer’s region number x had the conditions x ≤ 42 and 41 < x. While x is symbolic, its path conditions are only satisfiable if x is equal to 42. This means that x and 42 are effectively interchangeable. The latter is far preferable from a performance standpoint, as we do not need an SMT solver to figure out what its value is.
As an optimization, the lazy memory model concretizes pointers that have symbolic region numbers. That is, it will consult an SMT solver to determine if the symbolic expression is concretely equal to exactly one value. If so, it will replace the symbolic expression with the concrete one. This means that any subsequent reads from that pointer will not require any SMT interactions, thereby amortizing the cost of checking for global memory.
Note that if a pointer has a truly symbolic region number—that is, one that can be equal to multiple concrete values—then we will still have to consult an SMT solver every time that we read it. Fortunately, our experiments suggest that this is a somewhat uncommon occurrence, so the performance impact is generally negligible.
The key insight that makes the lazy memory model fast is to only pay for what you need. The eager memory model pays for everything up front, while the lazy memory model tries to only pay for the specific regions of global memory that are accessed during simulation. The question is: at what granularity should we divide up the global address space?
Because we are working in an ELF setting, our first attempt was to split up an ELF binary into its individual sections, such as .data, .bss, .text, etc. Whenever a mutable portion of an ELF section is accessed for the first time, we would then populate the portion of the SMT array that corresponds to the section.
Unfortunately, this wasn’t enough to solve our performance woes. We discovered that ELF binaries often have many small sections (on the order of a couple bytes or so) alongside a couple of very large sections that take up most of the address space. This means that whenever simulation accesses a very large section for the first time, we would end up populating most of the SMT array anyway, so splitting up the binary by ELF sections wasn’t sufficient.
Our solution was to split up the address space even further. In particular, we now divide each ELF section into 1,024-byte chunks, and we populate the SMT array one chunk at a time instead of an entire section at a time. This has proven to work well in practice, as many binaries only need to access a fraction of the chunks that make up these large ELF sections.
Implementing the lazy memory model in Macaw was key to making it possible to symbolically execute large binaries. In our case, we had a binary 1.8 megabytes in size that also depends on 2.9 megabytes worth of shared libraries. When we attempted to symbolically execute this binary with the eager memory model, we would have to wait hours to initialize the entire SMT array before simulation began, which was unacceptably slow. When we used the lazy memory model, on the other hand, we were able to start simulating the machine code almost instantly.
While the lazy memory model is clearly a better choice than the eager one for binaries like this one, we should be careful to note the drawbacks. Fundamentally, the lazy memory model is sacrificing some additional space in exchange for faster simulation times. The lazy memory model must track two additional pieces of information on the Macaw side that the eager memory model does not: it must remember the initial state of the global address space, and it must track which regions of the SMT array have already been populated. The upside is that less information needs to be written to an SMT array, which usually ends up being the biggest performance bottleneck.
As noted earlier, we were unable to come up with a satisfactory explanation of why eagerly writing megabytes of information upfront to an SMT array performs so abysmally. It is entirely possible that we are not using SMT arrays optimally and leaving performance on the table as a result. If this is indeed the case, then it may be possible to tweak the eager memory model’s underlying SMT setup to improve its performance.
Lazily populating memory in Macaw is one instance of a longer-term effort to improve the scalability of our program analysis tools on very large inputs. Prior to this work, we introduced an incremental code discovery algorithm to Macaw that avoids wastefully discovering function entry points that are never used during simulation. In our Crux and SAW tools, which work over LLVM code instead of machine code, we have implemented a similar technique for incrementally translating control-flow graphs. In the future, we plan to parse LLVM bitcode incrementally, as parsing large bitcode files is a notable performance bottleneck in Crux and SAW.
The latest version of Macaw includes both the eager and lazy memory models in Macaw as user-configurable options. The source code for Macaw is available on GitHub here, and you can read more about Macaw and its overall design here.
—
¹ This is an overly simplified view of things, as calling rand might access other parts of global memory in libc. We’ll ignore this aspect of things for the sake of this post.