A Visual Guide to Pointer Analysis with cclyzer++: Part 1

Pointer analysis is a foundational static analysis with applications to the problems of program optimization, verification, bug finding, and many others. Galois recently released cclyzer++, a scalable and precise open-source pointer analysis for languages that compile to LLVM, especially C and C++. cclyzer++ is based on the cclyzer project, which was authored by George Balatsouras and Yannis Smaragdakis. This two-part blog post series visually illustrates the essence of cclyzer++ by means of examples.

At the end of this series, you'll be able to read simple LLVM programs (unoptimized, compiled by Clang from simple C programs). You'll be prepared to read and understand the output of a pointer analysis for LLVM, and to evaluate how precisely it approximates the behavior of the corresponding program. In particular, you'll be able to explain how cclyzer++ produces the following graph from a program that uses linked lists and what the graph says about the behavior of the program.

For more information about cclyzer++, see the project documentation.

Why Pointer Analysis

Pointer analysis is necessary for accurate static approximation of inter-procedural control- and data-flows of pointer-manipulating programs. As a result, many analyses and applications that depend on inter-procedural reasoning rely on pointer analysis (or would be greatly improved if they did), such as:

  • Program understanding
    • Visualizing a program’s call-graph
    • Listing the callers of a function
  • Security
    • Static taint tracking
    • Whole-program static analysis, e.g. to find potential memory corruption
  • Program optimization
    • Alias analysis
    • Automatic parallelization
  • Program verification
    • Construction of scalable memory models

Abstraction

The goal of a pointer analysis is to compute the points-to relation: a mapping from program variables and allocations to the set of all allocations they may point to when the program is executed. However, the set of stack and heap allocations made by a program at runtime can be unbounded. In the following C program, it's impossible to tell how many times the while loop and its call to malloc will be executed.

int main() { 
    while (rand() % 2 == 0) { 
        char *z = malloc(1); 
        f(z);
    }
    return 0; 
}

To ensure termination and scalability, cclyzer++ abstracts sets of allocations by their allocation site. In a C program, an allocation site can be:

  1. A global variable
  2. A stack allocation (e.g., a call to alloca or a local variable which has its address taken with &)
  3. A heap allocation (e.g., a call to malloc or realloc)

In this blog post, we’ll label abstract allocations like so (this is a somewhat arbitrary choice, and is slightly simpler than the way cclyzer++ actually labels allocations):

  • global@global_var for a global variable global_var
  • stack@main[y] for a stack allocation in function main of a variable y
  • heap@main[z] for a heap allocation in function main assigned to a variable z

The points-to relation then consists of a mapping from variables and abstract allocations to sets of abstract allocations that over-approximate the points-to relations that can hold during all program executions.

Visual Grammar

To illustrate the analysis algorithm, we can draw the points-to relation as a graph. Variables are drawn in ellipses, allocations are drawn in boxes, and edges are the "points-to" relation.

C and LLVM

LLVM is a low-level programming language used internally by several compilers including Clang and rustc. cclyzer++ analyzes LLVM code, but don't fret if you're not already familiar with LLVM - all the examples are shown first in C, relevant LLVM concepts will be explained along the way, and LLVM is at a comparable level of abstraction with C in the first place.

Allocations

Allocation sites are the starting point for the analysis. In brief, a variable at an allocation site points to the abstract allocation corresponding to that allocation site. Let's see some examples:

Stack Allocations

Consider the following C program:

int main(int argc, char *argv[]) {
    int z;
    // ... address taken somewhere with "&z" ...
    return 0;   
}

Assuming this program takes the address of z, the compiler has to allocate it on the stack. At the LLVM level, this happens with the alloca instruction:

define @main(i32 %argc, i8** %argv) {
entry:
    %z = alloca i32*, align 8
    ret i32 0
}

(In LLVM, local variables all start with %, whereas globals and function names start with @. Note also that on x86_64, the C type char compiles to the LLVM type i8, and int to i32.) The points-to graph of this LLVM program has just the one variable z, and it points to its stack allocation:

Heap Allocations

It's a similar story for heap allocations.

#include <stdlib.h>
int main(int argc, char *argv[]) {
    char *z = malloc(1);
    return 0;
}

Here's what the call to malloc looks like at the LLVM level. Mostly, everything's just a tad more explicit.

define @main(i32 %argc, i8** %argv) {
entry:
    %z = call i8* @malloc(i64 1)
    ret i32 0
}

Again, the points-to graph just has the variable z, and it points to the abstract allocation corresponding to the call to malloc.

Global Allocations

Finally, let's look at global variables.

int z;
int main(int argc, char *argv[]) { return 0; }

At first glance, it might not be clear that there are any pointers at all in this program (other than argv, which we'll ignore). However, all global variables are pointers at the LLVM level. It's not obvious from the code, but the expression @z has type i32*:

define @main(i32 %argc, i8** %argv) {
entry:
    %z = call i8* @malloc(i64 1)
    ret i32 0
}

Thus, every global variable has a corresponding global variable allocation that it points to.

Casts

C allows us to cast between pointers with different types (modulo certain restrictions). For instance, we can cast an int* to a void* to pass it to memset:

#include <stdlib.h>
#include <string.h>
int main(int argc, char *argv[]) {
    int *x = (int *) malloc(sizeof(int));
    memset((void *) x, 0, sizeof(int));
    return 0;
}

At the LLVM level, these casts become bitcast instructions:

(The C-level memset compiles to the strange function @llvm.memset.p0i8.i64, but it’s safe to ignore that detail for our purposes.) Just like a C cast, bitcast doesn't alter its argument at all; the output of a bitcast points to whatever the input pointed to.

Stores

Stores to Heap Pointers

Take a look at the following program, which stores a pointer to another pointer:

#include <stdlib.h>
int main(int argc, char *argv[]) {
    char **x = malloc(sizeof(char*));
    char *y = malloc(sizeof(char));
    *x = y;
    return 0;
}

At the LLVM level, stores like *x = y become store instructions:

define i32 @main(i32 %argc, i8** %argv) {
entry:
    %x1 = call noalias i8* @malloc(i64 8)
    %x2 = bitcast i8* %x1 to i8**
    %y = call noalias i8* @malloc(i64 1)
    store i8* %y, i8** %x2, align 8
    ret i32 0
}

Here's what the points-to graph would look like just before the store instruction is executed:

After %y gets stored to %x2, the allocation that %x2 points to (that is, heap@main[%x1]) will point to whatever %y points to (that is, heap@main[%y]). Visually:

Multiple Stores to the Same Pointer

In all of the examples we've seen so far, the analysis has been pretty precise - each variable and allocation pointed to at most one allocation. Multiple stores to the same allocation result in more complex points-to graphs.

#include <stdlib.h>
int main(int argc, char *argv[]) {
    char **x = malloc(sizeof(char*));
    char *y = malloc(sizeof(char));
    char *z = malloc(sizeof(char));
    *x = y;
    *x = z;
    return 0;
}


define i32 @main(i32 %argc, i8** %argv) {
entry:
    %x1 = call noalias i8* @malloc(i64 8)
    %x2 = bitcast i8* %x1 to i8**
    %y = call noalias i8* @malloc(i64 1)
    %z = call noalias i8* @malloc(i64 1)
    store i8* %y, i8** %x2, align 8
    store i8* %z, i8** %x2, align 8
    ret i32 0
}

Here's what the points-to graph looks like before both store instructions are executed:

After both store instructions:

This result isn't completely satisfying; at runtime, the concrete allocation represented by heap@main[%x1] can't actually point to both heap@main[%y] and heap@main[%z] at the same time. This is yet another form of abstraction that's used to retain scalability - cclyzer++ is flow-insensitive, meaning it doesn't precisely model the control flow of the target program. It computes a single points-to relation for all points in the program, rather than one points-to relation for each point in the program.

Stores to Globals and Modeling Null

Finally, let's look at a store to a global variable:

#include <stdlib.h>
char *g = NULL;
int main(int argc, char *argv[]) {
    char *x = malloc(sizeof(char));
    g = x;
    return 0;
}

@g = global i8* null, align 8

; Function Attrs: noinline nounwind uwtable
define i32 @main(i32 %argc, i8** %argv) {
entry:
    %call = call noalias i8* @malloc(i64 1)
    store i8* %call, i8** @g, align 8
    ret i32 0
}

The global variable is initialized to null. cclyzer++ models null as a special allocation that participates in the points-to relation like any other allocation would (up to some special handling here and there). null never points to any other allocation because it can’t be stored to nor loaded from. Here's the points-to graph before the store:

After the store:

Loads

Let’s look at loads, that is, dereferencing pointers.

#include <stdlib.h>
int main(int argc, char *argv[]) {
    char **x = malloc(sizeof(char*));
    char *y = malloc(sizeof(char));
    *x = y;
    char *z = *x;
    return 0;
}

At the LLVM level, loads like *x become load instructions. Appending to our first store example:

define @main(i32 %argc, i8** %argv) {
entry:
    %x1 = call i8* @malloc(i64 8)
    %x2 = bitcast i8* %x1 to i8**
    %y = call i8* @malloc(i64 1)
    store i8* %y, i8** %x2, align 8
    %z = load i8*, i8** %x2, align 8
    ret i32 0
}

As shown above, the points-to graph just after the store looks like this:

Loading from an allocation "undoes" storing to it - it retrieves whatever was previously stored. Think of each allocation as a box. Stores put items into the box, and loads take out whatever was previously put in. In this example, loading from %x2 retrieves the pointer stored in heap@main[%x1], which points to heap@main[%y]. This pointer is assigned to the variable %z.

When an allocation points to multiple allocations, all of them get retrieved by a load; everything that was put in the box gets taken back out again. Extending the program above with multiple stores,

#include <stdlib.h>
int main(int argc, char *argv[]) {
    char **x = malloc(sizeof(char*));
    char *y = malloc(sizeof(char));
    char *z = malloc(sizeof(char));
    *x = y;
    *x = z;
    char *w = *x;
    return 0;
}

the points-to graph would look like this:

Conclusion

You've now seen most of the essential features of cclyzer++. The next post will explore some more advanced topics and bring it all together with a larger example.