We have developed a technique to make sense of change information from a typical software project's history. The core of our approach is to treat the program text as a tree, to find differences in the tree structure, to group similar differences together, and then finally to extract a pattern that represents each group.
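To make the last step concrete, here is a minimal Python sketch of one way to extract a pattern from two similar tree fragments: anti-unification, which keeps the structure the trees share and replaces the parts that differ with numbered holes. The tuple encoding of trees, the `anti_unify` name, and the `?1` hole syntax are assumptions made for this illustration, and the sketch is not meant to reproduce the implementation from the paper.

```python
from itertools import count


def anti_unify(a, b, holes=None, fresh=None):
    """Return a pattern that matches both trees a and b.

    Trees are tuples (label, child, child, ...); leaves are plain strings.
    Subtrees that differ are replaced by holes named "?1", "?2", ...
    (a hypothetical notation chosen for this sketch).
    """
    if holes is None:
        holes = {}          # maps a pair of differing subtrees to its hole name
        fresh = count(1)    # source of fresh hole numbers
    if a == b:
        return a            # identical subtrees are kept as they are
    both_nodes = isinstance(a, tuple) and isinstance(b, tuple)
    if both_nodes and a[0] == b[0] and len(a) == len(b):
        # Same label and arity: keep the node and generalize the children.
        children = tuple(
            anti_unify(x, y, holes, fresh) for x, y in zip(a[1:], b[1:])
        )
        return (a[0],) + children
    # The subtrees differ: reuse the hole if this exact pair was seen before.
    if (a, b) not in holes:
        holes[(a, b)] = "?%d" % next(fresh)
    return holes[(a, b)]


# Two hand-written fragments standing in for changes found in a history.
check_p = ("If", ("NotEq", "p", "null"), ("Call", "use", "p"))
check_q = ("If", ("NotEq", "q", "null"), ("Call", "use", "q"))

pattern = anti_unify(check_p, check_q)
print(pattern)
```

Both fragments are null checks that differ only in the variable name, so the resulting pattern keeps the check and abstracts the name. The same hole is used in both positions because the same pair of subtrees (`p` and `q`) differs in each place.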
Jason Dagit presented our work at the DChanges workshop in September, part of the DocEng conference. The paper is available from the workshop site, along with the full workshop proceedings. The slides from the talk are also available.
The problem we looked at is simply stated:
What does the change history of a piece of software as represented in a source control system tell us about what people did to it over time?
Anyone who has worked on a project for any substantial amount of time knows that working on code isn't dominated by adding new features -- it is mostly an exercise in cleaning up and repairing flaws, reorganizing code to make it easier to extend in the future, and adding things that make the code more robust. While making these changes, we have often found that it feels like we do similar things over and over -- add a null pointer check here, rearrange loop counters there, add parameters to functions elsewhere. Odds are, if you asked a programmer "what did you have to do to address issue X in your code," they would describe a pattern instead of an explicit set of specific changes, such as "We had to add a status parameter and tweak loop termination tests."
We started with some work Matt Sottile had developed as part of a Department of Energy project called COMPOSE-HPC, where we built infrastructure to manipulate programs in their abstract syntax form via a generic text representation of their syntax trees. The representation we chose was the Annotated Term (ATerm) form used by the Stratego/XT and Spoofax Workbench projects. A benefit of the ATerm form for programs is that it allows us to separate the language parser from the analyzer -- parsing takes place in whatever compiler front end is available, and all we require is a traversal of the resulting parse tree or abstract syntax tree that emits terms conforming to the ATerm format.
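As a sketch of what such a traversal can look like, the following Python function walks a syntax tree and renders it as ATerm-style text (constructor applications, lists, strings, and numbers). It uses Python's standard `ast` module as a stand-in front end purely for convenience; the `to_aterm` name and the choice to render `None` and booleans as zero-argument constructors are assumptions for this illustration, not the COMPOSE-HPC tooling.

```python
import ast
import json


def to_aterm(node):
    """Render a Python AST node (or field value) as ATerm-style text."""
    if isinstance(node, ast.AST):
        # A tree node becomes Constructor(arg,arg,...) using its fields in order.
        args = [to_aterm(value) for _, value in ast.iter_fields(node)]
        return "%s(%s)" % (type(node).__name__, ",".join(args))
    if isinstance(node, list):
        # A list of children becomes an ATerm list.
        return "[%s]" % ",".join(to_aterm(item) for item in node)
    if node is None or isinstance(node, bool):
        # Assumption of this sketch: None/True/False become nullary constructors.
        return "%s()" % node
    if isinstance(node, (int, float)):
        return repr(node)
    # Everything else (identifiers, string constants, ...) becomes a quoted string.
    return json.dumps(str(node))


# The parser is a black box here; only the traversal knows about ATerms.
term = to_aterm(ast.parse("x + 1"))
print(term)
```

The exact text printed depends on the Python version, because the set of fields on each AST node varies between releases. The point is only that the emitter needs nothing from the parser beyond a way to visit each node and its children.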
To show the idea at work, we used the existing Haskell language-java parser to parse code and wrote a small amount of code to emit an ATerm representation that could be analyzed. We applied it to two real open source repositories -- those of the ANTLR parser project and the Clojure compiler. It was satisfying to apply it to real repositories rather than contrived toy ones -- the idea didn't fall over when faced with the size and complexity of real projects, which suggested to us that we had something worth sharing.
What can you learn from your software history?