At Galois, we’re always looking for better ways to build reliable systems. A big part of that is doing our own research. At 130 people, we’re one of the world’s largest formal methods research teams! But another part of our work is understanding new tools and techniques from the research community. We’re interested in solving problems using any approach that works, including learning from other formal-methods researchers. Of course, it can be difficult to know how well a technique will work in practice from reading a research paper, so we often test new techniques out ourselves to learn more about them.
One area that we’re very interested in at the moment is black-box testing of network protocol implementations. Two important examples are the ubiquitous TLS protocol and QUIC, a new transport-layer protocol under standardization at the IETF that will underlie HTTP/3. These protocols are critical for security but are notoriously difficult to get right, and implementations of TLS have often suffered from security vulnerabilities in the wild. To make matters worse, network protocols are inherently difficult to test because exercising a bug might require that several nodes play out a long sequence of interactions.
Traditionally, protocol developers use interoperability testing, in which, e.g., a server running implementation A of QUIC is tested against a client running a different implementation B. By running the protocol between the two implementations, long sequences of complex interactions can be generated. This approach also provides a limited sort of adversarial testing, as implementation B might have interpreted the QUIC specification differently, resolved ambiguities differently, or might behave erroneously. The problem with this approach is that it is very hard to achieve a high coverage of the QUIC specification. Doing so depends on the implementation of B, the precise sequence of network operations, and other contingent factors. Testing erroneous behaviour is even more challenging.
That’s why we were excited to see two papers from Ken McMillan and Lenore Zuck which use some pretty clever ideas from formal methods to address the shortcomings of interoperability testing:
McMillan and Zuck develop a testing tool built on top of the Ivy formal verification framework that uses specification-based testing to test much more diverse scenarios. In their approach, McMillan and Zuck develop a formal model of the IETF QUIC specification and use it to randomly simulate network nodes that send packets to the implementation under test. For example, a QUIC server implementation would be tested by randomly simulating a QUIC client (or clients) using the formal model. Compared to interoperability testing, the randomized, simulated client produces much more diverse stimulus than the concrete implementation B for testing the server A. McMillan and Zuck find a bunch of interesting bugs in implementations of QUIC, including vulnerabilities to denial-of-service attacks and an information leak similar to the famous Heartbleed bug. Sounds pretty cool!
So we were curious what it would be like to apply this new tool ourselves. There’s a lot of great formal methods research happening at the moment, but in our experience, there can be a gap between research and what’s needed for a practical engineering tool. Academic researchers are focused on pushing the frontiers, which means that research tools sometimes have rough edges that make them difficult for new users. Research tools can also be specialised to a particular use-case in a way that makes them hard to generalise. We wanted to replicate McMillan and Zuck’s results and understand how useful their approach would be for testing other QUIC implementations.
The good news is that the tool worked as we’d hoped! We applied it to Picoquic, a simple QUIC implementation, and even managed to find a non-conformance problem in the version we tested. That’s a good indicator that this approach could be practical in the future. That said, this is definitely a research tool rather than something ready for deployment to customers. Using it currently requires a deep understanding of the Ivy language, the QUIC protocol, and the target implementation. We’re confident that these aspects of the tool could be improved with more engineering work. So, in general, the results of our investigations were very positive, and we’re excited to apply this idea ourselves in the future.
If you are curious to try it out, we packaged our experiment using Docker for easy replication and made it available at https://github.com/GaloisInc/ivy-quic-testing-demo.
Let’s dig a bit more into McMillan and Zuck’s approach and see what we find.
Ivy is a general-purpose formal verification tool, but McMillan and Zuck use it specifically for black-box testing of protocols. Ivy’s testing of QUIC works by analysing events visible on the wire rather than any internal events of the implementation. This means Ivy only needs to be able to communicate through the network with the implementation under test and requires no source-code access or instrumentation. The testing tool masquerades as just another peer on the network, which makes it easy to deploy.
Ivy models are written in the Ivy modeling language, which is approachable to someone who has programmed, e.g., in C or Python. Models in Ivy are written using imperative (i.e., C-like) statements and pre/post-conditions which are expressed in first-order logic. These models are very flexible and can specify the behaviour of protocols with unbounded numbers of peers, unbounded data, unbounded sequences of network packets, etc. Ivy translates those models into the language of the Z3 SMT solver to generate tests and check for compliance.
The QUIC model consists of a monitor, which is an (infinite) state-machine which monitors the events visible on the wire: each wire event is associated with a precondition, which stipulates in which states of the monitor the event is allowed to take place and an update of the monitor’s state. A sequence of packets observed on the wire is correct if, when we run the monitor on each event in order, no precondition is ever violated.
This modeling style enables what is called assume-guarantee testing1 of network components. For example, to test a server, we will check that, assuming that the packets sent by clients never cause a precondition violation (the assumption), then the packets sent by the server never cause a precondition violation (the guarantee).
In practice, a test of a QUIC server might proceed as follows:
Upon failure, Ivy displays a high-level event trace and the location of the assertion that was violated.
Since the model is used both to generate packets to send the implementation under test and to check that the responses are correct, a failed test may reveal not only a) bugs in the implementation, but also cases in which the model is b) too strong or c) too weak. The model is too strong when a precondition is too restrictive and fails on a packet that is intended to be legal by the authors of the IETF specification; if a component under test generates such a packet, a spurious precondition violation will occur. Conversely, the model is too weak when it can cause a packet that is not intended to be legal to be generated during a test; such a packet is likely to be rejected by the component under test, cause it to crash, or behave unexpectedly. In both cases, we are likely to get an error trace that can be inspected.
In other words, protocol testing with Ivy allows us to test both the implementation and the model at the same time! This is important because, like programming, developing models is an error-prone process and models must be debugged too. This means that a user can begin with a candidate model that is gradually refined at the same time that bugs are fixed in the implementation under test. One note of caution is that when a test failure occurs and Ivy displays an error trace, figuring out whether we are in situations a), b), or c) is a manual process of interpretation, requiring close inspection of the trace, the Ivy model, and the IETF specification.
Because the specification expresses the whole protocol for a network of any size, testing can begin any component, and any size of network can be tested. The user can add logical conditions that restrict (or disable) protocol steps, thereby narrowing down the set of interactions produced. Conversely, test-generation can be made more permissive by removing logical conditions: this has the effect of allowing more interactions. The specification can also be modified to allow invalid requests in order to test the component for robustness against bugs or security attacks.
Finally, we should note that Ivy’s protocol-testing facility is not specific to QUIC and can be applied to other protocols. However, Ivy comes with a QUIC specification (based on IETF draft 23) and a configuration to use it to test sample QUIC implementations (Quant and Picoquic).
According to the QUIC specification, only packets containing ACK-eliciting frames should be acknowledged with an ACK. An ACK frame itself is not considered ACK eliciting, as this can result in an ACK loop. The Ivy QUIC specification did not specify this requirement, we so added it ourselves. Then, we tested the Picoquic server using one of the predefined test scenarios that come with the Ivy distribution. In this test, Ivy simulates a client downloading an HTML file from the Picoquic server under test. In each run, Ivy picks at random a behavior of the client that is consistent with the Ivy QUIC specification, thus exposing the server to a wide range of client behaviors. This exposed a case in which the Picoquic server acknowledges a client packet that contains only an ACK frame, which violates the specification (note that we used an old version of Picoquic, written against IETF draft 23, because the Ivy QUIC specification follows IETF draft 23). Our results can be replicated using the Docker setup hosted at https://github.com/GaloisInc/ivy-quic-testing-demo
At Galois, we’re big fans of formal verification, which is the gold standard of systems assurance. But for many projects, developing a formal proof would be too expensive. At the same time, it can be difficult to test a complex system such as a network protocol. The exciting thing about McMillan and Zuck’s work is that it suggests we can have some of the best of both worlds. By using a formal model, we can systematically test a very complex protocol implementation without the cost of formally verifying the whole codebase. We’re really happy with the results we got in this preliminary study, and we hope to see a lot more research on this type of compositional testing in the future.
1Giannakopoulou, Dimitra, C. S. Păsăreanu, and Colin Blundell. "Assume-guarantee testing for software components." IET Software 2.6 (2008): 547-562.