Funded by DARPA’s SafeDocs program, this project aims to achieve three major goals: design a novel language for precisely describing practical formats, design a framework for learning formal descriptions from extant documents, and enable the generation of correct and performant format parsers.
We exchange information every day through widely used document formats such as PDF, DOCX, and XLSX, and Internet of Things devices exchange data using DDS streaming formats. The popularity of these formats, combined with the lack of clear and explicit descriptions of which documents actually belong to them, makes them highly susceptible to malware exploits, such as the “macroless” DDE exploit that recently affected Microsoft Word documents.
Attackers exploit these formats by tricking ubiquitous document processors, called parsers, into treating malicious files as harmless documents: the parser either validates a malicious file as safe or inadvertently carries out the attack while attempting to process the file. These attacks succeed because the specification of which documents are safe is typically left implicit. Parser programmers must constantly track which properties of a document have actually been validated as the parser executes, and whether those properties suffice to carry out a desired action. In practice, this is an overwhelming task.
Over the course of this project, we will provide users with trustworthy formats by:
Many of the practical formats that we rely on evolved to be usable by becoming highly flexible. For example, a PDF might simply contain formatted text, but it might also contain pictures, videos, and JavaScript code. The cost of this flexibility is that precisely describing which documents belong to a format becomes extremely challenging. For example, simply knowing how big one part of a document should be may require a deep inspection of the content in an earlier part of the document. A major result of this project will be a data-description language, named DaeDaLus, that will enable format users, including those without programming experience, to understand and specify practical formats precisely. The key features of the DaeDaLus design and toolchain will include:
DaeDaLus will be supported by a toolchain that will enable format users to quickly obtain DaeDaLus descriptions of the document formats that they rely on. To obtain a complete description of a format in DaeDaLus, a format user will be able to provide a corpus of existing format documents to a powerful tool, called LearnDDL. LearnDDL will automatically generate a DaeDaLus format that naturally describes the existing format by employing sophisticated techniques from machine learning.
Optionally, the format user will use other tools in the framework to learn normalizations: mappings from relatively complex formats, which describe formats as they are actually used, to simpler formats that humans and parsers alike can understand more easily (e.g., by removing redundant whitespace symbols).
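As a rough illustration of what a normalization does, the following Python sketch hand-writes the whitespace example mentioned above. It is not how the framework works: the project intends to learn normalizations from documents, and the function name `normalize_whitespace` and the sample input are assumptions made only for this example.

```python
import re


def normalize_whitespace(document: str) -> str:
    """Map a document to a simpler form with redundant whitespace removed.

    Hypothetical, hand-written normalization: runs of spaces and tabs are
    collapsed to a single space, and leading/trailing whitespace on each
    line is dropped.
    """
    normalized_lines = []
    for line in document.splitlines():
        collapsed = re.sub(r"[ \t]+", " ", line)  # collapse runs of blanks
        normalized_lines.append(collapsed.strip())  # drop edge whitespace
    return "\n".join(normalized_lines)


if __name__ == "__main__":
    raw = "key:   value \t\nother:\t\tthing  "
    print(normalize_whitespace(raw))
```

A useful property of a normalization like this one is that it is idempotent: applying it to an already normalized document leaves the document unchanged, so the simpler format is a stable target for both humans and parsers.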
Critically, the DaeDaLus formats, including those generated by LearnDDL, will be declarative, complete, and precise descriptions of which documents are in a format. But for format users to be able to work with documents in a format, they still need a parser to process those documents.
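To make the distinction between a description and a parser concrete, consider a toy format, invented for this example, in which one length byte N is followed by exactly N payload bytes. This is a small instance of the data dependency mentioned earlier, where the size of one part of a document depends on the content of an earlier part. The sketch below is plain Python rather than DaeDaLus or Pickle, and the names `Record`, `ParseError`, and `parse_record` are assumptions for illustration only.

```python
from dataclasses import dataclass


class ParseError(ValueError):
    """Raised when the input is not a document of the toy format."""


@dataclass(frozen=True)
class Record:
    """A value that exists only if the whole input was validated."""
    payload: bytes


def parse_record(data: bytes) -> Record:
    """Parse one record: a 1-byte length N, then exactly N payload bytes."""
    if len(data) < 1:
        raise ParseError("missing length byte")
    length = data[0]  # the earlier part decides the size of the later part
    payload = data[1:1 + length]
    if len(payload) != length:
        raise ParseError("payload shorter than declared length")
    if len(data) != 1 + length:
        raise ParseError("trailing bytes after payload")
    return Record(payload)


if __name__ == "__main__":
    print(parse_record(b"\x03abc"))
    try:
        parse_record(b"\x05abc")
    except ParseError as err:
        print("rejected:", err)
```

Because `parse_record` either returns a `Record` or raises, code that receives a `Record` does not have to track separately which properties of the input have already been checked.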
Developers will obtain correct, high-performance format parsers by using the Safe PARser Toolkit for Assurance (SPARTA). SPARTA will include a domain-specific language (DSL) designed specifically for developing parsers, named Pickle. While programmers will be able to develop Pickle parsers completely from scratch, SPARTA will also enable them to automatically generate at least the major parser components from DaeDaLus formats. We expect that a typical use of the Pickle toolchain to develop a parser for a DaeDaLus format will proceed as follows:
DaeDaLus and SPARTA aim to help put an end to issues caused by unclear file formats and unsafe parsing. The frameworks, toolkits, and toolchains we are developing will be designed for customizability while taking advantage of new advances in automation and machine learning. If successful, systems built using these implementations will be invulnerable to a wide class of malware attacks.
This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-19-C-0073. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Defense Advanced Research Projects Agency (DARPA).