By and large, when people share a document (whether it’s formatted text, visual media, or some combination), they believe that when they both look at it, they’re seeing the same thing. Modern users have arguably been taught/coerced into tolerating small differences (color palettes that are varying degrees of rich-to-washed-out, poop emojis with browser-dependent swirls), but on the whole, we expect that when we pull up the same file as someone else, we’re looking at the same thing.
The fact that this often holds is somewhat of a minor miracle, given that modern users may be using completely different software stacks, including operating systems, programming languages, and application codebases. Nevertheless, it’s often achieved because documents are exchanged in standardized formats: published definitions that all of those disparate stacks agree to implement.
A ready example of such a format ecosystem is the Portable Document Format (a.k.a. “PDF” to basically everyone). PDF has approximately 60 viewers across actively used platforms [1]; it is formally defined by the international standard family known as ISO 32000, with the latest edition being ISO 32000-2:2020, published in December 2020 [2]. The PDF standard is developed in the consensus-based, open forums that ISO processes require, coordinated and supported by the PDF Association [3], the peak industry body for PDF. The PDF Association also provides a public errata process [4] for capturing and providing feedback to implementers.
Despite PDF’s ubiquity and longevity, its readers, writers, and the standard are continuously (if not always dramatically) updated to account for various flaws and inconsistencies. In this post, we’ll describe a relatively major instance of such an inconsistency, which we helped to unearth while developing new tools for defining formal document standards and analyzing extant document processors in the large.
Even though document viewers are largely able to show different people the same thing when they open the same document, there are plenty of good reasons for things to go wrong. First, a well-written standard will typically strive to place as few requirements on a document as possible, in the interest of not overly constraining how document writers and readers work under the hood. This leaves plenty of room for different viewers to get creative in how they process unspecified features. Even when a standard is quite clear about content that a document must not contain, viewers are sometimes compelled to handle it anyway: no one wants to ship the PDF viewer that refuses to open a document when there’s another option around that seems to do something reasonable.
Second, it’s extremely hard to write a practical format specification that methodically and completely defines whether every possible sequence of bytes is in a format, and how each document in the format should be rendered. The current PDF 2.0 standard weighs in at 1,000 pages of English, and covers everything from the fundamental structure of a PDF’s pages to graphics, text, fonts, and multimedia (and much, much more). PDF 2.0 initially took over nine years to create after the fast-track adoption of Adobe’s PDF 1.7 specification as the original ISO 32000-1:2008 standard for PDF. After PDF 2.0 was first published in 2017 as ISO 32000-2:2017, it was then updated just three years later to reflect implementation experience, errata, and minor technical changes occurring in related technical areas. With two formal meetings every year, supported by multiple ad hoc discussion groups working on specific tasks and project leaders performing editing between each meeting, PDF 2.0 has been continuously updated and reviewed over a total of 13 years as various implementers consult it, note incompletenesses, and propose solutions. But there’s nothing like a guarantee that the current version has all ambiguities removed once and for all: if it ever had such a guarantee, it would be nothing less than historic.
Recently, as part of a multi-year research effort to improve document security (more on that below), Galois helped the PDF Association to find an ambiguity in the standard that in principle can, and in practice does, cause different PDF viewers to render completely different graphical content in a PDF, and that has existed since the very first Adobe PDF 1.0 specification way back in 1993.
The issue centers on the definition of Inline Images, a construct for including images in the content of a PDF. Like many other data objects in a PDF, inline images are represented as key-value dictionaries: key-value bindings define things like the image’s dimensions, masks and filters that must be applied to it, and colors used to render its content. Unlike all other data objects in a PDF, keys in an inline image may be referred to using either their full name or an abbreviation: e.g., the width of an image might be given using the full name Width or the abbreviation W.
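To make the mechanics concrete, here’s a small illustrative sketch (our own, not from any PDF implementation) of what resolving abbreviated keys looks like. The abbreviation table below reflects a few real pairings from the standard’s inline image key table (W for Width, H for Height, BPC for BitsPerComponent); the normalizer deliberately refuses dictionaries where both forms of a key appear, since, as described below, the pre-resolution standard never said which one wins.

```python
# A few abbreviation/full-name pairs from the PDF inline image key table.
ABBREVIATIONS = {
    "W": "Width",
    "H": "Height",
    "BPC": "BitsPerComponent",
    "CS": "ColorSpace",
    "F": "Filter",
}

def normalize_keys(raw: dict) -> dict:
    """Rewrite abbreviated keys to their full names.

    Raises if both an abbreviation and its full name are present,
    because the standard (until recently) didn't say which one wins.
    """
    out = {}
    for key, value in raw.items():
        full = ABBREVIATIONS.get(key, key)  # unknown keys pass through
        if full in out:
            raise ValueError(f"conflicting keys for {full!r}")
        out[full] = value
    return out
```

A strict normalizer like this is only one of several defensible behaviors; the whole point of the story below is that real viewers each picked a different one.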
Inline images have always been targeted at smaller images, with Adobe PDF 1.0 explicitly stating “The in-line format should be used only for small images (4K or less) because viewer applications have less flexibility when managing in-line image data.” The motivation for inline image key abbreviations isn’t given in the original Adobe PDF specifications, but we might surmise that they enabled smaller PDF files and were more efficient to compress on computers of the era with far less memory and far less capable CPUs.
Sounds simple enough as a feature, although it raises a question: if both a full name and its abbreviation occur in the same inline image, which binding should be used? Like, say, this PDF does, right here (feel free to download it: it won’t actually harm your system, it’s just hard to predict what exactly it will display).
The file contains eight images, defined by messing with various fields as follows:
The upshot of all of these tweaks is that if a PDF viewer always goes by abbreviations when they’re available, then all of the images will get displayed, and they’ll all look the same. In reality, here is what some highly prominent viewers do (names withheld so this doesn’t come off like a call-out, although honestly, it’s not the viewers’ fault: expected behavior simply wasn’t written down anywhere):
Which is all to show that different processors aren’t internally consistent in the way we would hope, and they certainly aren’t consistent with each other.
Surprisingly enough, the standard defines no course of action in cases like these. Maybe the more immediate concern is that different PDF viewers clearly end up rendering them in very different ways.
Once the issue is raised, it’s not hard to patch the standard and to update implementations to follow it. If anything, there are almost too many reasonable patches to choose from:
In the end, after some discussion, the PDF Association’s PDF Technical Working Group decided that “in the situation where both an abbreviated key name and the corresponding full key name from Table 91 are present, the abbreviated key name shall take precedence.” In a sense, the decision reaches even beyond the definition of the current PDF version (2.0) in that it acts as industry guidance for how actively maintained processors of previous versions can be altered to be consistent.
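The adopted rule is easy to state in code. Here’s a minimal sketch (our own rendering, with a hypothetical subset of the abbreviation table) of a lookup that implements “the abbreviated key name shall take precedence”:

```python
# A few full-name/abbreviation pairs from the inline image key table.
FULL_TO_ABBREV = {
    "Width": "W",
    "Height": "H",
    "BitsPerComponent": "BPC",
}

def lookup(raw: dict, full_name: str):
    """Fetch an inline image entry, preferring the abbreviated key,
    per the PDF Technical Working Group's resolution."""
    abbrev = FULL_TO_ABBREV.get(full_name)
    if abbrev is not None and abbrev in raw:
        return raw[abbrev]
    return raw.get(full_name)
```

For example, `lookup({"W": 4, "Width": 8}, "Width")` returns 4: the abbreviation wins even though the full name is also bound.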
The final decision and the process used to arrive at it illustrate an important point about practical format designs: it’s tempting, and not poorly justified, to want to revise definitions to be aesthetically pleasing according to various tastes or to bring out parallels with well-studied formalisms, but often the adjustment to a standard that will likely result in the least toil and trouble is the one that strikes a balance between formal clarity and continuity with the documents and processors already deployed in the wild.
In work supported by the DARPA SafeDocs program [5], we’ve been working to address a host of problems, many of which can be viewed as solving general forms of the problems that arose in the process of finding and fixing the issue with inline images described above.
It was hard to realize that the PDF standard was ambiguous because the PDF standard is 1,000 pages of English. Ideally, format definitions would be written in a data description language that’s intelligible to humans but is also machine-readable, in the sense that it would be equipped with a precise semantics that tools can check, analyze, and generate parsers from.
Such data-definition languages are about as far as you can get from a new concept in computer science: they’re just another name for classes of languages in the Chomsky hierarchy, which is one of the earliest results for organizing computational problems and machines that solve them [6], and which includes regular expressions and context-free grammars, formalisms that have been widely studied and which serve as the foundation for modern, ubiquitous, practical tools, such as grep and ANTLR [7]. But practical data formats like PDF aren’t regular or even context-free: they have components where the semantic interpretation of an early part of the document determines the structure of a later part (e.g., an integer that defines the number of packets to follow), or which can be defined only by inspecting the document over many passes. So at present, many practical format definitions remain in the domain of natural, informal language.
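The “count determines structure” pattern is easy to see in miniature. Here’s a toy parser for an invented wire layout (entirely our own, for illustration): a one-byte packet count N, followed by N packets, each a one-byte length followed by that many payload bytes. No context-free grammar can express this, because the *value* parsed early dictates the *shape* of what follows:

```python
def parse(data: bytes) -> list[bytes]:
    """Parse a count-prefixed sequence of length-prefixed packets.

    Layout (hypothetical): [count N][len_1][payload_1]...[len_N][payload_N]
    """
    count, offset = data[0], 1
    packets = []
    for _ in range(count):
        length = data[offset]          # a semantic value...
        offset += 1
        packets.append(data[offset:offset + length])  # ...shapes the structure
        offset += length
    return packets
```

For example, `parse(bytes([2, 1, 0xAA, 2, 0xBB, 0xCC]))` yields two packets, one of one byte and one of two.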
In an effort to put format definitions on a more formal footing, we’ve been developing a novel data-description language, named DaeDaLus, that is expressive enough to define real formats precisely and unambiguously. DaeDaLus remains a very (very) experimental prototype, but it’s completely open-source and available on GitHub. If a format designer writes up their definition in DaeDaLus, it doesn’t leave room for ambiguities like the one concerning inline images. In fact, it was the process of trying to formalize the existing PDF standard in DaeDaLus that unearthed the issue in the first place.
Without giving a full tutorial on DaeDaLus, here’s a representative snippet that defines the BitsPerComponent entry of an Inline Image, and that’s hopefully self-contained enough for you to muddle through:
The basic idea behind the definition is to say that an Inline Image is a sequence of key-value pairs and that the resulting semantic value is a dictionary that binds each parsed key to its parsed value. In a sense, there’s no deep magic here: defining Inline Images explicitly like this won’t somehow guarantee that you’ll define that format as you intend. In fact, some of the available constructs (like empty and Insert, for maintaining maps/dictionaries) and parser libraries (like Fold, which extends the textbook Kleene iteration to maintain an accumulated value) make it pretty easy to get at least some working definition written down pretty quickly. But:
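The Fold-plus-Insert pattern described above translates naturally into ordinary code. Here’s a rough Python analogue (the DaeDaLus construct names are real; the Python rendering is our sketch): start from an empty dictionary and fold each parsed key-value pair in, with the insert step refusing to overwrite an existing binding.

```python
from functools import reduce

def insert(acc: dict, pair: tuple) -> dict:
    """Analogue of Insert: add one binding, rejecting duplicates."""
    key, value = pair
    if key in acc:
        raise ValueError(f"duplicate key {key!r}")
    return {**acc, key: value}

def fold_entries(pairs) -> dict:
    """Analogue of Fold over an empty starting dictionary."""
    return reduce(insert, pairs, {})
```

The textbook Kleene iteration just recognizes a repeated element; the fold additionally threads an accumulated semantic value (here, the dictionary) through each repetition.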
All of that said, while formal definitions help format experts to be sure that they’ve handled every case of a format’s definition, intuitive understanding of format definitions (and for that matter, computer programs) comes from examples. In an effort to help format experts understand the formats that they write, we’re developing a tool called Talos that, given a format definition, automatically generates documents that are in it (if you’re familiar with program analysis tools, think of it as a symbolic execution engine for a data-description language). Talos support is still experimental, but the current version is available in the DaeDaLus repository.
The Inline Image issue brings up at least two other key aspects of practical format design.
In an effort to support these and other workflows, we are developing the Format Analysis Workbench (FAW), a tool whose primary goal is to enable users to understand how formats are processed by their existing processors. For a format of interest, the FAW can be instantiated with a document corpus, a set of document processors, and a set of output patterns. It can then be used to process the corpus using the processors and subdivide the corpus, based on which documents result in which outputs. In addition, it can identify files that cause notable rendering differences between multiple parsers.
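The core subdivision idea can be modeled in a few lines. Here’s a toy sketch (the real FAW and its interfaces differ; every name below is our invention): run each processor over each document, treat the tuple of outputs as that document’s signature, bucket the corpus by signature, and flag documents whose processors disagree.

```python
from collections import defaultdict

def subdivide(corpus, processors):
    """Bucket documents by their cross-processor output signature."""
    buckets = defaultdict(list)
    for doc in corpus:
        signature = tuple(p(doc) for p in processors)
        buckets[signature].append(doc)
    return dict(buckets)

def disagreements(corpus, processors):
    """Documents on which processors produce differing outputs."""
    return [doc for doc in corpus
            if len({p(doc) for p in processors}) > 1]
```

Documents surfaced by `disagreements` are exactly the interesting ones for the inline image story: files that render one way in one viewer and another way in the next.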
Day-to-day, the world manages to use data formats pretty effectively, arguably in part because format standards are published as sensible, reasonably precise guidelines, document producers try not to push the limits of the format too hard when they can help it, and document consumers attempt to be as flexible as possible, typically above and beyond what the standard calls for. However, as attackers begin to work more and more at the edges of behavior left undefined by standards and implemented ad hoc by producers and consumers in isolation, well-intentioned but incomplete specs will no longer suffice. We’ve been working hard on tools that make it feasible to iron out actual document formats in all of their messy details and understand document formats as they’re actually used: there are plenty of interesting problems (both theoretical and practical) left to bite off, so stay tuned.
[1] https://en.wikipedia.org/wiki/List_of_PDF_software
[2] https://www.iso.org/standard/75839.html
[3] https://pdfa.org
[4] https://github.com/pdf-association/pdf-issues
[5] https://www.darpa.mil/program/safe-documents