LangSec Workshop

at IEEE Security & Privacy, May 26th, 2022

Papers & Slides


Trojan Source and Bad Characters: Invisible Hacks and Reluctant Patching. Ross Anderson and Nicholas Boucher (U. of Cambridge)

Abstract: Your source code may be lying to you. Trojan Source Attacks inject hidden vulnerabilities into the encoding of source code files. In this talk, we describe how to launch these attacks, and how to prevent them. The story begins in the world of adversarial machine learning, where "Bad Characters" allow powerful attacks on natural-language processing systems. This led to the discovery that almost all programming languages were also susceptible to invisible vulnerabilities. Following an industry-wide coordinated disclosure, many of the major code editors and programming languages have implemented defenses, although almost no NLP providers have done so.

Ross Anderson Bio: Ross Anderson is Professor of Security Engineering at Cambridge and Edinburgh Universities. His research has ranged over cryptography, hardware security, usability, payment systems, healthcare IT, the economics of security and privacy, the measurement of cybercrime and abuse, and related policy issues.

Nicholas Boucher Bio: Nicholas Boucher is a PhD Candidate at the University of Cambridge where he performs security research. After completing studies at Harvard, Nicholas worked in cybersecurity at Microsoft before joining the Cambridge Computer Laboratory Security Group supervised by Professor Ross Anderson. His current research focuses on the security of compilers and machine learning systems.


What is a Secure Programming Language? Cristina Cifuentes (Oracle Software Assurance)

Abstract: Our most sensitive and important software systems are written in programming languages that are inherently insecure, making the security of the systems themselves extremely challenging. It is often said that these systems were written with the best tools available at the time, so over time with newer languages will come more security. But we contend that all of today's mainstream programming languages are insecure, including even the most recent ones that come with claims that they are designed to be "secure". Our real criticism is the lack of a common understanding of what "secure" might mean in the context of programming language design. We propose a simple data-driven definition for a secure programming language: that it provides first-class language support to address the causes for the most common, significant vulnerabilities found in real-world software. To discover what these vulnerabilities actually are, we have analysed the National Vulnerability Database and devised a novel categorisation of the software defects reported in the database. This leads us to propose three broad categories, which account for over 50% of all reported software vulnerabilities, that, as a minimum, any secure language should address. While most mainstream languages address at least one of these categories, interestingly, we find that none address all three.

Looking at today's real-world software systems, we observe a paradigm shift in design and implementation towards service-oriented architectures, such as microservices. Such systems consist of many fine-grained processes, typically implemented in multiple languages, that communicate over the network using simple web-based protocols, often relying on multiple software environments such as databases. In traditional software systems, these features are the most common locations for security vulnerabilities, and so are often kept internal to the system. In microservice systems, these features are no longer internal but external, and now represent the attack surface of the software system as a whole. The need for secure programming languages is probably greater now than it has ever been.



Research Panel

Devising a Blueprint to Eliminate Early-Stage Differentials; Letitia Li, Andrew Holmes, Sophia D'Antoine, Leonard Rosenthal, Sanford Bingham. Moderated by Prashant Anantharaman (Dartmouth).

Abstract: Many security vulnerabilities stem from or manifest as differentials between a specification and a parser, two parsers (i.e., parser differentials), or even two iterations of a specification. Defenses such as data description languages, LangSec-compliant parsers, and machine-readable specifications may be effective in addressing these differentials. However, they also face challenges, perhaps most notably barriers to adoption. Realizing the potential of these solutions requires identifying, understanding, and addressing these challenges. The proposed panel, if accepted, will bring together stakeholders with different expertise to discuss ongoing challenges in addressing differentials, primarily focusing on early-stage processes (e.g., specification design) and challenges in the adoption of proposed solutions. We hope this discussion will move us further toward the development of a practical blueprint for eliminating early-stage differentials.

Research Papers

"Anomaly Detection with Neural Parsers That Never Reject"; Alexander Grushin and Walt Woods (Galois Inc.).

Reinforcement learning has recently shown promise as a technique for training an artificial neural network to parse sentences in some unknown format, through a body of work known as RL-GRIT. A key aspect of the RL-GRIT approach is that rather than explicitly inferring a grammar that describes the format, the neural network learns to perform various parsing actions (such as merging two tokens) over a corpus of sentences, with the goal of maximizing the estimated frequency of the resulting parse structures. This can allow the learning process to more easily explore different action choices, since a given choice may change the optimality of the parse (as expressed by the total reward), but will not result in the failure to parse a sentence. However, this also presents a limitation: because the trained neural network can successfully parse any sentence, it cannot be directly used to identify sentences that deviate from the format of the training sentences, i.e., that are anomalous. In this paper, we address this limitation by presenting procedures for extracting production rules from the neural network, and for using these rules to determine whether a given sentence is nominal or anomalous. When a sentence is anomalous, an attempt is made to identify the location of the anomaly. We empirically demonstrate that our approach is capable of grammatical infer- ence and anomaly detection for both non-regular formats and those containing regions of high randomness/entropy. While a format with high randomness typically requires large sets of production rules, we propose a two pass grammatical inference method to generate parsimonious rule sets for such formats. By further improving parser learning, and leveraging the presented rule extraction and anomaly detection algorithms, one might begin to understand common errors, either benign or malicious, in practical formats.

[Paper] [Slides]

"Statistical Detection of Format Dialects Using the Weighted Dowker Complex"; Michael Robinson (American University), Letitia W. Li (BAE Systems), Cory Anderson (BAE Systems), and Steve Huntsman.

This paper provides an experimentally validated, probabilistic model of file behavior when consumed by a set of pre-existing parsers. File behavior is measured by way of a stan- dardized set of Boolean “messages” produced as the files are read. By thresholding the posterior probability that a file exhibiting a particular set of messages is from a particular dialect, our model yields a practical classification algorithm for two dialects. We demonstrate that this thresholding algorithm for two dialects can be bootstrapped from a training set consisting primarily of one dialect. Both the theoretical and the empirical distributions of file behaviors for one dialect yield good classification performance, and outperform classification based on simply counting messages.

Our theoretical framework relies on statistical independence of messages within each dialect. Violations of this assumption are detectable and allow a format analyst to identify “boundaries” between dialects. By restricting their attention to the files that lie within these boundaries, format analysts can more efficiently craft new criteria for dialect detection.

[Paper] [Slides]

"Certified Parsing of Dependent Regular Grammars"; John Sarracino (Cornell), Gang Tan (Penn State), and Greg Morrisett (Cornell)

Parsers are ubiquitous, but formal reasoning about the behavior of a parser is challenging. One key challenge is parsing dependent formats, which are difficult for traditional techniques to handle because parse values can influence future parsing behavior. We present dependent grammars, which extend regular languages with data-dependency by generalizing concatenation to monadic bind. Even this small tweak adds significant expressive power; for example, conditional parsing and dependent repetition are both implementable using monadic bind.

However, it is not obvious how to actually parse dependent grammars. Prior work generalized derivatives to monadic bind but at significant complexity. In fact, we show that the complexity arises from enabling embedded functions to use the input string as a recursive argument (i.e. enabling left-recursion). We implement a Brzozowski derivative based matching algorithm, and we show how many popular parser combinator functions can be implemented in our library while retaining the same simplicity as traditional parser combinators.

We implement and formalize these grammars in Coq, as well as a derivative-based matching algorithm. We prove soundness and completeness of the derivative operator in the standard way. We also implement a variety of popular parser combinator functions and give formal specifications to them. Finally, we implement as a case study a verified netstring parser, and prove functional correctness of the parser.

[Paper] [Slides]

"A Secure Parser Generation Framework for IoT Protocols on Microcontrollers"; Sameed Ali and Sean Smith (Dartmouth).

In the recent years, vulnerabilities found in the packet parsers of Bluetooth Low Energy (BLE) protocol have called for a need to have secure lightweight protocol packet parsers for mi- crocontrollers. Since these packet protocol grammars consist of packets of limited size it is possible to parse them efficiently via Finite State Machines (FSM). However, parsing via FSMs would require developers to either express the grammars via regular expressions or constructed hand-coded parsers. Unfor- tunately, hand-coding parsers is error-prone; furthermore, due in part to certain constructs found in such grammars which are not commonly found in text-based regular grammars. In addition, expressing binary grammar constructs in regular expression is not only challenging and error-prone but the resulting expressions are often complex and unreadable. Thus the lack of an alternative language for describing these con- structs is a hindrance to the use of finite state machines to generate parsers which are safe, secure and computationally bounded. This paper presents a novel secure parser generation framework which consists of an easy-to-use parser descrip- tion language called "Microparse" and a toolkit that utilizes finite state machines to generate lightweight parsers for micro- controllers. To demonstrate the viability of this approach, we have applied our framework to generate parsers for the BLE protocol running on an Ubertooth One Microcontroller. We demonstrate that the generated FSMs are lightweight enough to be run on devices with very limited resources, and are eas- ier to use for developers; we offer this method as a potential solution for the various bugs found in the implementation of the BLE firmware in the recent years.


"A Format-Aware Reducer for Scriptable Rewriting of PDF Files"; Prashant Anantharaman (Dartmouth), Steven Cheung (SRI International), Nicholas Boorman (SRI International), and Michael E. Locasto (Narf Industries).

Sanitizing untrusted input is a significant unsolved problem in defensive cybersecurity and input handling. Even if we assume that a safe, provably correct parser exists to validate the input syntax, processing logic may still require the application of certain transformations of the parser output. For example, parsers traditionally store the parsed objects in a generic tree structure; hence the processing logic needed to modify this structure can be significant. Also, popular parsing tools do not include the functionality to serialize (or unparse) the internal structure back to bytes.

This paper argues for the need for a format-aware tool to modify structured files. In particular, we propose adding a reducer to the Parsley PDF checker. The Parsley Reducer a tool to apply transformations on input dynamicallywould allow developers to design and implement rules to transform PDF files. Next, we describe a set of Parsley normalization tools to be used with the Reducer API and showcase their capabilities using several case studies. Finally, we evaluate our normalization approach to demonstrate that (1) the developer effort to design our reducer rules is minimal, (2) tools extract more text from transformed files than the original files, and (3) other popular PDF transformation tools do not apply the corrections we demonstrate.


"Capturing the iccMAX calculator Element: A Case Study on Format Design"; Vijay Kothari, Prashant Anantharaman, Sean Smith (Dartmouth), Briland Hitaj, Prashanth Mundkur, Natarajan Shankar (SRI International), Letitia Li (BAE Systems), Iavor Diatchki and William Harris (Galois Inc.).

ICC profiles are widely used to provide faithful digital color reproduction across a variety of devices, such as monitors, printers, and cameras. In this paper, we document our efforts on reviewing and identifying security issues with the calculatorElement description from the recent iccMAX specification (ICC.2:2019), which expands upon the ICC v4 specification (ICC.1:2010). The iccMAX calculatorElement, which captures a calculator function through a stack-based computational approach, was designed with security in mind. We analyzed the iccMAX calculatorElement using a variety of approaches that utilized: the proof assistant PVS, the theorem-proving language ACL2, the data description language DaeDaLus, and tools tied to the data description language Parsley. Bringing the tools of formal data description, theorem proving, and static analysis to a non-trivial real-world specifi- cation has shed light on both the tools and the specification. This exercise has led us to discover numerous bugs within the specification, to identify specification improvements, to identify flaws with a demo implementation, and to recognize ways that we can improve our own tools. Additionally, this particular case study has broader implications for those who work with specification, data description languages, and parsers. In this paper, we document our work on this exercise and relay our key findings.

[Paper] [Slides]

Research Reports

"Strengthening Weak Links in the PDF Trust Chain"; Mark Tullsen (Galois Inc.), William Harris (Galois Inc.), and Peter Wyatt (PDF Association).

In many practical and security-critical formats, the interpretation of a document segment as a Document Object Model (DOM) graph depends on a concept of reference and complex contextual data that binds references to data objects. Such referential context itself is defined discontinuously, and is often compressed, to satisfy practical constraints on usability and performance. The integrity of these references and their context must be ensured so that an unambiguous DOM graph is established from a basis of trust.

This paper describes a case study of a critical instance of such a design, namely the construction of PDF cross-reference data, in the presence of potentially multiple incremental updates and multiple complex dialects expressing these references. Over the course of our case study, we found that the full definition of cross-reference data in PDF contains several subtleties that are interpreted differently by natural implementations, but which can nevertheless be formalized using monadic parsers with constructs for explicitly capturing and updating input streams.

Producing our definition raised several issues in the PDF standard acknowledged and addressed by the PDF Association and the ISO. In the future, the definition can serve as a foundation for implementing novel format security analyses of DOM-defining formats.


"Progress on Building a File Observatory for Secure Parser Development"; Tim Allison, Wayne Burke, Dustin Graf, Chris Mattmann, Anastasija Mensikova, Mike Milano, Philip Southam, Ryan Stonebraker (Jet Propulsion Laboratory).

Parsing untrusted data is notoriously challenging. Failure to handle maliciously crafted data correctly can (and does) lead to a wide range of vulnerabilities. The Language- theoretic security (LangSec) philosophy seeks to obviate the need for developers to apply ad hoc solutions by, instead, offering formally correct and verifiable input handling throughout the software development lifecycle. One of the key components in developing secure parsers is a broad coverage corpus that enables developers to understand the problem space for a given format and to use, potentially, as seeds for fuzzing and other automated testing. In this paper, we offer an update on work reported at the LangSec 2021 conference on the development of a file observatory to gather and enable analysis on a diverse collection of files at scale. The initial focus of the observatory is on Portable Document Format (PDF) files and file formats typically embedded in PDFs. In this paper, we report on refactoring the ingest process, applying new analytic methods, and improving the User Interface.


"On the Feasibility of Retrofitting Operating Systems with Generated Protocol Parsers"; Wayne Wang and Peter C. Johnson (Middlebury College).

Parser generators show great promise in producing more secure input validation and processing routines. Operating system kernels are a particularly appealing place to deploy generated parsers due to their position at the periphery of a machine’s attack surface, the power they wield, and their complexity. At the same time, kernels can also be byzantine and idiosyncratic. Before we attempt to create generated parsers for wide deployment into kernels, therefore, it is important to understand how much of the existing codebase those parsers will replace, what (if any) functionality beyond parsing the generated code will need to implement, and what kernel facilities the code must integrate with. To answer these questions, we analyzed three protocol implementations in each of three open-source kernels to understand their behavior and organization. We identified commonalities and differences, measured the quantity of code used for different purposes, identified when and how multiplexing decisions were made, and analyzed the order in which operations were performed. We intend this analysis to inform the creation of generated parsers, for a variety of protocols, that can be smoothly integrated into modern kernels.

[Paper] [Slides]

Work in Progress

"Linguistic Tools for Managing Grammatical Domains"; Anders Miltner (UT Austin), Devon Loehr (Princeton), Arnold Mong (Princeton), Kathleen Fisher (Tufts), David Walker (Princeton).

Many common data types, such as dates, phone numbers, and addresses, have multiple textual representations. The possibility of varied representations creates ambiguities when parsing this data and can easily lead to bugs in what would otherwise be straightforward data processing applications. In this work-in-progress paper, we explore this problem and consider the design of linguistic tools to help manage it. More specifically, we introduce the idea of a grammatical domaina set of related grammarswhich we use to characterize the many possible representations of a date or phone number or other object of this sort. We also propose the design of a YACC-like language, which is capable of defining individual grammatical domains, such as dates, or larger document formats that contain such ambiguous elements. We illustrate our ideas via example, and sketch a continuing research agenda aimed at producing tools to help programmers process these ambiguous formats.

[Paper] [Slides]

"Input synthesis for data-dependent grammars"; Simon Winwood and Mike Dodds (Galois Inc.).

We present Talos, a work-in-progress tool for synthesizing examples of data formats encoded in the DaeDaLus data-description language. Talos is based on a multi-layer design: we first analyze a format to identify independent slices, and we then use search and SMT solving to synthesize solutions for these slices. Talos works for small examples of DaeDaLus formats and we are working on scaling it to two target examples: the PDF and NITF format descriptions we have previously encoded in DaeDaLus.