Abstract: Cybersecurity is a hard problem with significant negative personal, financial, and national security ramifications for getting it wrong. Part of the reason it is difficult is that defenders have to lock every door whereas defenders only have to find one way in. Protecting against malicious program inputs is the equivalent of locking the front door, but according to DARPA's Safedocs Broad Agency Announcement, 80% of the issues in MITRE's Common Vulnerabilities and Exposures database involve input validation failures. We're not even properly locking the front door! What would high-assurance input validation look like? At the very least, it requires multiple steps:
(1) specifying the input language in a formal description language, (2) validating the specification through systematic testing, (3) generating a high-assurance parser, and then (4) ensuring that client code appropriately handles all terms in the input language.
Of course, this vision introduces multiple practical problems including the expressiveness of the formal description language, the existence and usability of the high-assurance parsing infrastructure, the performance of the generated parsers, and the applicability of reasoning tools for assessing client code. In this talk, I will survey the state of the art in high-assurance input validation, discuss impediments to adoption, and speculate on future research directions.
Bio: Kathleen Fisher is a Professor in and Chair of the Computer Science Department at Tufts University. Previously, she was a Program Manager at DARPA and a Principal Member of the Technical Staff at AT&T Labs Research. She received her PhD in Computer Science from Stanford University. Kathleen's research focuses on advancing the theory and practice of programming languages and on applying ideas from the programming language community to the problem of ad hoc data management. Kathleen is a Fellow of the ACM and a past Chair of the ACM Special Interest Group in Programming Languages (SIGPLAN). She has served as program chair for PLDI, ICFP and OOPSLA, as an editor for the Journal of Functional Programming, and as an Associated Editor of TOPLAS. She served as Co-Chair of CRA-W from 2008-2011 and is a frequent speaker at CRA-W events. She is a former Chair of DARPA’s ISAT and a member of CCC Council. She was a Hertz Foundation Fellow.
Abstract: This paper describes relationships between various state-of-the-art neural network architectures and formal languages as, for example, structured by the Chomsky Language Hierarchy. Of particular interest are the abilities of a neural architecture to represent, recognize and generate words from a specific language by learning from positive and negative samples of words in the language. Of specific interest are some relationships between languages, networks and topology that we outline analytically and explore through several illustrative experiments. By specifically comparing analytic results relating formal languages to topology via algebraic word problems with empirical results based on neural networks and persistent homology calculations, we see evidence that certain observed topological properties match analytically predicted properties. Such results are encouraging for understanding the role that modern machine learning can play in formal language processing problems.
Cybenko Bio: George Cybenko is the Dorothy and Walter Gramm Professor of Engineering at Dartmouth. Professor Cybenko has made key research contributions in signal processing, neural computing, parallel processing and computational behavioral analysis. He was the Founding Editor-in-Chief of IEEE/AIP Computing in Science and Engineering, IEEE Security & Privacy and IEEE Transactions on Computational Social Systems. He has served on the Defense Science Board and the Air Force Scientific Advisory Board in the past and is an advisor to the Army Cyber Institute at West Point. Professor Cybenko is a Fellow of the IEEE and SIAM. He received his BS (University of Toronto) and PhD (Princeton) degrees in Mathematics. Cybenko was a co-founder of Flowtraq Inc, which was acquired by Riverbed Technology in 2017.
Ackerman Bio: Joshua Ackerman is a PhD student in The Institute for Security, Technology and Society at Dartmouth College. His research is broadly focused on the intersection between machine learning and cybersecurity, with a particular focus on how to make machine learning systems more trustworthy and reliable. He received his BS degree in Mathematics with a minor in Computer Science from Carnegie Mellon University.
Lexers and parsers are often used as front ends to connect input from the outside world with the internals of a larger software system. These front ends are natural targets for attackers who wish to compromise the larger system. A formally verified tool that performs mechanized lexical analysis would render attacks on these front ends less effective.
In this paper we present Verbatim, an executable lexer that is implemented and verified with the Coq Proof Assistant. We prove that Verbatim is correct with respect to a standard lexer specification. We also analyze its theoretical complexity and give results of an empirical performance evaluation. All correctness proofs presented in the paper have been mechanized in Coq.
When binary data are sent from one party to another one, the encoding of the data can be described as a "data serialisation" language (DSL). Many DSLs employ the "length-prefix" pattern for strings, containers and other data items of variable length. This consists of an encoding of the item's length, followed by an encoding of the item itself -- without closing brackets or "end" symbols. The receiver must determine the final byte from the length read before. Length-prefix languages are not context-free. Thus, the plethora of tools and methods to specify, analyse, and parse context-free languages appears to be useless for length-prefix languages. This seems to explain why improper specifications of length-prefix languages and buggy hand-written parsers are so often a root cause for security issues and exploits, as, e.g., in the case of the famous Heartbleed bug. One might even be tempted to consider the use of length-prefix languages a security hazard.
But this consideration would be wrong. We present a transformation of words from "calc-context-free" languages (a superset of context-free and length-prefix languages) into words from proper context-free languages. The transformation actually allows to use tools from context-free languages to deal with length-prefix languages.
Our transformation runs on a Turing machine with logarithmic space. This implies the theoretical result of calc-context-free languages being in the complexity class logCFL. Similarly, deterministic calc-context-free languages are in logDCFL. To run in linear time, one needs to enhance the Turing machine by a stack to store additional data.
Safety- and security-critical developers have long recognized the importance of applying a high degree of scrutiny to a system's (or subsystem's) I/O boundaries. However, lack of care in the development of such filter components can lead to an increase, rather than a decrease, in the attack surface. On the DARPA Cyber-Assured Systems Engineering (CASE) program, we have focused our research effort on identifying cyber vulnerabilities early in system development, in particular at the Architecture development phase, and then automatically synthesizing components that mitigate against the identified vulnerabilities from high-level specifications. This approach is highly compatible with the goals of the LangSec community. Advances in formal methods have allowed us to produce hardware/software implementations that are both performant and guaranteed correct. With these tools, we can synthesize high-assurance ``building blocks'' that can be composed automatically with high confidence to create trustworthy systems, using a method we call Security-Enhancing Architectural Transformations. Our synthesis-focused approach provides a higher-leverage insertion point for formal methods than is possible with post facto analytic methods, as the formal methods tools directly contribute to the implementation of the system, without requiring developers to become formal methods experts. Our techniques encompass Systems, Hardware, and Software Development, as well as Hardware/Software Co-Design/Co-Assurance. We illustrate our method and tools with an example that implements security-improving transformations on system architectures expressed using the Architecture Analysis and Design Language (AADL). We show how input validation filter components can be synthesized from high-level regular or context-free language specifications, and verified to meet arithmetic constraints extracted from the AADL model. Finally, we guarantee that the intent of the filter logic is accurately reflected in the application binary code through the use of the verified CakeML compiler, in the case of software, or the Restricted Algorithmic C toolchain with ACL2-based formal verification, in the case of hardware/software co-design.
Security flaws in Portable Document Format (PDF) readers may allow PDFs to conceal malware, exfiltrate information, and execute malicious code. For a PDF reader to identify these flawed PDFs requires parsing and then analyzing the semantic properties or structure of the parse result. This paper presents how developer-accessible formal methods supports a language-theoretic security approach to PDF parsing and validation. We develop a PDF parser and validator in ACL2, a theorem-proving language which allows us to generate proofs of desired properties on functions, such as correctness. The structure of the parser and semantic rules are defined by the PDF grammar and fall into certain patterns, and therefore can be automatically composed from a set of verified base functions. Instead of requiring a developer to be familiar with formal methods and to manually write ACL2 code, we use Tower, our modular metalanguage, to generate verified ACL2 functions and proofs and the equivalent C code to analyze semantic properties, which can then be integrated into our verified parser or an existing parser to support PDF validation.
Differential fuzzing replaces traditional fuzzer oracles like crashes, hangs, unsound memory accesses with a difference oracle, where an implementation of a specification is said to be potentially erroneous if its behavior differs from another implementation's on the same input. Differential fuzzing has been applied successfully to cryptography and complex application format parsers like PDF and ELF.
This paper describes the application of differential fuzzing to x86-64 instruction decoders for bug discovery. It introduces MISHEGOS, a novel differential fuzzer that discovers decoding discrepancies between instruction decoders. We describe MISHEGOS's architecture and approach to error discovery, as well as the security implications of decoding errors and discrepancies. We also describe a novel fuzzing strategy for instruction decoders on variable-length architectures based on an over-approximated model of machine instructions.
MISHEGOS produces hundreds of millions of decoder tests per hour on modest hardware. We have used MISHEGOS to discover hundreds of errors in popular x86-64 instruction decoders without relying on a hardware decoder for ground truth. MISHEGOS includes an extensible framework for analyzing the results of a fuzzing campaign, allowing users to discover errors in a single decoder or a variety of discrepancies between multiple decoders. We provide access to MISHEGOS's source code under a permissive license.
Parsing is ubiquitous in software projects, ranging from small command-line utilities, highly secure network clients, to large compilers. Programmers are provided with a plethora of parsing libraries to choose from. However, implementation bugs in parsing libraries allow the generation of incorrect parsers, which in turn may allow malicious inputs to crash systems or launch security exploits. In this paper we describe a lightweight validation framework called Bohemia that a parsing library developer can use as a tool in a toolkit for integration testing. The framework makes use of the concept of Equivalence Modulo Inputs (EMI) in order to generate mutated input grammars to stress test the parsing library. We also describe the result of evaluating Bohemia with a set of parsing libraries that utilize distinct parsing algorithms. During the evaluation, we found a number of bugs in those libraries. Some of those have been reported to and fixed by developers.
When a format expert is working to understand usage of a data format, examples of the data format are often more representative than the format's specification. For example, two different applications might use very different JSON representations, or two PDF-writing applications might make use of very different areas of the PDF specification to realize the same rendered content. The complexity arising from these distinct origins can lead to large, difficult-to-understand attack surfaces, presenting a security concern when considering both exfiltration and data schizophrenia. Grammar inference can aid in describing the practical language generator behind examples of a data format. However, most grammar inference research focuses on natural language, not data formats, and fails to support crucial features such as type recursion. We propose a novel set of mechanisms for grammar inference, RL-GRIT, and apply them to understanding de facto data formats. After reviewing existing grammar inference solutions, it was determined that a new, more flexible scaffold could be found in Reinforcement Learning (RL). Within this work, we lay out the many algorithmic changes required to adapt RL from its traditional, sequential-time environment to the highly interdependent environment of parsing. The result is an algorithm which can demonstrably learn recursive control structures in simple data formats, and can extract meaningful structure from fragments of the PDF format. Whereas prior work in grammar inference focused on either regular languages or constituency parsing, we show that RL can be used to surpass the expressiveness of both classes, and offers a clear path to learning context-sensitive languages. The proposed algorithm can serve as a building block for understanding the ecosystems of de facto data formats.
Whether a file is accepted by a single parser is not a reliable indication of whether a file complies with its stated format. Bugs within both the parser and the format specification mean that a compliant file may fail to parse, or that a non-compliant file might be read without any apparent trouble. The latter situation presents a significant security risk, and should be avoided. This article suggests that a better way to assess format specification compliance is to examine the set of error messages produced by a set of parsers rather than a single parser. If both a sample of compliant files and a sample of non-compliant files are available, then we show how a statistical test based on a pseudo-likelihood ratio can be very effective at determining a file's compliance. Our method is format agnostic, and does not directly rely upon a formal specification of the format. Although this article focuses upon the case of the PDF format (ISO 32000-2), we make no attempt to use any specific details of the format. Furthermore, we show how principal components analysis can be useful for a format specification designer to assess the quality and structure of these samples of files and parsers. While these tests are absolutely rudimentary, it appears that their use to measure file format variability and to identify non-compliant files is both novel and surprisingly effective.
We model a security-typed language with gradual information flow labels in a proof assistant, present the semantics as a definitional interpreter, and prove type safety. We compare the language features and properties of various existing gradual security-typed languages, shedding light on future designs.
This paper presents the Arlington PDF Model as the first open access, comprehensive specification-derived machine-readable definition of all formally defined PDF objects and data integrity relationships. This represents the bulk of the latest 1,000-page ISO PDF 2.0 specification and is a definition for the entire PDF document object model, establishing a state of the art “ground truth” for all future PDF research efforts and implementers. Expressed as a set of text-based TSV files with 12 data fields, the Arlington PDF Model currently defines 514 different PDF objects with 3,551 keys and array elements and uses 40 custom predicates to encode over 5,000 rules. The Arlington PDF Model has been successfully validated against alternate models, as well as a sizeable corpus of extant data files and has been widely shared within the SafeDocs research community as well as the PDF Association’s PDF Technical Working Group. It has already highlighted various extant data malformations and triggered multiple changes to the PDF 2.0 specification to reflect the de-facto specification, remove ambiguities, and correct errors.
Parsing untrusted data is notoriously challenging. Failure to handle maliciously crafted data correctly can (and does) lead to a wide range of vulnerabilities. The Languagetheoretic security (LangSec) philosophy seeks to obviate the need for developers to apply ad hoc solutions by, instead, offering formally correct and verifiable input handling throughout the software development lifecycle. One of the key components in developing secure parsers is a broad coverage corpus that enables developers to understand the problem space for a given format and to use, potentially, as seeds for fuzzing and other automated testing. In this paper, we offer an update on the development of a file observatory to gather and enable analysis on a diverse collection of files at scale. Specifically, we report on the addition of a bug tracker corpus and new analytic methods on our existing corpus.
Since parsers are the line of defense between binaries and untrusted data, they are some of the most common sources of vulnerabilities in software. Language-Theoretic Security provides an approach to implement hardened parsers. We specify the binary format as a formal grammar and implement a recognizer for this formal grammar. However, most binary formats use constructs such as the length field, repeat field, and an offset instruction. Most grammar formats do not support these features.
Building on PEGs and calc-regular languages, we propose Calc-Parsing Expression Grammars (Calc-PEGs), a formalization of parsing expression grammars that supports the length field. We design an algorithm to parse Calc-PEGs in O(n^2) time and a parallel algorithm to parse Calc-PEGs in O(n) time. We also present Pegmatite, a tool to generate these parsers in C, with an option to generate VHDL code.
In this Work in Progress report, we describe ongoing research on our DECIMAL project, addressing the problem of modeling computational mechanisms at sufficient fidelity to reason about the execution semantics of programs across abstraction boundaries. The automata-based formalism that we have developed is specifically constructed to support reasoning about timed behavior over compositions of multiple component automata, modeling different parts of the system under study. We show how we use composition to model a wide variety of constructs, including synchronization across abstraction boundaries, communicating asynchronous processes, and specifying programs that can be generalized across different architectures and over localized variations in the program specification.
This paper presents a parser generation toolkit which gen-erates verified parsers from a user supplied parser descrip-tion. This parser description is described using a novel parserdescription language (PDL). The parsers generated by thisparser description language are not only verified (to ensureparser termination and to prevent memory corruption) butalso strictly limited in expressivity as the PDL is backed witha formal language model (a finite state machine).The limited expressivity will ensure the generated parsershave mathematical bounds on their complexity, allowing us toreason about properties such as decideability and equivalence.Although not the focus of this work, in the future we aimto extend this work with a module which allows us to testfor parser equivalence of one or more generated parsers. Anarbitrary hand coded parser written in C does not allow usthis flexibility to reason about such properties.These parsers arezero−co py, designed to handle largescale finite-state machines and are also space-efficient enoughthat they can be deployed on microcontrollers with limitedcomputing resources. Further, the PDL is designed to be con-cise and descriptive enough to describe frequently found gram-matical constructs found in packets of network protocols.We evaluated the parsers generated for the Bluetooth LowEnergy (BLE) LL (link layer) packets by deploying themon the firmware of the Ubertooth One device and attackingthem by hand-crafted malformed BLE LL packets. Our initialresults show the hardened parsers are effective against themalformed packets. Our goal is to test it with a wide varietyof known BLE LL exploits, and fuzz it as well to evaluatethoroughly the security of these parsers.Furthermore, in the future we also aim to investigate theeffect these parsers have on network throughput and perfor-mance of the BLE microcontrollers
We demonstrate a novel way of using data mediation and Pareto optimization to add Language-Theoretic security to software applications by evaluating the application's data structures and their interaction with other data through the application's operation. We create a general model and automated programs that provides a developer with a straightforward way to select a balance point between security coverage and performance.
Binary parsers are ubiquitous in the software we use everyday, be it to interpret file formats or network protocol messages. However, parsers are usually fragile and are a common place for bugs and security vulnerabilities. Over the years, several projects have been developed to try and solve the problem, using different forms such as parser combinators or domain-specific languages. To better understand this rich ecosystem, we initiated a platform to test and compare such tools against different specifications. With our so-called "LangSec Platform", we uncovered bugs in various implementations and gained insight into these tools.