Many cloud applications use both serverless functions, for bursts of stateless parallel computation, and container orchestration, for long-running microservices and tasks that need to interact. Ideally a single platform would offer the union of these systems' capabilities, but neither is sufficient to act as that single platform: serverless functions are lightweight but cannot act as servers with long-term state, while container orchestration offers general-purpose computation but instance start-up takes too long to support burst parallelism. SigmaOS is a new multi-tenant cloud operating system that combines the best of container orchestration and serverless in one platform with one API. SigmaOS computations, called procs, can be long-running, stateful, and interact with each other, making them a good match for both serverless and microservice tasks. A key aspect of the SigmaOS design is its *cloud-centric* API, which provides flexible management of computation; a novel abstraction for communication endpoints, SigmaEPs, which allow a tenant's procs to communicate efficiently but prohibit procs from sending packets to other tenants; and a flexible naming system to name, for example, SigmaEPs. Quick start-up is important for serverless uses. A key enabling observation is that both serverless and microservice applications rely on cloud services for much of the work traditionally done by the local OS (e.g., access to durable storage and additional compute resources). SigmaOS exploits this observation by providing only a small and generic local operating system image to each proc, which can be created much more quickly than a container orchestration instance since SigmaOS need not install application-specific filesystem content or (due to SigmaOS's SigmaEPs) configure an isolated overlay network. Microbenchmarks show that SigmaOS can cold start a proc in milliseconds and can create many procs per second, distributing them over a 24-machine cluster. An evaluation of SigmaOS with two microservice applications from DeathStarBench, a MapReduce application, and an image processing benchmark shows that the SigmaOS API supports both microservices and lambda-style computations, and provides better performance than corresponding versions on AWS Lambda and Kubernetes.
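To make the proc abstraction concrete, the sketch below shows the shape of burst-parallel proc creation through a SigmaOS-style client; the package, types, and method names (Client, Spawn, WaitExit) are hypothetical stand-ins for illustration, not SigmaOS's actual API.

    // Hypothetical sketch of spawning burst-parallel procs through a
    // SigmaOS-style client. Names and types are illustrative assumptions,
    // not the real SigmaOS interface.
    package main

    import (
    	"fmt"
    	"sync"
    	"sync/atomic"
    )

    // Proc describes a computation to run; procs may be short-lived
    // (serverless-style) or long-running and stateful (microservice-style).
    type Proc struct {
    	Program string
    	Args    []string
    }

    // Client stands in for a per-tenant handle to the platform.
    type Client struct{}

    var nextPid int64

    // Spawn would ask the platform to start the proc on some machine in the
    // tenant's realm; here it just fabricates a pid so the sketch runs.
    func (c *Client) Spawn(p Proc) (string, error) {
    	return fmt.Sprintf("%s-%d", p.Program, atomic.AddInt64(&nextPid, 1)), nil
    }

    // WaitExit blocks until the proc exits (a no-op in this sketch).
    func (c *Client) WaitExit(pid string) error { return nil }

    func main() {
    	var clnt Client
    	var wg sync.WaitGroup
    	for i := 0; i < 100; i++ {
    		wg.Add(1)
    		go func(i int) {
    			defer wg.Done()
    			pid, err := clnt.Spawn(Proc{Program: "thumbnail", Args: []string{fmt.Sprint(i)}})
    			if err != nil {
    				return
    			}
    			clnt.WaitExit(pid)
    		}(i)
    	}
    	wg.Wait()
    }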
@inproceedings{sigmaos:sosp24, title = {Unifying serverless and microservice tasks with SigmaOS}, author = {Ariel Szekely and Adam Belay and Robert Morris and M. Frans Kaashoek}, booktitle = {Proceedings of the 30th ACM Symposium on Operating Systems Principles (SOSP 2024)}, year = 2024, month = nov, address = {Austin, Texas}, }
Parfait is a framework for proving that an implementation of a hardware security module (HSM) leaks nothing more than what is mandated by an application specification. Parfait proofs cover the software and the hardware of an HSM, which catches bugs above the cycle-level digital circuit abstraction, including timing side channels. Parfait's contribution is a scalable approach to proving security and non-leakage by using intermediate levels of abstraction and relating them with transitive information-preserving refinement. This enables Parfait to use different techniques to verify the implementation at different levels of abstraction, reuse existing verified components such as CompCert, and automate parts of the proof, while still providing end-to-end guarantees. We use Parfait to verify four HSMs, including an ECDSA certificate-signing HSM and a password-hashing HSM, on top of the OpenTitan Ibex and PicoRV32 processors. Parfait provides strong guarantees for these HSMs: for instance, it proves that the ECDSA-on-Ibex HSM implementation—2,300 lines of code and 13,500 lines of Verilog—leaks nothing more than what is allowed by a 40-line specification of its behavior.
@inproceedings{parfait:sosp24, title = {Modular Verification of Secure and Leakage-Free Systems: From Application Specification to Circuit-Level Implementation}, author = {Anish Athalye and Henry Corrigan-Gibbs and M. Frans Kaashoek and Joseph Tassarotti and Nickolai Zeldovich}, booktitle = {Proceedings of the 30th ACM Symposium on Operating Systems Principles (SOSP 2024)}, year = 2024, month = nov, address = {Austin, Texas}, }
Hardware and software systems are susceptible to bugs and timing side-channel vulnerabilities. Timing leakage is particularly hard to eliminate because leakage is an emergent property that can arise from subtle behaviors or interactions between hardware and software components in the entire system, with root causes such as non-constant-time code, compiler-generated timing variation, and microarchitectural side channels. This thesis contributes a new approach using formal verification to rule out such bugs and build systems that are correct, secure, and leakage-free.
This thesis introduces a new theory called information-preserving refinement (IPR) for capturing non-leakage in addition to correctness and security, implements a verification approach for IPR in the Parfait framework, and applies it to verifying hardware security modules (HSMs). Using Parfait, a developer can verify that an HSM implementation leaks no more information than is allowed by a succinct application-level specification of the device's intended behavior, with proofs covering the implementation's hardware and software down to its cycle-precise wire-I/O-level behavior.
This thesis uses Parfait to implement and verify several HSMs, including an ECDSA certificate-signing HSM and a password-hashing HSM, on top of Ibex and PicoRV32-based hardware platforms. Parfait provides strong guarantees for these HSMs: for example, it proves that the ECDSA-on-Ibex implementation—2,300 lines of code and 13,500 lines of Verilog—leaks nothing more than what is allowed by a 40-line specification of its behavior.
@phdthesis{aathalye:thesis, title = {Formally Verifying Secure and Leakage-Free Systems: From Application Specification to Circuit-Level Implementation}, author = {Anish Athalye}, school = {Massachusetts Institute of Technology}, year = 2024, month = aug, }
As engineers develop increasingly sophisticated algorithms to optimize cryptographic computations, the often simple mathematical specifications of those computations become convoluted in the implementations, giving rise to a class of correctness bugs. Because cryptographic algorithms often secure sensitive information, their correctness, and in turn their security, is a top priority. The Number Theoretic Transform (NTT) is an algorithm that enables efficient polynomial multiplication and has recently gained importance in post-quantum cryptography. This thesis presents a proof of correctness of the NTT in F★, a proof-oriented programming language that extracts to OCaml, and shows that the NTT can be used to perform polynomial multiplication. We provide an implementation of the Cooley-Tukey fast NTT algorithm and a proof that it matches the original NTT specification. This thesis also presents a representation of polynomials in the F★ subset Low★, which extracts to performant C code.
@mastersthesis{rrono-meng, title = {Verifying Correctness of the Number Theoretic Transform and Fast Number Theoretic Transform in {F*}}, author = {Ryuta R. Ono}, school = {Massachusetts Institute of Technology}, year = 2024, month = aug, }
Hardware security modules (HSMs) are powerful tools for building secure computer systems, allowing developers to factor out security-critical code to separate devices. Because HSMs usually work with sensitive data, it is crucial to be able to verify that they are secure. Many HSMs today also include true random number generators (TRNGs) as part of their architecture to seed cryptographic functions for generating keys, creating nonces, padding, and more. This thesis presents a definition of Information-Preserving Refinement with Randomness (IPRR) that captures the idea that an HSM with a TRNG is correct and secure against timing side-channel attacks. We additionally construct a strategy for proving IPRR and develop Karatroc, a tool for verifying that an HSM satisfies IPRR. Through the creation and evaluation of Karatroc, we demonstrate the ability to verify HSMs with TRNGs without significant added cost in performance or proof length compared to existing proof methods.
@mastersthesis{zhao-meng, title = {Verifying Hardware Security Modules With True Random Number Generators}, author = {Katherine Zhao}, school = {Massachusetts Institute of Technology}, year = 2024, month = may, }
Cryptographic primitives—hash functions, symmetric key encryption algorithms, asymmetric key exchange algorithms, and more—are used everywhere to achieve security in modern computing. Since these algorithms have complicated, math-heavy implementations, they are typically used through cryptographic library functions. However, many timing side-channel attacks, which leak information when execution time depends on secrets, have been found in popular cryptographic libraries, such as OpenSSL. Formal verification aims to rule out timing side channels in cryptographic software.
This thesis presents Quake, a framework for verifying that cryptographic library functions are constant-time for a specific hardware implementation, regardless of where the code is located in memory. Quake represents the location of code in memory using symbolic addresses and introduces a ROM model that returns concrete memory data for symbolic addresses. This thesis evaluates Quake and demonstrates that it can detect address-dependent timing behavior and does so in a reasonable amount of time.
@mastersthesis{jyxu-meng, title = {Towards Cycle-Level Verification of Constant-Time Cryptography}, author = {Jessica Y. Xu}, school = {Massachusetts Institute of Technology}, year = 2024, month = may, }
@phdthesis{tslilyai:thesis, title = {Flexible Privacy via Disguise and Revealing}, author = {Lillian Tsai}, school = {Massachusetts Institute of Technology}, year = 2024, month = may, }
@inproceedings{tiptoe:sosp23, author = {Alexandra Henzinger and Emma Dauterman and Henry Corrigan-Gibbs and Nickolai Zeldovich}, title = {Private Web Search with {Tiptoe}}, booktitle = {Proceedings of the 29th ACM Symposium on Operating Systems Principles (SOSP 2023)}, year = 2023, month = oct, address = {Koblenz, Germany}, }
@inproceedings{edna:sosp23, author = {Tsai, Lillian and Gross, Hannah and Kohler, Eddie and Kaashoek, Frans and Schwarzkopf, Malte}, title = {Edna: Disguising and Revealing User Data in Web Applications}, booktitle = {Proceedings of the 29th ACM Symposium on Operating Systems Principles (SOSP 2023)}, year = 2023, month = oct, address = {Koblenz, Germany}, }
@inproceedings{grove:sosp23, title = {{G}rove: a {S}eparation-{L}ogic {L}ibrary for {V}erifying {D}istributed {S}ystems}, author = {Upamanyu Sharma and Ralf Jung and Joseph Tassarotti and Frans Kaashoek and Nickolai Zeldovich}, pages = {113--129}, booktitle = {Proceedings of the 29th ACM Symposium on Operating Systems Principles (SOSP 2023)}, year = 2023, month = oct, address = {Koblenz, Germany}, }
K2 is a new architecture and verification approach for hardware security modules (HSMs). The K2 architecture's rigid separation between I/O, storage, and computation over secret state enables modular proofs and allows for software development and verification independent of hardware development and verification while still providing correctness and security guarantees about the composed system. For a key step of verification, K2 introduces a new tool called Chroniton that automatically proves timing properties of software running on a particular hardware implementation, ensuring the lack of timing side channels at a cycle-accurate level.
@inproceedings{k2:kisv23, title = {The {K2} Architecture for Trustworthy Hardware Security Modules}, author = {Anish Athalye and M. Frans Kaashoek and Nickolai Zeldovich and Joseph Tassarotti}, booktitle = {1st {W}orkshop on {K}ernel {I}solation, {S}afety and {V}erification ({KISV} '23)}, year = 2023, month = oct, }
SigmaOS is a new multi-tenant cloud operating system that simplifies distributed application development. Its design centers around the novel concepts of realms and procs. A realm presents a tenant with a shared global namespace that hides the machine boundaries. Tenants structure their applications as process-like procs interacting through the realm’s namespace. Procs are lightweight, stateful, and can communicate. SigmaOS manages the scheduling and execution of procs to achieve high resource utilization and performance isolation.
This thesis compares SigmaOS with Kubernetes, a mainstream cloud operating system, using a microservice-style social network website and a serverless image resizing program. It measures their performance on a small-scale cluster in CloudLab. The SigmaOS version of the social network is easier to build (30% fewer lines), and its image resizing starts faster (25%–89%). SigmaOS performs comparably to Kubernetes in latency and resource consumption when running a single application but provides better performance isolation when running multiple applications in separate realms: latency increases by 4–11% with concurrent applications in SigmaOS versus over 150% in Kubernetes.
@mastersthesis{yizheng-he-ms-thesis, title = {Evaluating SigmaOS with Kubernetes for Orchestrating Microservice and Serverless Applications}, author = {Yizheng He}, school = {Massachusetts Institute of Technology}, year = 2023, month = sep, }
@inproceedings{simplepir:usenixsec23, author = {Alexandra Henzinger and Matthew M. Hong and Henry Corrigan-Gibbs and Sarah Meiklejohn and Vinod Vaikuntanathan}, title = {One Server for the Price of Two: Simple and Fast Single-Server Private Information Retrieval}, booktitle = {32nd USENIX Security Symposium (USENIX Security 23)}, year = 2023, address = {Anaheim, CA}, publisher = {USENIX Association}, month = aug, }
Multi-version concurrency control (MVCC) is a widely used, sophisticated approach for handling concurrent transactions. vMVCC is the first MVCC-based transaction library that comes with a machine-checked proof of correctness, providing clients with a guarantee that it will correctly handle all transactions despite a complicated design and implementation that might otherwise be error-prone. vMVCC is implemented in Go, stores data in memory, and uses several optimizations, such as RDTSC-based timestamps, to achieve high performance (25--96% the throughput of Silo, a state-of-the-art in-memory database, for YCSB and TPC-C workloads). Formally specifying and verifying vMVCC required adopting advanced proof techniques, such as logical atomicity and prophecy variables, because MVCC transactions can linearize at timestamp generation, prior to transaction execution.
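As a rough illustration of the multi-versioning idea itself (not vMVCC's verified Go code), the sketch below keeps a version chain per key and serves a read at timestamp ts from the newest version written at or before ts; vMVCC additionally derives timestamps from RDTSC and proves the whole design correct.

    // Illustrative multi-version store: each key maps to versions ordered by
    // ascending write timestamp, and a reader at timestamp ts sees the newest
    // version with Begin <= ts. Structure and names are assumptions for
    // exposition, not vMVCC's implementation.
    package mvcc

    type Version struct {
    	Begin uint64 // timestamp at which this version became visible
    	Value string
    }

    type Store struct {
    	chains map[string][]Version // per-key versions, oldest first
    }

    func NewStore() *Store { return &Store{chains: make(map[string][]Version)} }

    // Read returns the value visible to a transaction reading at timestamp ts.
    func (s *Store) Read(key string, ts uint64) (string, bool) {
    	vers := s.chains[key]
    	for i := len(vers) - 1; i >= 0; i-- {
    		if vers[i].Begin <= ts {
    			return vers[i].Value, true
    		}
    	}
    	return "", false
    }

    // Write installs a new version at timestamp ts; a real MVCC system also
    // detects conflicts and synchronizes concurrent writers.
    func (s *Store) Write(key, val string, ts uint64) {
    	s.chains[key] = append(s.chains[key], Version{Begin: ts, Value: val})
    }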
@inproceedings{vmvcc:osdi23, title = {Verifying {vMVCC}, a high-performance transaction library using multi-version concurrency control}, author = {Yun-Sheng Chang and Ralf Jung and Upamanyu Sharma and Joseph Tassarotti and M. Frans Kaashoek and Nickolai Zeldovich}, booktitle = {Proceedings of the 17th {USENIX} {S}ymposium on {O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} '23)}, year = 2023, month = jul, }
Plat is a FIDO2 security key that uses privilege separation to protect the application's private keys even if bugs are present in bug-prone parts of its codebase. Plat's design encapsulates drivers and parsers in sandboxes that are isolated from the secrets that are used to perform authentication.
To achieve privilege separation in the embedded context, Plat uses a new WebAssembly-based toolchain for ARM microcontrollers to implement and enforce isolation between individual components of an existing system without rewriting drivers and application code. This toolchain includes special support for device drivers, safely enabling isolated modules to access peripheral memory-mapped IO.
Plat's privilege separation reduces the lines of code in the trusted code base by 60% from our 20,000-line reference implementation while adding only 319 new trusted lines. Plat's isolation strategy has acceptable performance overhead that does not prevent interactive use, with the slowest step of an authentication rising from 277 ms natively to 600 ms when sandboxed.
Plat ensures the protection of its secret key, and thus the security of the accounts it authenticates, in the presence of several classes of bugs.
@mastersthesis{bkettle-meng, title = {Privilege-separating embedded applications using {WebAssembly} in the {Plat} {FIDO2} security key}, author = {Benjamin B. Kettle}, school = {Massachusetts Institute of Technology}, year = 2023, month = jun, }
We propose abandoning leakage models for verifying timing properties of cryptographic software, instead directly verifying software with respect to a hardware implementation at the RTL level. Early experiments include verifying that an Ed25519 implementation running on a 6-stage pipelined processor executes in a constant number of cycles. Many challenges remain, including scaling up to modern out-of-order speculative cores and extending the approach to reason about library code outside the context of a whole application.
@inproceedings{chroniton:plarch23, title = {Leakage models are a leaky abstraction: the case for cycle-level verification of constant-time cryptography}, author = {Anish Athalye and M. Frans Kaashoek and Nickolai Zeldovich and Joseph Tassarotti}, booktitle = {1st {W}orkshop on {P}rogramming {L}anguages and {C}omputer {A}rchitecture ({PLARCH} '23)}, year = 2023, month = jun, }
Storage systems must often store confidential data for their users. It is important to ensure that confidentiality of the stored data is maintained in the presence of bugs and malicious adversaries. This thesis tackles this problem using formal verification, a technique that involves proving that a software system always satisfies certain requirements.
There are numerous challenges in specifying what it means for a system to be confidential and in proving that a system satisfies such a specification: nondeterministic behavior, indirect leakage of the data, system complexity, and others. Nondeterminism in particular creates unique challenges by making probabilistic leakage possible. This dissertation introduces the following to address these challenges:
Two novel confidentiality specifications for storage systems with nondeterministic behavior: data nonleakage and relatively deterministic noninfluence. Both definitions accommodate discretionary access control and intentional disclosure of the system metadata.
Two techniques accompanying these specifications: sealed blocks and nondeterminism oracles. These techniques address the challenges encountered in proving the confidentiality of such systems and reduce the required proof effort. They are formalized and implemented in two frameworks: DiskSec and ConFrm. Both frameworks contain metatheory to help developers prove that their implementation satisfies the specification.
The first confidential, crash-safe, and formally verified file systems with machine-checkable proofs: SFSCQ and ConFs. SFSCQ uses data nonleakage and ConFs uses relatively deterministic noninfluence as their confidentiality specifications. Both are implemented and verified in Coq.
An evaluation shows that relatively deterministic noninfluence incurs a 9.2x proof overhead per line of implementation code. Experiments with multiple benchmarks show that our systems perform better than the FSCQ verified file system but worse than the ext4 file system.
@phdthesis{atalay:thesis, title = {Verifying confidentiality under nondeterminism for storage systems}, author = {Atalay Mert Ileri}, school = {Massachusetts Institute of Technology}, year = 2023, month = feb, }
Grove is a new framework for machine-checked verification of distributed systems. Grove focuses on modular verification. It enables developers to state and prove specifications for their components (e.g., an RPC library), and to use those specifications when proving the correctness of components that build on them (e.g., a key-value service built on RPC).
To enable modular specification and verification of distributed systems, Grove uses the idea of ownership from separation logic. Using Grove, we built a verified unreliable RPC library, where we captured unreliability in the formal specification by using duplicable ownership. We also built a verified exactly-once RPC library, where we reasoned about ownership transfer from client to server (and back) over an unreliable network by using the escrow pattern.
Overall, we developed and verified an example system written in Go consisting of the RPC libraries, a sharded key-value store with support for dynamically adding new servers and rebalancing shards, a lock service, and a bank application that supports atomic transfers across accounts that live in different shards, built on top of these services. The key-value service scales well with the number of servers and the number of cores per server. The proofs are mechanized in the Coq proof assistant using the Iris library and Goose.
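For intuition about what the exactly-once RPC library mentioned above must accomplish over an unreliable network (independent of how Grove proves it), the sketch below shows the standard reply-cache pattern: the server executes each request ID at most once and returns the cached reply to any retries. This is a generic illustration, not Grove's verified code.

    // Generic reply-cache pattern for exactly-once execution over a network
    // that may retry or duplicate requests: the server remembers the reply
    // for each request ID, so re-deliveries never re-execute the operation.
    // This illustrates the goal, not Grove's verified RPC library.
    package rpc

    import "sync"

    type Server struct {
    	mu      sync.Mutex
    	replies map[uint64][]byte // request ID -> cached reply
    }

    func NewServer() *Server {
    	return &Server{replies: make(map[uint64][]byte)}
    }

    // Handle runs op at most once per request ID; duplicate deliveries of the
    // same ID return the cached reply instead of executing op again.
    func (s *Server) Handle(reqID uint64, op func() []byte) []byte {
    	s.mu.Lock()
    	defer s.mu.Unlock()
    	if r, ok := s.replies[reqID]; ok {
    		return r
    	}
    	r := op()
    	s.replies[reqID] = r
    	return r
    }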
@mastersthesis{sharma-sm, title = {{M}odular {V}erification of {D}istributed {S}ystems with {Grove}}, author = {Upamanyu Sharma}, school = {Massachusetts Institute of Technology}, year = 2022, month = sep, }
This paper develops a new approach to verifying a performant file system that isolates crash safety and concurrency reasoning to a transaction system that gives atomic access to the disk, so that the rest of the file system can be verified with sequential reasoning.
We demonstrate this approach in DaisyNFS, a Network File System (NFS) server written in Go that runs on top of a disk. DaisyNFS uses GoTxn, a new verified, concurrent transaction system that extends GoJournal with two-phase locking and an allocator. The transaction system's specification formalizes under what conditions transactions can be verified with only sequential reasoning, and comes with a mechanized proof in Coq that connects the specification to the implementation.
As evidence that the proofs can use sequential reasoning, DaisyNFS uses Dafny, a sequential verification language, to implement and verify all the NFS operations on top of GoTxn. The sequential proofs helped achieve a number of good properties in DaisyNFS: easy incremental development (for example, adding support for large files), a relatively short proof (only 2× as many lines of proof as code), and a performant implementation (at least 60% the throughput of the Linux NFS server exporting ext4 across a variety of benchmarks).
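As a rough sketch of the programming model such a transaction system enables (hypothetical names and structure, not GoTxn's actual interface), each operation reads and writes disk addresses inside a transaction; per-address locks acquired on first access are held until commit, which is why each operation can be verified as if it ran alone.

    // Hypothetical sketch of a two-phase-locking transaction over disk
    // addresses: locks are acquired on first access and released only at
    // commit, so each transaction appears to execute sequentially. This
    // illustrates the programming model, not GoTxn's real API.
    package txn

    import "sync"

    type Addr uint64

    type System struct {
    	mu    sync.Mutex
    	locks map[Addr]*sync.Mutex
    	disk  map[Addr][]byte
    }

    func NewSystem() *System {
    	return &System{locks: make(map[Addr]*sync.Mutex), disk: make(map[Addr][]byte)}
    }

    type Txn struct {
    	sys    *System
    	held   map[Addr]*sync.Mutex
    	writes map[Addr][]byte
    }

    func (s *System) Begin() *Txn {
    	return &Txn{sys: s, held: make(map[Addr]*sync.Mutex), writes: make(map[Addr][]byte)}
    }

    // lock acquires the per-address lock on first access (growing phase);
    // a real system must also avoid or recover from deadlock.
    func (t *Txn) lock(a Addr) {
    	if _, held := t.held[a]; held {
    		return
    	}
    	t.sys.mu.Lock()
    	l, ok := t.sys.locks[a]
    	if !ok {
    		l = new(sync.Mutex)
    		t.sys.locks[a] = l
    	}
    	t.sys.mu.Unlock()
    	l.Lock()
    	t.held[a] = l
    }

    func (t *Txn) Read(a Addr) []byte {
    	t.lock(a)
    	if v, ok := t.writes[a]; ok {
    		return v
    	}
    	return t.sys.disk[a]
    }

    func (t *Txn) Write(a Addr, v []byte) {
    	t.lock(a)
    	t.writes[a] = v
    }

    // Commit installs the buffered writes (a real system journals them for
    // crash atomicity) and then releases every lock held (shrinking phase).
    func (t *Txn) Commit() {
    	for a, v := range t.writes {
    		t.sys.disk[a] = v
    	}
    	for _, l := range t.held {
    		l.Unlock()
    	}
    }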
@inproceedings{daisy-nfs:osdi22, title = {Verifying the {DaisyNFS} concurrent and crash-safe file system with sequential reasoning}, author = {Tej Chajed and Joseph Tassarotti and Mark Theng and M. Frans Kaashoek and Nickolai Zeldovich}, booktitle = {Proceedings of the 16th {USENIX} {S}ymposium on {O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} '22)}, year = 2022, month = jul, }
Knox is a new framework that enables developers to build hardware security modules (HSMs) with high assurance through formal verification. The goal is to rule out all hardware bugs, software bugs, and timing side channels.
Knox’s approach is to relate an implementation’s wire-level behavior to a functional specification stated in terms of method calls and return values with a new definition called information-preserving refinement (IPR). This definition captures the notion that the HSM implements its functional specification, and that it leaks no additional information through its wire-level behavior. The Knox framework provides support for writing specifications, importing HSM implementations written in Verilog and C code, and proving IPR using a combination of lightweight annotations and interactive proofs.
To evaluate the IPR definition and the Knox framework, we verified three simple HSMs, including an RFC 6238-compliant TOTP token. The TOTP token is written in 2950 lines of Verilog and 360 lines of C and assembly. Its behavior is captured in a succinct specification: aside from the definition of the TOTP algorithm, the spec is only 10 lines of code. In all three case studies, verification covers entire hardware and software stacks and rules out hardware/software bugs and timing side channels.
@inproceedings{knox:osdi22, title = {Verifying Hardware Security Modules with Information-Preserving Refinement}, author = {Anish Athalye and M. Frans Kaashoek and Nickolai Zeldovich}, booktitle = {Proceedings of the 16th {USENIX} {S}ymposium on {O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} '22)}, year = 2022, month = jul, }
Critical systems software such as the file system is challenging to make correct due to the combination of concurrency in the implementation, for good performance, and the requirement to preserve data even on a crash, where the whole computer stops and reboots unexpectedly. To build reliable systems, this thesis extends formal verification — proving that a system always meets a mathematical specification of its behavior — to reason about concurrency and crash safety.
The thesis applies the new verification techniques to the verification of DaisyNFS, a new concurrent file system. The file system is an important service of an operating system, because nearly all persistent data is ultimately stored in a file system and bugs can lead to permanent data loss in any application running on top. Another contribution of the thesis is a system design and verification techniques to scale verification to a system of the size and complexity of a file system. In particular, the file system is designed around a transaction system that addresses the core challenges of crashes and concurrency, so that the rest of the code can be verified with comparatively simpler sequential reasoning.
An evaluation of proof overhead finds that verification required 2× as many lines of proof as code for the sequential reasoning, compared to 20× for the crash safety and concurrency proofs. A performance evaluation finds that DaisyNFS achieves good performance compared to Linux NFS exporting ext4 over a range of benchmarks: at least 60% the throughput even on the most challenging concurrent benchmarks, and 90% on other workloads.
@phdthesis{tchajed:thesis, title = {Verifying a concurrent, crash-safe file system with sequential reasoning}, author = {Tej Chajed}, school = {Massachusetts Institute of Technology}, year = 2022, month = may, }
Despite the enormous success of cloud computing, programming and deploying cloud applications remains challenging. Application developers are forced to either explicitly provision resources or limit the types of applications they write to fit a serverless framework such as AWS Lambda.
𝜎OS is a new multi-tenant cloud operating system that allows providers to manage resources for tenants while simplifying application development. A key contribution of 𝜎OS is its novel abstraction: realms. Realms present tenants with the illusion of a single-system image and abstract boundaries between physical machines. Developers structure their applications as processes, called procs in 𝜎OS. Much like a time-sharing OS multiplexes users’ processes across a machine’s cores, 𝜎OS multiplexes tenants’ procs across the cloud provider’s physical machines. Since each tenant tends to plan for peak load, realms can improve data center utilization by enabling providers to transparently reallocate partial machines to another tenant’s realm when load dips.
An evaluation of 𝜎OS demonstrates that a 𝜎OS-based MapReduce (𝜎OS-MR) implementation grows quickly from 1 core to 32 and scales near-perfectly, achieving a 15.26× speedup over the same implementation running on 2 cores. Similarly, an elastic Key-Value service built on 𝜎OS (𝜎OS-KV) cooperates with 𝜎OS to scale the number of kvd servers and balance shards across them according to client load. 𝜎OS also achieves high resource utilization when multiple tenants’ realms compete for a shared group of machines. For example, when 𝜎OS multiplexes a long-running 𝜎OS-MR job in one realm and a 𝜎OS-KV service with varying numbers of clients in another realm, 𝜎OS keeps utilization above 90% and transparently moves partial machines between the realms as the 𝜎OS-KV client load changes.
@mastersthesis{sigmaos:arielck-ms-thesis, title = {{𝜎OS}: Elastic {Realms} for Multi-Tenant Cloud Computing}, author = {Ariel Szekely}, school = {Massachusetts Institute of Technology}, year = 2022, month = may, }
This thesis considers the challenge of defending flight software from radiation errors without a radiation-hardened processor. A new real-time operating system, Vivid, explores the use of redundant multithreading to protect critical software components from radiation errors, and offers new abstractions to reduce the number of single points of vulnerability in the system. It introduces a static component initialization system for C, which eliminates most runtime initialization steps from the operating system and flight software. It introduces a partition scheduler based on execution clips, which ensures that software components always start from a safe state, and it protects the system’s safe state using a pair of memory scrubbers. Vivid introduces voting ducts, an inter-process communication primitive for redundant multithreading that eliminates single points of vulnerability from the voting process. Finally, it defines a sequence of repair that ultimately grounds the correct operation of all components in the system’s software in a hardware watchdog.
To demonstrate the applicability and effectiveness of Vivid, this thesis introduces Swivel, a testbench spacecraft, and describes SwivelFSW, which is the implementation of flight software that meets Swivel’s behavioral requirements, and SwivelSim, which is the simulation of Swivel’s avionics. Next, this thesis introduces Hailburst, a system for efficient processor emulation and radiation fault injection, and uses it to evaluate Vivid’s radiation tolerance through a series of accelerated radiation injection trials. In the tested configuration, Vivid tolerates approximately 149 out of every 150 injected radiation faults without any observed requirement failures, and recovers from the remaining 1 out of 150 radiation faults within at most 2.05 seconds of recovery time in the worst observed case. Because some of Vivid’s defenses appear to be more effective than others, and some may be counterproductive, this thesis discusses future work that would be required before Vivid’s abstractions could be applied to real-world flight software.
@mastersthesis{cela-meng, title = {{Vivid}: An Operating System Kernel for Radiation-Tolerant Flight Control Software}, author = {Cel Skeggs}, school = {Massachusetts Institute of Technology}, year = 2022, month = may, }
Bugs related to concurrency and crash safety are infamous for being subtle and hard to reproduce. Formal verification provides a way to combat such bugs through the use of machine-checked proofs about program behavior. However, reasoning about concurrency and crashes can be tricky, especially when scaling up to larger systems that must also have good performance.
This thesis discusses the verification of GoTxn, the concurrent, crash-safe transaction system underlying the verified Network File System (NFS) server DaisyNFS. It focuses on the specification and proof of the write-ahead log and the automatic two-phase locking interface used to enforce crash and concurrent atomicity in transactions, detailing how the verification framework Perennial can be used to manage assertions about crash behavior across multiple threads. By effectively harnessing concurrency to hide disk access latency, GoTxn enables performance in DaisyNFS similar to the unverified Linux NFS server.
@mastersthesis{mtheng-meng, title = {{GoTxn}: Verifying a Crash-Safe, Concurrent Transaction System}, author = {Mark Theng}, school = {Massachusetts Institute of Technology}, year = 2022, month = jan, }
This thesis makes two contributions: (1) a measurement study of the performance evolution of mitigations against transient execution attacks over generations of processors, and (2) the Ward kernel design, which eliminates as much as half the overhead of mitigations on older processors.
The measurement study maps end-to-end overheads to the specific mitigations that cause them. It reveals that hardware fixes for several transient execution attacks have reduced overheads on OS-heavy workloads by a factor of ten. However, overheads for JavaScript applications have remained roughly flat because they are caused by mitigations for attacks to which even the most recent processors are still vulnerable. Finally, the study shows that a few mitigations account for most of the performance cost.
Ward is a novel operating system architecture that is resilient to transient execution attacks, yet avoids expensive software mitigations that existing operating systems employ when running on pre-2018 processors. It leverages a new hardware/software contract termed the Unmapped Speculation Contract, which describes limits on the speculative behavior of processors.
@phdthesis{behrensj:thesis, title = {Understanding and Improving the Performance of Mitigating Transient Execution Attacks}, author = {Jonathan Behrens}, school = {Massachusetts Institute of Technology}, year = 2022, month = jan, }
@inproceedings{pir:eurocrypt22, title = {Single-Server Private Information Retrieval with Sublinear Amortized Time}, author = {Corrigan-Gibbs, Henry and Henzinger, Alexandra and Kogan, Dmitry}, booktitle = {41st Annual International Conference on the Theory and Applications of Cryptographic Techniques ({EUROCRYPT} '22)}, year = 2022, }
The main contribution of this paper is GoJournal, a verified, concurrent journaling system that provides atomicity for storage applications, together with Perennial 2.0, a framework for formally specifying and verifying concurrent crash-safe systems. GoJournal’s goal is to bring the advantages of journaling for code to specs and proofs. Perennial 2.0 makes this possible by introducing several techniques to formalize GoJournal’s specification and to manage the complexity in the proof of GoJournal’s implementation. Lifting predicates and crash framing make the specification easy to use for developers, and logically atomic crash specifications allow for modular reasoning in GoJournal, making the proof tractable despite complex concurrency and crash interleavings.
GoJournal is implemented in Go, and Perennial is implemented in the Coq proof assistant. While verifying GoJournal, we found one serious concurrency bug, even though GoJournal has many unit tests. We built a functional NFSv3 server, called GoNFS, to use GoJournal. Performance experiments show that GoNFS provides similar performance (e.g., at least 90% throughput across several benchmarks on an NVMe disk) to Linux’s NFS server exporting an ext4 file system, suggesting that GoJournal is a competitive journaling system. We also verified a simple NFS server using GoJournal’s specs, which confirms that they are helpful for application verification: a significant part of the proof doesn’t have to consider concurrency and crashes.
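To illustrate the kind of interface a journaling layer offers to storage code (hypothetical names, not GoJournal's exported API), an operation stages block writes and commits them so that after a crash either all of the staged writes are visible or none are.

    // Hypothetical sketch of using a journaling layer for crash-atomic
    // multi-block updates: writes are staged in an operation and become
    // durable together at Commit. Names are illustrative, not GoJournal's API.
    package fsop

    type Block = []byte

    type Journal interface {
    	Begin() Op
    }

    type Op interface {
    	Read(addr uint64) Block
    	Write(addr uint64, b Block)
    	Commit() error // after a crash, either all staged writes or none are on disk
    }

    // CreateFile atomically updates an inode block and a directory block:
    // a crash can never leave the inode allocated without the directory entry.
    func CreateFile(j Journal, inodeAddr, dirAddr uint64, inode, dirent Block) error {
    	op := j.Begin()
    	op.Write(inodeAddr, inode)
    	op.Write(dirAddr, dirent)
    	return op.Commit()
    }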
@inproceedings{gojournal:osdi21, title = {{GoJournal}: a verified, concurrent, crash-safe journaling system}, author = {Tej Chajed and Joseph Tassarotti and Mark Theng and Ralf Jung and M. Frans Kaashoek and Nickolai Zeldovich}, booktitle = {Proceedings of the 15th {USENIX} {S}ymposium on {O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} '21)}, year = 2021, month = jul, }
Providing privacy in complex, data-rich applications is hard. Deleting accounts, anonymizing an account's contributions, and other privacy-related actions may require the traversal and transformation of interwoven state in a relational database. Finding the affected data is already nontrivial, but privacy actions must additionally balance competing requirements, such as preserving data trails for legal reasons or allowing users to change their mind. We believe a systematic shared framework for specifying and implementing privacy transformations could simplify and empower applications. Our prototype, data disguising, supports fine-grained, nuanced, and useful policies that would be cumbersome to implement manually.
@inproceedings{tslilyai:hotos21, author = {Tsai, Lillian and Schwarzkopf, Malte and Kohler, Eddie}, title = {Privacy Heroes Need Data Disguises}, month = jun, year = 2021, pages = {112–118}, booktitle = {Proceedings of the 18th {W}orkshop on {H}ot {T}opics in {O}perating {S}ystems ({HotOS XVIII})}, address = {Ann Arbor, Michigan}, }
This paper presents rtlv, an approach for push-button formal verification of properties that involve software running on hardware for many cycles. For example, rtlv can be used to prove that executing boot code resets a processor's microarchitectural state to known deterministic values, while existing tools time out when attempting to verify such a property.
Two key ideas enable rtlv to handle reasoning about many cycles of circuit execution. First, rtlv uses hybrid symbolic execution to reason about a circuit with symbolic values while minimizing the complexity of symbolic expressions; this is achieved by compiling circuits to programs in the Rosette solver-aided language. Second, rtlv enables the development of reusable circuit-agnostic property checkers that have a performance hint interface, allowing developers to optimize verification performance while maintaining confidence that the proof is correct.
Using rtlv, formally verifying a state-clearing property for a small (1,300 flip-flop) RISC-V SoC takes only 1.3 seconds, while SymbiYosys, a popular open-source verification tool, is unable to finish within 12 hours. In another case study, rtlv scales to a larger 4,300 flip-flop RISC-V SoC where verifying this state-clearing property requires modeling over 20,000 cycles of software executing on hardware. Formal verification with rtlv helped us find and fix violations of the property in the baseline hardware, demonstrating that rtlv is useful for finding bugs.
@inproceedings{rtlv:carrv21, title = {{rtlv}: push-button verification of software on hardware}, author = {Noah Moroze and Anish Athalye and M. Frans Kaashoek and Nickolai Zeldovich}, booktitle = {Proceedings of the 5th {W}orkshop on {C}omputer {A}rchitecture {R}esearch with {RISC-V} ({CARRV} 2021)}, year = 2021, month = jun, }
Transaction authentication is an attractive extension to two-factor authentication. It is proposed in the WebAuthn standard by the World Wide Web Consortium (W3C) as a mechanism to secure individual "high-risk" operations of a website via a hardware authenticator device. It defends against a stringent threat model in which an adversary can modify or create HTTP requests between the user and the web service. Transaction authentication as defined by WebAuthn is not yet adopted in practice, partially because it requires intrusive web application changes.
This thesis presents Guarda, a firewall for integrating transaction authentication into a new or existing web service with relatively few code changes. The firewall intercepts all HTTP traffic sent to the web service, and based on the configuration, any requests deemed safe are proxied directly to the web service. All other requests are considered high-risk and are held back and validated using transaction authentication. Only if the validation passes are they also permitted to pass through to the web service.
This thesis uses the firewall approach to integrate transaction authentication into three web applications: a blogging site named Conduit, a WordPress admin panel named Calypso, and a self-hosted Git service named Gogs. Compared to directly modifying them to support transaction authentication, the firewall approach is close to 8 times more concise. Under heavy load, securing Gogs with Guarda adds a latency penalty of at worst 1.5x compared to accessing the web service directly without WebAuthn.
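A minimal sketch of the interception pattern behind this firewall approach, using Go's standard reverse proxy: requests matching a placeholder high-risk rule are held for validation, and everything else is proxied straight through to the backend. The rule, port numbers, and validation hook are illustrative assumptions, not Guarda's actual configuration or its WebAuthn exchange.

    // Minimal sketch of a transaction-authentication firewall in front of a
    // web service: safe requests are proxied directly, while high-risk
    // requests are held for validation first. The policy and validation hook
    // are placeholders for the real WebAuthn transaction-authentication flow.
    package main

    import (
    	"log"
    	"net/http"
    	"net/http/httputil"
    	"net/url"
    )

    // isHighRisk is a placeholder policy: here, anything that mutates state.
    func isHighRisk(r *http.Request) bool {
    	return r.Method != http.MethodGet && r.Method != http.MethodHead
    }

    // validateWithAuthenticator stands in for the WebAuthn exchange with the
    // user's hardware authenticator.
    func validateWithAuthenticator(r *http.Request) bool {
    	return false // placeholder: deny until the user approves the transaction
    }

    func main() {
    	backend, err := url.Parse("http://localhost:8080") // the protected web service
    	if err != nil {
    		log.Fatal(err)
    	}
    	proxy := httputil.NewSingleHostReverseProxy(backend)

    	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
    		if isHighRisk(r) && !validateWithAuthenticator(r) {
    			http.Error(w, "transaction authentication required", http.StatusForbidden)
    			return
    		}
    		proxy.ServeHTTP(w, r)
    	})
    	log.Fatal(http.ListenAndServe(":8443", nil))
    }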
@mastersthesis{barabonkov-meng, title = {Guarda: A web application firewall for {WebAuthn} transaction authentication}, author = {Damian Barabonkov}, school = {Massachusetts Institute of Technology}, year = 2021, month = may, }
Notary uses formal verification to prove a hardware-level security property called deterministic start for a simple system-on-chip (SoC). Deterministic start requires that an SoC's state is fully reset by boot code to ensure that secrets cannot leak across reset boundaries. However, Notary's approach has several limitations. Its security property requires that all of the SoC's microarchitectural state can be reset to known values through software, and the property and proof technique apply only to SoCs with a single clock domain. These limitations prevent Notary's approach from being applied to more complex systems.
This thesis addresses these limitations through Kronos, a system consisting of a verified SoC that satisfies a new security property called output determinism. Output determinism provides the same security guarantees as Notary without requiring that all of an SoC's state be reset by software. The SoC used in Kronos, called MicroTitan, is based on the open-source OpenTitan and includes multiple clock domains. This thesis evaluates Kronos and demonstrates that existing open-source hardware can be modified to satisfy output determinism with minimal changes, and that the process of proving output determinism reveals hardware issues that violate desired security guarantees.
@mastersthesis{moroze-meng, title = {Kronos: Verifying leak-free reset for a system-on-chip with multiple clock domains}, author = {Noah Moroze}, school = {Massachusetts Institute of Technology}, year = 2021, month = jan, }
Modern datacenter applications are composed of hundreds of microservices with high degrees of fanout. As a result, they are sensitive to tail latency and require high request throughputs. Maintaining these characteristics under overload is difficult, especially for RPCs with short service times. In this paper, we consider the challenging case of microsecond-scale RPCs, where the cost of communicating information and dropping a request is similar to the cost of processing a request. We present Breakwater, an overload control scheme that can prevent overload in microsecond-scale services through a new, server-driven admission control scheme that issues credits based on server-side queueing delay. Breakwater contributes several techniques to amortize communication costs. It engages in demand speculation, where it assumes clients have unmet demand and issues additional credits when the server is not overloaded. Moreover, it piggybacks client-side demand information in RPC requests and credits in RPC responses. To cope with the occasional bursts in load caused by demand speculation, Breakwater drops requests when overloaded using active queue management. When clients’ demand spikes unexpectedly to 1.4x capacity, Breakwater converges to stable performance in less than 20 ms with no congestion collapse while DAGOR and SEDA take 500 ms and 1.58 s to recover from congestion collapse, respectively.
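To make the credit-based admission idea concrete (a simplified sketch with made-up constants, not Breakwater's implementation), the server periodically compares its measured queueing delay against a target and grows or shrinks the total pool of credits that clients must hold before sending requests.

    // Simplified sketch of server-driven, credit-based overload control in
    // the spirit of Breakwater: the server adjusts the total number of
    // outstanding credits based on measured queueing delay, issuing more
    // (including speculatively, to clients with unknown demand) when
    // underloaded and reclaiming them when delay exceeds the target.
    // Constants and structure are illustrative only.
    package overload

    import "time"

    type CreditPool struct {
    	target  time.Duration // queueing-delay target
    	credits int           // total credits currently issued to clients
    	min     int           // floor to avoid starving all clients
    }

    // Update is called periodically with the server's current queueing delay.
    func (p *CreditPool) Update(queueDelay time.Duration) {
    	if queueDelay < p.target {
    		// Underloaded: issue additional credits (demand speculation).
    		p.credits += 1 + p.credits/10
    	} else {
    		// Overloaded: shrink the pool multiplicatively to shed load quickly.
    		p.credits = p.credits * 4 / 5
    		if p.credits < p.min {
    			p.credits = p.min
    		}
    	}
    }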
@inproceedings{breakwater:osdi20, title = {Overload Control for µs-scale {RPC}s with {Breakwater}}, author = {Inho Cho and Ahmed Saeed and Joshua Fried and Seo Jin Park and Mohammad Alizadeh and Adam Belay}, booktitle = {Proceedings of the 14th {USENIX} {S}ymposium on {O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} '20)}, year = 2020, month = nov, }
The conventional wisdom is that CPU resources such as cores, caches, and memory bandwidth must be partitioned to achieve performance isolation between tasks. Both the widespread availability of cache partitioning in modern CPUs and the recommended practice of pinning latency-sensitive applications to dedicated cores attest to this belief.
In this paper, we show that resource partitioning is neither necessary nor sufficient. Many applications experience bursty request patterns or phased behavior, drastically changing the amount and type of resources they need. Unfortunately, partitioning-based systems fail to react quickly enough to keep up with these changes, resulting in extreme spikes in latency and lost opportunities to increase CPU utilization.
Caladan is a new CPU scheduler that can achieve significantly better quality of service (tail latency, throughput, etc.) through a collection of control signals and policies that rely on fast core allocation instead of resource partitioning. Caladan consists of a centralized scheduler core that actively manages resource contention in the memory hierarchy and between hyperthreads, and a kernel module that bypasses the standard Linux Kernel scheduler to support microsecond-scale monitoring and placement of tasks. When colocating memcached with a best-effort, garbage-collected workload, Caladan outperforms Parties, a state-of-the-art resource partitioning system, by 11,000x, reducing tail latency from 580 ms to 52 µs during shifts in resource usage while maintaining high CPU utilization.
@inproceedings{caladan:osdi20, title = {Caladan: Mitigating Interference at Microsecond Timescales}, author = {Joshua Fried and Zhenyuan Ruan and Amy Ousterhout and Adam Belay}, booktitle = {Proceedings of the 14th {USENIX} {S}ymposium on {O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} '20)}, year = 2020, month = nov, }
Today’s kernels pay a performance penalty for mitigations—such as KPTI, retpoline, return stack stuffing, and speculation barriers—to protect against transient execution side-channel attacks such as Meltdown and Spectre.
To address this performance penalty, this paper articulates the unmapped speculation contract, an observation that memory that isn’t mapped in a page table cannot be leaked through transient execution. To demonstrate the value of this contract, the paper presents Ward, a new kernel design that maintains a separate kernel page table for every process. This page table contains mappings for kernel memory that is safe to expose to that process. Because a process doesn’t map data of other processes, this design allows for many system calls to execute without any mitigation overhead. When a process needs access to sensitive data, Ward switches to a kernel page table that provides access to all of memory and executes with all mitigations.
An evaluation of the Ward design implemented in the sv6 research kernel shows that Ward can execute many system calls without mitigations. For some hardware generations, this results in performance improvements ranging from a few percent (huge page fault) to several factors (getpid), compared to a standard design with mitigations.
@inproceedings{ward:osdi20, title = {Efficiently Mitigating Transient Execution Attacks using the Unmapped Speculation Contract}, author = {Jonathan Behrens and Anton Cao and Cel Skeggs and Adam Belay and M. Frans Kaashoek and Nickolai Zeldovich}, booktitle = {Proceedings of the 14th {USENIX} {S}ymposium on {O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} '20)}, year = 2020, month = nov, }
Memory is the most contended and least elastic resource in datacenter servers today. Applications can use only local memory---which may be scarce---even though memory might be readily available on another server. This leads to unnecessary killings of workloads under memory pressure and reduces effective server utilization.
We present application-integrated far memory (AIFM), which makes remote, "far" memory available to applications through a simple API and with high performance. AIFM achieves the same common-case access latency for far memory as for local RAM; it avoids read and write amplification that paging-based approaches suffer; it allows data structure engineers to build remoteable, hybrid near/far memory data structures; and it makes far memory transparent and easy to use for application developers.
Our key insight is that exposing application-level semantics to a high-performance runtime makes efficient remoteable memory possible. Developers use AIFM's APIs to make allocations remoteable, and AIFM's runtime handles swapping objects in and out, prefetching, and memory evacuation.
We evaluate AIFM with a prototypical web application frontend, a NYC taxi data analytics workload, a memcached-like key-value cache, and Snappy compression. Adding AIFM remoteable memory to these applications increases their available memory without performance penalty. AIFM outperforms Fastswap, a state-of-the-art kernel-integrated, paging-based far memory system by up to 61X.
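The sketch below illustrates the remoteable-pointer idea in the abstract; AIFM itself is a C++ runtime with dereference scopes, prefetching, and runtime-driven evacuation, so the Go types and names here are only illustrative.

    // Illustration of a remoteable pointer: the object may live in local or
    // far (remote) memory, and dereferencing swaps it in on demand. This is
    // only a sketch of the idea; AIFM's actual API and runtime differ.
    package farmem

    type FarStore interface {
    	Get(id uint64) []byte       // fetch object bytes from far memory
    	Put(id uint64, data []byte) // write object bytes back to far memory
    }

    type RemPtr struct {
    	id    uint64
    	local []byte // nil when the object has been evacuated to far memory
    	store FarStore
    }

    // Deref returns a local copy of the object, fetching it from far memory
    // if it is not currently resident.
    func (p *RemPtr) Deref() []byte {
    	if p.local == nil {
    		p.local = p.store.Get(p.id)
    	}
    	return p.local
    }

    // Evacuate writes the object back to far memory and frees local space;
    // in AIFM the runtime decides when to do this under memory pressure.
    func (p *RemPtr) Evacuate() {
    	if p.local != nil {
    		p.store.Put(p.id, p.local)
    		p.local = nil
    	}
    }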
@inproceedings{aifm:osdi20, title = {{AIFM}: High-Performance, Application-Integrated Far Memory}, author = {Zhenyuan Ruan and Malte Schwarzkopf and Marcos K. Aguilera and Adam Belay}, booktitle = {Proceedings of the 14th {USENIX} {S}ymposium on {O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} '20)}, year = 2020, month = nov, }
This thesis proposes a practical database system that lowers latency and increases supported load for read-heavy applications by using incrementally-maintained materialized views to cache query results. As opposed to state-of-the-art materialized view systems, the presented system builds the cache on demand, keeps it updated, and evicts cache entries in response to a shifting workload.
The enabling technique the thesis introduces is partially stateful materialization, which allows entries in materialized views to be missing. The thesis proposes upqueries as a mechanism to fill such missing state on demand using dataflow, and implements them in the materialized view system Noria. The thesis details issues that arise when dataflow updates and upqueries race with one another, and introduces mechanisms that uphold eventual consistency in the face of such races.
Partial materialization saves application developers from having to implement ad hoc caching mechanisms to speed up their database accesses. Instead, the database has transparent caching built in. Experimental results suggest that the presented system increases supported application load by up to 20x over MySQL and performs similarly to an optimized key-value store cache. Partial state also reduces memory use by up to 2/3 compared to traditional materialized views.
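A schematic sketch of partially stateful materialization (not Noria's code): a view lookup that misses triggers an upquery, which recomputes just the missing entry from upstream state and installs it so later reads hit; evicted entries are refilled the same way.

    // Schematic sketch of a partially stateful materialized view: entries are
    // filled on demand. A miss triggers an "upquery" that recomputes only the
    // missing key from upstream (base-table) state, and subsequent reads hit.
    // Incremental maintenance on writes is omitted; this is not Noria's code.
    package partial

    type View struct {
    	cache   map[string][]string
    	upquery func(key string) []string // recompute one key from upstream state
    }

    func NewView(upquery func(string) []string) *View {
    	return &View{cache: make(map[string][]string), upquery: upquery}
    }

    func (v *View) Lookup(key string) []string {
    	if rows, ok := v.cache[key]; ok {
    		return rows // hit: served from materialized state
    	}
    	rows := v.upquery(key) // miss: fill this entry on demand
    	v.cache[key] = rows
    	return rows
    }

    // Evict drops an entry under memory pressure; a later lookup refills it.
    func (v *View) Evict(key string) {
    	delete(v.cache, key)
    }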
@phdthesis{jfrg:thesis, title = {Partial State in Dataflow-Based Materialized Views}, author = {Jon Gjengset}, school = {Massachusetts Institute of Technology}, year = 2020, month = nov, }
Modern web applications store data in backend databases, and access it through a variety of frontend queries. User permissions are implemented by checks on those queries, but a maliciously injected (or simply buggy) query can easily leak private data. Multiverse databases attempt to prevent these data leaks by creating a separate view of the database contents (or "universe") for each user, and enforcing in the backend that this universe contains only data that the user is allowed to query. These views are precomputed and materialized using a streaming dataflow system so that queries return promptly.
This design is difficult to make efficient. A simple approach makes copies of data and operators for each universe, but state size that increases proportionally to the number of users quickly becomes impractical. In this work, we developed optimizations for multiverse dataflow graphs, which aim to reuse the same computations (i.e. dataflow subgraphs) in many different universes while maintaining security invariants.
We evaluate these optimizations in the context of the HotCRP and Piazza web applications. The resulting graphs are about 2x more space-efficient and 3x more computation-efficient than the naive ones. Graph size and processing time still scale linearly with the number of users, so our design may still not be efficient enough to be practical, but our optimizations make progress toward making multiverse databases a feasible solution to web application security.
@mastersthesis{bredenberg-meng, title = {Optimizations for Performant Multiverse Databases}, author = {Jacqueline M. Bredenberg}, school = {Massachusetts Institute of Technology}, year = 2020, month = may, }
Goose is a tool for importing programs written in a subset of Go into Coq. Proving properties in Coq about a translated Go program implies that these properties hold for the original Go program, assuming that Goose is correct. Testing the correctness of Goose is challenging, however, because the Coq translation, called GooseLang, is not an executable semantics.
This thesis presents Waddle, a testing framework for Goose. Waddle consists of an interpreter for GooseLang, which makes GooseLang programs executable, a proof that the interpreter implements GooseLang correctly, and a framework for running GooseLang programs through the interpreter and checking that they return the same result as their corresponding Goose program. This thesis evaluates Waddle as a test-driven development framework and as a bug finding tool, and describes several bugs that Waddle has found.
@mastersthesis{gibsons-meng, title = {Waddle: A proven interpreter and test framework for a subset of the {Go} semantics}, author = {Sydney Gibson}, school = {Massachusetts Institute of Technology}, year = 2020, month = may, }
Notary is a new design for a hardware wallet, a type of security token that is used to protect sensitive transactional operations like cryptocurrency transfers. Notary aims to be more secure than past hardware wallets by eliminating classes of bugs by design and by formally proving the correctness of the key operation used in its implementation. We built a physical prototype of Notary and showed that it achieves functionality similar to existing hardware wallets while avoiding many bugs that affect them.
@article{notary:login20, title = {{Notary}: A Device for Secure Transaction Approval}, author = {Anish Athalye and Adam Belay and M. Frans Kaashoek and Robert Morris and Nickolai Zeldovich}, year = 2020, pages = {12--16}, volume = 45, number = 1, month = mar, journal = {{{;login: The USENIX Magazine}}}, }
This dissertation presents a scalable approach to protecting metadata (who is communicating with whom) in a communication system. The emphasis in this dissertation is on hiding metadata for voice calls, but the approach is applicable to any two-way communication between users.
Our approach is embodied in a new system named Yodel, the first system for voice calls that hides metadata from a powerful adversary that controls the network and compromises servers. Voice calls require sub-second message latency, but low latency has been difficult to achieve in prior work where processing each message requires an expensive public key operation at each hop in the network. Yodel avoids this expense with the idea of self-healing circuits, reusable paths through a mix network that use only fast symmetric cryptography. Once created, these circuits are resilient to passive and active attacks from global adversaries. Creating and connecting to these circuits without leaking metadata is another challenge that Yodel addresses with the idea of guarded circuit exchange, where each user creates a backup circuit in case an attacker tampers with their traffic. We evaluate Yodel across the internet and it achieves acceptable voice quality with 990 ms of latency for 5 million simulated users.
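For intuition about why circuits that use only symmetric cryptography make per-message processing cheap (a generic onion-layering illustration, not Yodel's protocol or message format), a client that shares an AES-GCM key with each hop wraps its message in one layer per hop, and each hop strips exactly one layer with a fast symmetric operation.

    // Generic illustration of layered (onion) encryption with only symmetric
    // cryptography, the property that makes per-message circuit processing
    // cheap. Circuit setup, padding, and Yodel's actual wire format are
    // omitted; this is not Yodel's protocol.
    package onion

    import (
    	"crypto/aes"
    	"crypto/cipher"
    	"crypto/rand"
    )

    func seal(key, plaintext []byte) ([]byte, error) {
    	block, err := aes.NewCipher(key)
    	if err != nil {
    		return nil, err
    	}
    	aead, err := cipher.NewGCM(block)
    	if err != nil {
    		return nil, err
    	}
    	nonce := make([]byte, aead.NonceSize())
    	if _, err := rand.Read(nonce); err != nil {
    		return nil, err
    	}
    	// Prepend the nonce so the hop holding the key can strip this layer.
    	return append(nonce, aead.Seal(nil, nonce, plaintext, nil)...), nil
    }

    // Wrap encrypts msg once per hop, innermost layer first, so that hop i
    // can remove exactly one layer with its shared key keys[i].
    func Wrap(keys [][]byte, msg []byte) ([]byte, error) {
    	out := msg
    	for i := len(keys) - 1; i >= 0; i-- {
    		var err error
    		if out, err = seal(keys[i], out); err != nil {
    			return nil, err
    		}
    	}
    	return out, nil
    }

    // Unwrap is run by each hop in turn to strip its own layer.
    func Unwrap(key, layered []byte) ([]byte, error) {
    	block, err := aes.NewCipher(key)
    	if err != nil {
    		return nil, err
    	}
    	aead, err := cipher.NewGCM(block)
    	if err != nil {
    		return nil, err
    	}
    	n := aead.NonceSize()
    	return aead.Open(nil, layered[:n], layered[n:], nil)
    }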
@phdthesis{lazard:thesis, title = {Strong and Scalable Metadata Security for Voice Calls}, author = {David Lazar}, school = {Massachusetts Institute of Technology}, year = 2020, month = feb, }
Yodel is the first system for voice calls that hides metadata (e.g. who is communicating with whom) from a powerful adversary that controls the network and compromises servers. Voice calls require sub-second message latency, but low latency has been difficult to achieve in prior work where processing each message requires an expensive public key operation at each hop in the network. Yodel avoids this expense with the idea of self-healing circuits, reusable paths through a mix network that use only fast symmetric cryptography. Once created, these circuits are resilient to passive and active attacks from global adversaries. Creating and connecting to these circuits without leaking metadata is another challenge that Yodel addresses with the idea of guarded circuit exchange, where each user creates a backup circuit in case an attacker tampers with their traffic. We evaluate Yodel across the internet and it achieves acceptable voice quality with 990ms of latency for 5 million simulated users.
@inproceedings{yodel:sosp19, title = {{Yodel}: Strong Metadata Security for Voice Calls}, author = {David Lazar and Yossi Gilad and Nickolai Zeldovich}, booktitle = {Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP 2019)}, year = 2019, month = oct, address = {Huntsville, ON, Canada}, }
Notary is a new hardware and software architecture for running isolated approval agents in the form factor of a USB stick with a small display and buttons. Approval agents allow factoring out critical security decisions, such as getting the user’s approval to sign a Bitcoin transaction or to delete a backup, to a secure environment. The key challenge addressed by Notary is to securely switch between agents on the same device. Prior systems either avoid the problem by building single-function devices like a USB U2F key, or they provide weak isolation that is susceptible to kernel bugs, side channels, or Rowhammer-like attacks. Notary achieves strong isolation using reset-based switching, along with the use of physically separate systems-on-a-chip for agent code and for the kernel, and a machine-checked proof of both the hardware’s register-transfer-level design and software, showing that reset-based switching leaks no state. Notary also provides a trustworthy I/O path between the agent code and the user, which prevents an adversary from tampering with the user’s screen or buttons.
We built a hardware/software prototype of Notary, using a combination of ARM and RISC-V processors. The prototype demonstrates that it is feasible to verify Notary’s reset-based switching, and that Notary can support diverse agents, including cryptocurrencies and a transaction approval agent for traditional client-server applications such as websites. Measurements of reset-based switching show that it is fast enough for interactive use. We analyze security bugs in existing cryptocurrency hardware wallets, which aim to provide a similar form factor and feature set as Notary, and show that Notary’s design avoids many bugs that affect them.
@inproceedings{notary:sosp19, title = {{Notary}: A Device for Secure Transaction Approval}, author = {Anish Athalye and Adam Belay and M. Frans Kaashoek and Robert Morris and Nickolai Zeldovich}, booktitle = {Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP 2019)}, year = 2019, month = oct, address = {Huntsville, Ontario, Canada}, }
This paper introduces Perennial, a framework for verifying concurrent, crash-safe systems. Perennial extends the Iris concurrency framework with three techniques to enable crash-safety reasoning: recovery leases, recovery helping, and versioned memory. To ease development and deployment of applications, Perennial provides Goose, a subset of Go and a translator from that subset to a model in Perennial with support for reasoning about Go threads, data structures, and file-system primitives. We implemented and verified a crash-safe, concurrent mail server using Perennial and Goose that achieves speedup on multiple cores. Both Perennial and Iris use the Coq proof assistant, and the mail server and the framework’s proofs are machine checked.
@inproceedings{perennial:sosp19, title = {Verifying Concurrent, Crash-safe Systems with {Perennial}}, author = {Tej Chajed and Joseph Tassarotti and M. Frans Kaashoek and Nickolai Zeldovich}, booktitle = {Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP 2019)}, year = 2019, month = oct, address = {Huntsville, Ontario, Canada}, }
Streaming dataflow systems offer an appealing alternative to classic MySQL / memcached web backend stacks. But websites must not go down, and current fault tolerance techniques for dataflow systems either come with long downtimes during recovery, or fail to scale to large deployments due to the overhead of global coordination. For example, after the failure of a single dataflow node, existing lineage-based techniques take a long time to recompute all lost and downstream state, while checkpointing techniques require costly global coordination for rollback recovery.
This thesis presents a causal logging approach to fault tolerance that rolls back and replays the execution of only the failed node, without any global coordination. The key to knowing how to replay a valid execution while ensuring exactly-once semantics is a small, constant-size tree clock piggybacked onto each message, incurring runtime overheads that are low and scalable. After recovery, the state of the system is indistinguishable from one that never failed at all.
We implement and evaluate the protocol on Noria, a streaming dataflow backend for read-heavy web applications. Compared to Noria’s original protocol of lineage-based recovery, tree clock recovery time is constant in relation to state size and graph size. Experimental results show sub-second recovery times with 1.5ms runtime overheads, which translates to a 290x improvement in recovery time.
@mastersthesis{yuan-meng, title = {Scalable Fault Tolerance for High-Performance Streaming Dataflow}, author = {Gina Yuan}, school = {Massachusetts Institute of Technology}, year = 2019, month = sep, }
Our research seeks to understand the problem space of GDPR-like compliance: what widely-used abstractions are convenient for developers, and efficiently implemented, but hinder compliance? We are designing new, fundamentally privacy-centric computer systems abstractions that seek to achieve compliance with GDPR-like legislation by default.
@inproceedings{schwarzkopf2019gdprcbyc, author = {Schwarzkopf, Malte and Kohler, Eddie and Kaashoek, M. Frans and Morris, Robert}, title = {{GDPR} Compliance by Construction}, booktitle = {Proceedings of the 2019 VLDB Workshop Towards Polystores that manage multiple Databases, Privacy, Security and/or Policy Issues for Heterogenous Data (Poly)}, month = aug, year = 2019, }
Storage systems make persistence guarantees even if the system crashes at any time, which they achieve using recovery procedures that run after a crash. We present Argosy, a framework for machine-checked proofs of storage systems that supports layered recovery implementations with modular proofs. Reasoning about layered recovery procedures is especially challenging because the system can crash in the middle of a more abstract layer's recovery procedure and must start over with the lowest-level recovery procedure.
This paper introduces recovery refinement, a set of conditions that ensure proper implementation of an interface with a recovery procedure. Argosy includes a proof that recovery refinements compose, using Kleene algebra for concise definitions and metatheory. We implemented Crash Hoare Logic, the program logic used by FSCQ, to prove recovery refinement, and demonstrated the whole system by verifying an example of layered recovery featuring a write-ahead log running on top of a disk replication system. The metatheory of the framework, the soundness of the program logic, and these examples are all verified in the Coq proof assistant.
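A minimal sketch of the layered-recovery setting that recovery refinement is designed for, with placeholder layers: after any crash, the machine reboots into the lowest layer's recovery procedure and re-runs recovery bottom-up, so a crash in the middle of a higher layer's recovery simply causes the whole sequence to start over. This illustrates the execution model only, not Argosy's formal definitions.

// A toy model of layered recovery: recovery always restarts from the
// bottom layer, so composed recovery procedures must tolerate being
// re-run from scratch after a crash at any point.
package main

import "fmt"

type layer struct {
	name    string
	recover func()
}

// rebootAndRecover runs recovery bottom-up; a crash anywhere simply
// causes this same function to run again from the first layer.
func rebootAndRecover(stack []layer) {
	for _, l := range stack {
		fmt.Println("recovering layer:", l.name)
		l.recover()
	}
}

func main() {
	stack := []layer{
		{"replicated disk", func() { fmt.Println("  resync replicas") }},
		{"write-ahead log", func() { fmt.Println("  replay or discard log") }},
	}
	rebootAndRecover(stack) // safe to run after any crash, including crashes during recovery
}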
@inproceedings{argosy:pldi19, title = {Argosy: Verifying Layered Storage Systems with Recovery Refinement}, author = {Tej Chajed and Joseph Tassarotti and M. Frans Kaashoek and Nickolai Zeldovich}, booktitle = {Proceedings of the 40th {ACM} {SIGPLAN} Conference on Programming Language Design and Implementation ({PLDI} 2019)}, year = 2019, month = jun, address = {Phoenix, Arizona}, }
This dissertation presents an evaluation of the use of a high-level language (HLL) with garbage collection to implement a monolithic POSIX-style kernel. The goal is to explore if it is reasonable to use an HLL instead of C for such kernels, by examining performance costs, implementation challenges, and programmability and safety benefits.
This dissertation contributes Biscuit, a kernel written in Go that implements enough of POSIX (virtual memory, mmap, TCP/IP sockets, a logging file system, poll, etc.) to execute significant applications. Biscuit makes liberal use of Go's HLL features (closures, channels, maps, interfaces, garbage collected heap allocation), which subjectively made programming easier. The most challenging puzzle was handling the possibility of running out of kernel heap memory; Biscuit benefited from the analyzability of Go source to address this challenge.
On a set of kernel-intensive benchmarks (including NGINX and Redis) the fraction of kernel CPU time Biscuit spends on HLL features (primarily garbage collection and thread stack expansion checks) ranges up to 13%. The longest single GC-related pause suffered by NGINX was 115 microseconds; the longest observed sum of GC delays to a complete NGINX client request was 582 microseconds. In experiments comparing nearly identical system call, page fault, and context switch code paths written in Go and C, the Go version was 5% to 15% slower.
@phdthesis{biscuit:thesis, title = {The Benefits and Costs of Writing a POSIX Kernel in a High-Level Language}, author = {Cody Cutler}, school = {Massachusetts Institute of Technology}, year = 2019, month = jun, }
A multiverse database transparently presents each application user with a flexible, dynamic, and independent view of shared data. This transformed view of the entire database contains only information allowed by a centralized and easily-auditable privacy policy. By enforcing the privacy policy once, in the database, multiverse databases reduce programmer burden and eliminate many frontend bugs that expose sensitive data.
Multiverse databases' per-user transformations risk expensive queries if applied dynamically on reads, or impractical storage requirements if the database proactively materializes policy-compliant views. We propose an efficient design based on a joint dataflow across "universes" that combines global, shared computation and cached state with individual, per-user processing and state. This design, which supports arbitrary SQL queries and complex policies, imposes no performance overhead on read queries. Our early prototype supports thousands of parallel universes on a single server.
@inproceedings{multiversedb:hotos19, title = {Towards Multiverse Databases}, author = {Alana Marzoev and Lara Timb\'{o} Ara\'{u}jo and Malte Schwarzkopf and Samyukta Yagati and Eddie Kohler and Robert Morris and M. Frans Kaashoek and Sam Madden}, pages = {88-95}, month = may, year = 2019, booktitle = {Proceedings of the 17th {W}orkshop on {H}ot {T}opics in {O}perating {S}ystems ({HotOS XVII})}, address = {Bertinoro, Italy}, }
MCQC is a compiler for extracting verified systems programs to low-level assembly, with no runtime or garbage collection requirements and an emphasis on performance. MCQC targets the Gallina functional language used in the Coq proof assistant. MCQC translates pure and recursive functions into C++17, while compiling monadic effectful functions to imperative C++ system calls. With a few memory and performance optimizations, MCQC combines verifiability with memory and runtime performance. By handling effectful and pure functions separately, MCQC can generate executable verified code directly from Gallina, reducing the effort of implementing and executing verified systems.
@inproceedings{mcqc:nfm19, title = {{MCQC}: Extracting and optimizing formally verified code for systems programming}, author = {Eleftherios Ioannidis and Frans Kaashoek and Nickolai Zeldovich}, booktitle = {Proceedings of the 11th {A}nnual {NASA} {F}ormal {M}ethods {S}ymposium ({NFM})}, year = 2019, month = may, address = {Houston, TX}, }
Secure Multi-Party Computation (MPC) allows mutually distrusting parties to run joint computations without revealing private data. Current MPC algorithms scale poorly with data size, which makes MPC on "big data" prohibitively slow and inhibits its practical use.
Many relational analytics queries can maintain MPC's end-to-end security guarantee without using cryptographic MPC techniques for all operations. Conclave is a query compiler that accelerates such queries by transforming them into a combination of data-parallel, local cleartext processing and small MPC steps. When parties trust others with specific subsets of the data, Conclave applies new hybrid MPC-cleartext protocols to run additional steps outside of MPC and improve scalability further.
Our Conclave prototype generates code for cleartext processing in Python and Spark, and for secure MPC using the Sharemind and Obliv-C frameworks. Conclave scales to data sets between three and six orders of magnitude larger than state-of-the-art MPC frameworks support on their own. Thanks to its hybrid protocols and additional optimizations, Conclave also substantially outperforms SMCQL, the most similar existing system.
@inproceedings{conclave:eurosys19, title = {{Conclave}: secure multi-party computation on big data}, author = {Nikolaj Volgushev and Malte Schwarzkopf and Ben Getchell and Mayank Varia and Andrei Lapets and Azer Bestavros}, note = {An extended version is available as arxiv:1902.06288 [cs.CR]}, booktitle = {Proceedings of the ACM EuroSys Conference (EuroSys 2019)}, year = 2019, month = apr, address = {Dresden, Germany}, }
Noria, first presented at OSDI’18, is a new web application back-end that delivers the same fast reads as an in-memory cache in front of the database, but without the application having to manage the cache. Even better, Noria still accepts SQL queries and allows changes to the queries without extra effort, just like a database. Noria performs well: it serves up to 14M requests per second on a single server, and supports a 5x higher load than carefully hand-tuned queries issued to MySQL.
@article{noria:login19, title = {Noria: A New Take on Fast Web Application Backends}, author = {Jon Gjengset and Malte Schwarzkopf and Jonathan Behrens and Lara Timb{\'o} Ara{\'u}jo and Martin Ek and Eddie Kohler and M. Frans Kaashoek and Robert Morris}, year = 2019, pages = {17--21}, volume = 44, number = 1, month = mar, journal = {{{;login: The USENIX Magazine}}}, }
This article summarizes the main benefits and downsides of using a garbage-collected high-level language to build an operating system kernel with enough functionality to run Redis and Nginx with good performance.
@article{biscuit:login19, title = {The benefits and costs of writing a {POSIX} kernel in a high-level language}, author = {Cody Cutler and M. Frans Kaashoek and Robert T. Morris}, year = 2019, pages = {6--10}, volume = 44, number = 1, month = mar, journal = {{{;login: The USENIX Magazine}}}, }
Datacenter applications demand microsecond-scale tail latencies and high request rates from operating systems, and most applications handle loads that have high variance over multiple timescales. Achieving these goals in a CPU-efficient way is an open problem. Because of the high overheads of today’s kernels, the best available solution to achieve microsecond-scale latencies is kernel-bypass networking, which dedicates CPU cores to applications for spin-polling the network card. But this approach wastes CPU: even at modest average loads, one must dedicate enough cores for the peak expected load.
Shenango achieves comparable latencies to kernel-bypass networking but at far greater CPU efficiency. It reallocates cores across applications at very fine granularity—every 5 µs—enabling cycles unused by latency-sensitive applications to be used productively by batch processing applications. It achieves such fast reallocation rates with (1) an efficient algorithm that detects when applications would benefit from more cores, and (2) a privileged component called the IOKernel that runs on a dedicated core, steering packets from the NIC and orchestrating core reallocations. When handling latency-sensitive applications, such as memcached, we found that Shenango achieves tail latency and throughput comparable to ZygOS, a state-of-the-art, kernel-bypass network stack, but can linearly trade latency-sensitive application throughput for batch processing application throughput, vastly increasing CPU efficiency.
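A hedged sketch of the core-reallocation loop described above: a dedicated component periodically inspects per-application congestion signals (such as queued packets or runnable threads) and moves cores between latency-sensitive and batch applications. The detection heuristic, names, and data structures are illustrative, not Shenango's IOKernel implementation.

// A toy core-reallocation loop: grant a core to a congested application
// and reclaim cores from idle ones, every few microseconds.
package main

import (
	"fmt"
	"time"
)

type app struct {
	name          string
	queuedPackets int // packets waiting in the app's receive queue
	runnable      int // threads waiting for a core
	cores         int // cores currently granted
}

// congested is a stand-in for the detection algorithm: an app wants
// another core if work has been left queued since the last check.
func congested(a *app) bool { return a.queuedPackets > 0 || a.runnable > 0 }

func reallocate(apps []*app, spareCores int) int {
	for _, a := range apps {
		if congested(a) && spareCores > 0 {
			a.cores++
			spareCores--
		} else if !congested(a) && a.cores > 1 {
			a.cores-- // return an idle core to the shared pool
			spareCores++
		}
	}
	return spareCores
}

func main() {
	apps := []*app{
		{name: "memcached", queuedPackets: 3, cores: 1},
		{name: "batch-analytics", cores: 4},
	}
	spare := 2
	for tick := 0; tick < 3; tick++ {
		spare = reallocate(apps, spare)
		apps[0].queuedPackets = 0 // the granted core drains the queue
		for _, a := range apps {
			fmt.Printf("tick %d: %s has %d cores\n", tick, a.name, a.cores)
		}
		time.Sleep(5 * time.Microsecond) // reallocation interval from the paper
	}
}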
@inproceedings{shenango:nsdi19, title = {Shenango: Achieving High {CPU} Efficiency for Latency-sensitive Datacenter Workloads}, author = {Amy Ousterhout and Joshua Fried and Jonathan Behrens and Adam Belay and Hari Balakrishnan}, booktitle = {Proceedings of the 16th {USENIX} {S}ymposium on {N}etworked {S}ystems {D}esign and {I}mplementation ({NSDI} '19)}, year = 2019, month = feb, address = {Boston, MA}, }
@mastersthesis{elefthei-meng, title = {Extracting and Optimizing Low-Level Bytecode from High-level Verified {Coq}}, author = {Eleftherios Ioannidis}, school = {Massachusetts Institute of Technology}, year = 2019, month = feb, }
We introduce partially-stateful data-flow, a new streaming data-flow model that supports eviction and reconstruction of data-flow state on demand. By avoiding state explosion and supporting live changes to the data-flow graph, this model makes data-flow viable for building long-lived, low-latency applications, such as web applications. Our implementation, Noria, simplifies the back-end infrastructure for read-heavy web applications while improving their performance.
A Noria application supplies a relational schema and a set of parameterized queries, which Noria compiles into a data-flow program that pre-computes results for reads and incrementally applies writes. Noria makes it easy to write high-performance applications without manual performance tuning or complex-to-maintain caching layers. Partial statefulness helps Noria limit its in-memory state without prior data-flow systems' restriction to windowed state, and helps Noria adapt its data-flow to schema and query changes while on-line. Unlike prior data-flow systems, Noria also shares state and computation across related queries, eliminating duplicate work.
On a real web application's queries, our prototype scales to 5x higher load than a hand-optimized MySQL baseline. Noria also outperforms a typical MySQL/memcached stack and the materialized views of a commercial database. It scales to tens of millions of reads and millions of writes per second over multiple servers, outperforming a state-of-the-art streaming data-flow system.
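The following sketch illustrates the partially-stateful model in miniature: a dataflow node materializes only some keys, a read miss triggers recomputation of just that key from upstream, and entries can be evicted to bound memory. It is an illustration of the abstraction, not Noria's implementation.

// A toy partially-stateful node: reads hit cached results when present,
// and otherwise recompute a single key on demand from upstream state.
package main

import "fmt"

type node struct {
	state   map[string]int       // partially materialized results
	upquery func(key string) int // recompute one key from upstream
}

func (n *node) read(key string) int {
	if v, ok := n.state[key]; ok {
		return v // hit: served from memory, like a cache
	}
	v := n.upquery(key) // miss: recompute just this key on demand
	n.state[key] = v
	return v
}

func (n *node) evict(key string) { delete(n.state, key) } // bounds memory use

func main() {
	base := map[string][]int{"story-1": {5, 3, 4}} // e.g., votes per story
	counts := &node{
		state: map[string]int{},
		upquery: func(key string) int {
			sum := 0
			for _, v := range base[key] {
				sum += v
			}
			return sum
		},
	}
	fmt.Println(counts.read("story-1")) // miss, then filled: 12
	fmt.Println(counts.read("story-1")) // hit
	counts.evict("story-1")
	fmt.Println(counts.read("story-1")) // miss again, recomputed
}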
@inproceedings{noria:osdi18, title = {Noria: dynamic, partially-stateful data-flow for high-performance web applications}, author = {Jon Gjengset and Malte Schwarzkopf and Jonathan Behrens and Lara Timb{\'o} Ara{\'u}jo and Martin Ek and Eddie Kohler and M. Frans Kaashoek and Robert Morris}, year = 2018, pages = {213--231}, booktitle = {Proceedings of the 13th {USENIX} {S}ymposium on {O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} '18)}, month = oct, address = {Carlsbad, California}, }
Karaoke is a system for low-latency metadata-private communication. Karaoke provides differential privacy guarantees, and scales better with the number of users than prior such systems (Vuvuzela and Stadium). Karaoke achieves high performance by addressing two challenges faced by prior systems. The first is that differential privacy requires continuously adding noise messages, which leads to high overheads. Karaoke avoids this using optimistic indistinguishability: in the common case, Karaoke reveals no information to the adversary, and Karaoke clients can detect precisely when information may be revealed (thus requiring less noise). The second challenge lies in generating sufficient noise in a distributed system where some nodes may be malicious. Prior work either required each server to generate enough noise on its own, or used expensive verifiable shuffles to prevent any message loss. Karaoke achieves high performance using efficient noise verification, generating noise across many servers and using Bloom filters to efficiently check if any noise messages have been discarded. These techniques allow our prototype of Karaoke to achieve a latency of 6.8 seconds for 2M users. Overall, Karaoke’s latency is 5x to 10x better than Vuvuzela and Stadium.
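A minimal sketch of the noise-verification idea: a server remembers the identifiers of the noise messages it injected and, after the round, checks a Bloom filter built over the delivered messages to detect whether any of its noise was dropped. The filter parameters and message format are toy choices, not Karaoke's protocol.

// A toy Bloom filter used to check that injected noise messages were
// not discarded during a round.
package main

import (
	"fmt"
	"hash/fnv"
)

type bloom struct{ bits []bool }

func newBloom(m int) *bloom { return &bloom{bits: make([]bool, m)} }

// positions derives two cheap bit positions from an FNV hash (toy parameters).
func (b *bloom) positions(s string) [2]int {
	h := fnv.New64a()
	h.Write([]byte(s))
	v := h.Sum64()
	m := uint64(len(b.bits))
	return [2]int{int(v % m), int((v >> 32) % m)}
}

func (b *bloom) add(s string) {
	for _, p := range b.positions(s) {
		b.bits[p] = true
	}
}

func (b *bloom) mayContain(s string) bool {
	for _, p := range b.positions(s) {
		if !b.bits[p] {
			return false // definitely missing: this message was dropped
		}
	}
	return true
}

func main() {
	delivered := []string{"noise-a", "noise-b", "real-1", "real-2"}
	filter := newBloom(1024)
	for _, id := range delivered {
		filter.add(id)
	}
	// The server that generated these noise messages verifies them.
	for _, id := range []string{"noise-a", "noise-b", "noise-c"} {
		fmt.Printf("%s present? %v\n", id, filter.mayContain(id))
	}
}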
@inproceedings{karaoke:osdi18, title = {Karaoke: Distributed Private Messaging Immune to Passive Traffic Analysis}, author = {David Lazar and Yossi Gilad and Nickolai Zeldovich}, booktitle = {Proceedings of the 13th {USENIX} {S}ymposium on {O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} '18)}, year = 2018, month = oct, address = {Carlsbad, California}, }
SFSCQ is the first file system with a machine-checked proof of security. To develop, specify, and prove SFSCQ, this paper introduces DiskSec, a novel approach for reasoning about confidentiality of storage systems, such as a file system. DiskSec addresses the challenge of specifying confidentiality using the notion of data noninterference to find a middle ground between strong and precise information-flow-control guarantees and the weaker but more practical discretionary access control. DiskSec factors out reasoning about confidentiality from other properties (such as functional correctness) using a notion of sealed blocks. Sealed blocks enforce that the file system treats confidential file blocks as opaque in the bulk of the code, greatly reducing the effort of proving data noninterference. An evaluation of SFSCQ shows that its theorems preclude security bugs that have been found in real file systems, that DiskSec imposes little performance overhead, and that SFSCQ's incremental development effort, on top of DiskSec and DFSCQ, on which it is based, is moderate.
@inproceedings{disksec:osdi18, title = {Proving confidentiality in a file system using {DiskSec}}, author = {Atalay Ileri and Tej Chajed and Adam Chlipala and M. Frans Kaashoek and Nickolai Zeldovich}, booktitle = {Proceedings of the 13th {USENIX} {S}ymposium on {O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} '18)}, year = 2018, month = oct, address = {Carlsbad, California}, }
Writing concurrent systems software is error-prone, because multiple processes or threads can interleave in many ways, and it is easy to forget about a subtle corner case. This paper introduces CSPEC, a framework for formal verification of concurrent software, which ensures that no corner cases are missed. The key challenge is to reduce the number of interleavings that developers must consider. CSPEC uses mover types to re-order commutative operations so that usually it's enough to reason about only sequential executions rather than all possible interleavings. CSPEC also makes proofs easier by making them modular using layers, and by providing a library of reusable proof patterns. To evaluate CSPEC, we implemented and proved the correctness of CMAIL, a simple concurrent Maildir-like mail server that speaks SMTP and POP3. The results demonstrate that CSPEC's movers and patterns allow reasoning about sophisticated concurrency styles in CMAIL.
@inproceedings{cspec:osdi18, title = {Verifying concurrent software using movers in {CSPEC}}, author = {Tej Chajed and M. Frans Kaashoek and Butler Lampson and Nickolai Zeldovich}, booktitle = {Proceedings of the 13th {USENIX} {S}ymposium on {O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} '18)}, year = 2018, month = oct, address = {Carlsbad, California}, }
This paper presents an evaluation of the use of a high-level language (HLL) with garbage collection to implement a monolithic POSIX-style kernel. The goal is to explore if it is reasonable to use an HLL instead of C for such kernels, by examining performance costs, implementation challenges, and programmability and safety benefits.
The paper contributes Biscuit, a kernel written in Go that implements enough of POSIX (virtual memory, mmap, TCP/IP sockets, a logging file system, poll, etc.) to execute significant applications. Biscuit makes liberal use of Go's HLL features (closures, channels, maps, interfaces, garbage collected heap allocation), which subjectively made programming easier. The most challenging puzzle was handling the possibility of running out of kernel heap memory; Biscuit benefited from the analyzability of Go source to address this challenge.
On a set of kernel-intensive benchmarks (including NGINX and Redis) the fraction of kernel CPU time Biscuit spends on HLL features (primarily garbage collection and thread stack expansion checks) ranges up to 13%. The longest single GC-related pause suffered by NGINX was 115 microseconds; the longest observed sum of GC delays to a complete NGINX client request was 600 microseconds. In experiments comparing nearly identical system call, page fault, and context switch code paths written in Go and C, the Go version was 5% to 15% slower.
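One way to picture the heap-exhaustion handling mentioned above is as a reserve-then-run pattern: decide up front whether a system call's worst-case heap needs can be satisfied, and make the thread wait (while memory is reclaimed) rather than fail partway through. The sketch below illustrates that pattern under assumed bounds and a placeholder reclamation step; it is not Biscuit's actual mechanism or code.

// A toy reserve-then-run kernel heap: a system call reserves its
// worst-case allocation before starting, waiting for reclamation if
// necessary, and returns whatever it did not use afterwards.
package main

import (
	"fmt"
	"sync"
)

type kernelHeap struct {
	mu   sync.Mutex
	free int // bytes currently available in the kernel heap
}

// reserve blocks (conceptually triggering reclamation) until the
// worst-case bound for the system call fits, then sets it aside.
func (h *kernelHeap) reserve(bound int) {
	h.mu.Lock()
	defer h.mu.Unlock()
	for h.free < bound {
		h.free += 1 << 20 // stand-in for evicting caches or killing a process
	}
	h.free -= bound
}

func (h *kernelHeap) release(unused int) {
	h.mu.Lock()
	h.free += unused
	h.mu.Unlock()
}

func main() {
	k := &kernelHeap{free: 4 << 20}
	const writeBound = 1 << 20 // assumed worst-case allocation for one call
	k.reserve(writeBound)      // wait until the whole call is safe to run
	used := 300 * 1024         // the call actually allocated less
	k.release(writeBound - used)
	fmt.Println("free bytes after call:", k.free)
}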
@inproceedings{biscuit:osdi18, title = {The benefits and costs of writing a {POSIX} kernel in a high-level language}, author = {Cody Cutler and M. Frans Kaashoek and Robert T. Morris}, year = 2018, pages = {89--105}, booktitle = {Proceedings of the 13th {USENIX} {S}ymposium on {O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} '18)}, month = oct, address = {Carlsbad, California}, }
This paper explains a flaw in the published proof of the Scalable Commutativity Rule (SCR), presents a revised and formally verified proof of the SCR in the Coq proof assistant, and discusses the insights and open questions raised from our experience proving the SCR.
@techreport{tsai-scr-proof, title = {A Revised and Verified Proof of the Scalable Commutativity Rule}, author = {Lillian Tsai and Eddie Kohler and M. Frans Kaashoek and Nickolai Zeldovich}, year = 2018, month = sep, institution = {{MIT} Computer Science and Artificial Intelligence Laboratory}, howpublished = {arXiv:1809.09550}, }
Most modern web applications authenticate users and enforce security policies in the application logic. Therefore, buggy applications can easily leak sensitive data. MultiverseDB addresses this problem in a new database architecture, where each user has her own private view of the database and declarative security policies restrict data-flow into a user’s private universe.
To support multi-user universes, MultiverseDB builds on ideas of streaming data-flow systems and low-overhead materialized views. When a new user session starts, the system creates data-flow nodes to support the user queries and automatically inserts special nodes to enforce security policies at universe boundaries. MultiverseDB provides fast reads by storing the pre-computed results of user queries with policies already applied in incrementally maintained materialized views. To reduce space overheads created by these views and avoid redundant processing, MultiverseDB reuses views and allows system administrators to specify security groups for users subjected to the same security policies.
@mastersthesis{larat-meng, title = {Multiverse Databases for Secure Web Applications}, author = {Lara Timb\'{o} Ara\'{u}jo}, school = {Massachusetts Institute of Technology}, year = 2018, month = feb, }
All popular web browsers offer a “private browsing mode.” After a private session terminates, the browser is supposed to remove client-side evidence that the session occurred. Unfortunately, browsers still leak information through the file system, the browser cache, the DNS cache, and on-disk reflections of RAM such as the swap file.
Veil is a new deployment framework that allows web developers to prevent these information leaks, or at least reduce their likelihood. Veil leverages the fact that, even though developers do not control the client-side browser implementation, developers do control 1) the content that is sent to those browsers, and 2) the servers which deliver that content. Veil web sites collectively store their content on Veil’s blinding servers instead of on individual, site-specific servers. To publish a new page, developers pass their HTML, CSS, and JavaScript files to Veil’s compiler; the compiler transforms the URLs in the content so that, when the page loads on a user’s browser, URLs are derived from a secret user key. The blinding service and the Veil page exchange encrypted data that is also protected by the user’s key. The result is that Veil pages can safely store encrypted content in the browser cache; furthermore, the URLs exposed to system interfaces like the DNS cache are unintelligible to attackers who do not possess the user’s key. To protect against post-session inspection of swap file artifacts, Veil uses heap walking (which minimizes the likelihood that secret data is paged out), content mutation (which garbles in-memory artifacts if they do get swapped out), and DOM hiding (which prevents the browser from learning site-specific HTML, CSS, and JavaScript content in the first place). Veil pages load on unmodified commodity browsers, allowing developers to provide stronger semantics for private browsing without forcing users to install or reconfigure their machines. Veil provides these guarantees even if the user does not visit a page using a browser’s native privacy mode; indeed, Veil’s protections are stronger than what the browser alone can provide.
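A hedged sketch of the URL-blinding step: resource names fetched from the blinding servers are derived from the original URL and a per-user secret, so whatever lands in the browser or DNS cache is meaningless without the key. The HMAC-based derivation and the server name below are illustrative assumptions, not Veil's exact scheme.

// A toy derivation of a blinded resource name from a per-user key.
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// blindURL maps a page resource to an unintelligible name under userKey.
func blindURL(userKey []byte, originalURL string) string {
	mac := hmac.New(sha256.New, userKey)
	mac.Write([]byte(originalURL))
	return "https://blinding.example/" + hex.EncodeToString(mac.Sum(nil))
}

func main() {
	key := []byte("per-user secret key")
	fmt.Println(blindURL(key, "https://site.example/inbox.html"))
	// Without the key, the cached name reveals nothing about the page.
}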
@inproceedings{veil:ndss18, title = {Veil: Private Browsing Semantics Without Browser-side Assistance}, author = {Frank Wang and James Mickens and Nickolai Zeldovich}, booktitle = {Proceedings of the {N}etwork and {D}istributed {S}ystem {S}ecurity {S}ymposium ({NDSS} '18)}, year = 2018, month = feb, address = {San Diego, California}, }
Algorand is a new cryptocurrency that confirms transactions with latency on the order of a minute while scaling to many users. Algorand ensures that users never have divergent views of confirmed transactions, even if some of the users are malicious and the network is temporarily partitioned. In contrast, existing cryptocurrencies allow for temporary forks and therefore require a long time, on the order of an hour, to confirm transactions with high confidence.
Algorand uses a new Byzantine Agreement (BA) protocol to reach consensus among users on the next set of transactions. To scale the consensus to many users, Algorand uses a novel mechanism based on Verifiable Random Functions that allows users to privately check whether they are selected to participate in the BA to agree on the next set of transactions, and to include a proof of their selection in their network messages. In Algorand's BA protocol, users do not keep any private state except for their private keys, which allows Algorand to replace participants immediately after they send a message. This mitigates targeted attacks on chosen participants after their identity is revealed.
We implement Algorand and evaluate its performance on 1,000 EC2 virtual machines, simulating up to 500,000 users. Experimental results show that Algorand confirms transactions in under a minute, achieves 125× Bitcoin's throughput, and incurs almost no penalty for scaling to more users.
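A minimal sketch of the private-selection check: each user evaluates a verifiable random function (VRF) over the round seed with their secret key and is selected when the output, interpreted as a number in [0, 1), falls below a threshold tied to their stake. The HMAC below merely stands in for a real VRF (it is not publicly verifiable), and the threshold arithmetic is simplified relative to Algorand's sortition.

// A toy, private selection check: pseudorandom output below a
// stake-weighted threshold means "selected for this round".
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// selected reports whether this user is in the committee for the round.
func selected(secretKey []byte, seed string, stakeFraction float64) bool {
	mac := hmac.New(sha256.New, secretKey) // stand-in for a VRF evaluation
	mac.Write([]byte(seed))
	out := mac.Sum(nil)
	// Interpret the first 8 bytes as a number in [0, 1).
	x := float64(binary.BigEndian.Uint64(out[:8])) / float64(^uint64(0))
	const committeeFraction = 0.2 // toy target fraction of stake per round
	return x < stakeFraction*committeeFraction
}

func main() {
	users := map[string]float64{"alice": 0.30, "bob": 0.05, "carol": 0.65}
	for name, stake := range users {
		ok := selected([]byte("sk-"+name), "round-42-seed", stake)
		fmt.Printf("%s selected for round 42: %v\n", name, ok)
	}
}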
@inproceedings{algorand:sosp17, title = {Algorand: Scaling Byzantine Agreements for Cryptocurrencies}, author = {Yossi Gilad and Rotem Hemo and Silvio Micali and Georgios Vlachos and Nickolai Zeldovich}, pages = {51--68}, booktitle = {Proceedings of the 26th ACM Symposium on Operating Systems Principles (SOSP 2017)}, year = 2017, month = oct, address = {Shanghai, China}, }
Private communication over the Internet remains a challenging problem. Even if messages are encrypted, it is hard to deliver them without revealing metadata about which pairs of users are communicating. Scalable anonymity systems, such as Tor, are susceptible to traffic analysis attacks that leak metadata. In contrast, the largest-scale systems with metadata privacy require passing all messages through a small number of providers, requiring a high operational cost for each provider and limiting their deployability in practice.
This paper presents Stadium, a point-to-point messaging system that provides metadata and data privacy while scaling its work efficiently across hundreds of low-cost providers operated by different organizations. Much like Vuvuzela, the current largest-scale metadata-private system, Stadium achieves its provable guarantees through differential privacy and the addition of noisy cover traffic. The key challenge in Stadium is limiting the information revealed from the many observable traffic links of a highly distributed system, without requiring an overwhelming amount of noise. To solve this challenge, Stadium introduces techniques for distributed noise generation and differentially private routing as well as a verifiable parallel mixnet design where the servers collaboratively check that others follow the protocol. We show that Stadium can scale to support 4× more users than Vuvuzela using servers that cost an order of magnitude less to operate than Vuvuzela nodes.
@inproceedings{stadium:sosp17, title = {Stadium: A Distributed Metadata-Private Messaging System}, author = {Nirvan Tyagi and Yossi Gilad and Derek Leung and Matei Zaharia and Nickolai Zeldovich}, pages = {423--440}, booktitle = {Proceedings of the 26th ACM Symposium on Operating Systems Principles (SOSP 2017)}, year = 2017, month = oct, address = {Shanghai, China}, }
DFSCQ is the first file system that (1) provides a precise specification for fsync and fdatasync, which allow applications to achieve high performance and crash safety, and (2) provides a machine-checked proof that its implementation meets this specification. DFSCQ's specification captures the behavior of sophisticated optimizations, including log-bypass writes, and DFSCQ's proof rules out some of the common bugs in file-system implementations despite the complex optimizations.
The key challenge in building DFSCQ is to write a specification for the file system and its internal implementation without exposing internal file-system details. DFSCQ introduces a metadata-prefix specification that captures the properties of fsync and fdatasync, which roughly follows the behavior of Linux ext4. This specification uses a notion of tree sequences—logical sequences of file-system tree states—for succinct description of the possible states after a crash and to describe how data writes can be reordered with respect to metadata updates. This helps application developers prove the crash safety of their own applications, avoiding application-level bugs such as forgetting to invoke fsync on both the file and the containing directory.
An evaluation shows that DFSCQ achieves 103 MB/s on large file writes to an SSD and durably creates small files at a rate of 1,618 files per second. This is slower than Linux ext4 (which achieves 295 MB/s for large file writes and 4,977 files/s for small file creation) but much faster than two recent verified file systems, Yggdrasil and FSCQ. Evaluation results from application-level benchmarks, including TPC-C on SQLite, mirror these microbenchmarks.
@inproceedings{dfscq:sosp17, title = {Verifying a high-performance crash-safe file system using a tree specification}, author = {Haogang Chen and Tej Chajed and Alex Konradi and Stephanie Wang and Atalay Ileri and Adam Chlipala and M. Frans Kaashoek and Nickolai Zeldovich}, booktitle = {Proceedings of the 26th ACM Symposium on Operating Systems Principles (SOSP 2017)}, year = 2017, month = oct, address = {Shanghai, China}, }
It is challenging to simultaneously achieve multicore scalability and high disk throughput in a file system. For example, even for commutative operations like creating different files in the same directory, current file systems introduce cache-line conflicts when updating an in-memory copy of the on-disk directory block, which limits scalability.
ScaleFS is a novel file system design that decouples the in-memory file system from the on-disk file system using per-core operation logs. This design facilitates the use of highly concurrent data structures for the in-memory representation, which allows commutative operations to proceed without cache conflicts and hence scale perfectly. ScaleFS logs operations in a per-core log so that it can delay propagating updates to the disk representation (and the cache-line conflicts involved in doing so) until an fsync. The fsync call merges the per-core logs and applies the operations to disk. ScaleFS uses several techniques to perform the merge correctly while achieving good performance: timestamped linearization points to order updates without introducing cache-line conflicts, absorption of logged operations, and dependency tracking across operations.
Experiments with a prototype of ScaleFS show that its implementation has no cache conflicts for 99% of test cases of commutative operations generated by Commuter, scales well on an 80-core machine, and provides on-disk performance that is comparable to that of Linux ext4.
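A hedged sketch of the per-core operation log design: each core appends logical operations to its own log, tagged with a timestamp taken at the operation's linearization point, and fsync merges the logs in timestamp order before applying them to the on-disk representation. Absorption, dependency tracking, and the real timestamp source are omitted.

// Toy per-core operation logs merged at fsync in timestamp order.
package main

import (
	"fmt"
	"sort"
	"sync/atomic"
)

type op struct {
	ts   uint64 // timestamp from the operation's linearization point
	desc string // the logical operation to apply to the on-disk layout
}

var clock uint64 // stand-in for a synchronized timestamp source (e.g., a cycle counter)

type percoreLogs [][]op

func (l percoreLogs) record(core int, desc string) {
	ts := atomic.AddUint64(&clock, 1)
	l[core] = append(l[core], op{ts: ts, desc: desc})
}

// fsync merges the per-core logs by timestamp and "applies" them.
func (l percoreLogs) fsync() {
	var merged []op
	for _, log := range l {
		merged = append(merged, log...)
	}
	sort.Slice(merged, func(i, j int) bool { return merged[i].ts < merged[j].ts })
	for _, o := range merged {
		fmt.Printf("apply @%d: %s\n", o.ts, o.desc)
	}
}

func main() {
	logs := make(percoreLogs, 2)
	logs.record(0, "create d/a") // commutative creates on different cores:
	logs.record(1, "create d/b") // no shared state touched until fsync
	logs.record(0, "write  d/a")
	logs.fsync()
}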
@inproceedings{scalefs:sosp17, title = {Scaling a file system to many cores using an operation log}, author = {Srivatsa S. Bhat and Rasha Eqbal and Austin T. Clements and M. Frans Kaashoek and Nickolai Zeldovich}, booktitle = {Proceedings of the 26th ACM Symposium on Operating Systems Principles (SOSP 2017)}, year = 2017, month = oct, address = {Shanghai, China}, }
As more of the world moves towards online technologies, users are exposed to the increasing threat of cyberattacks. Studies show that most of these attacks begin with a phishing attack. Phishing emails and websites may compromise user credentials or download unsolicited and malicious software.
This thesis presents the design and implementation of Quboid, a workstation for safer web interaction. Quboid helps users better defend against phishing attacks by providing several security mechanisms. The design of Quboid is based on the principle of isolation and restricted communication. The system enforces isolation through virtualization, confining browser instances so that different websites are displayed in separate virtual machines. For example, Quboid isolates a user’s bank website and social networking website in separate VMs. It uses deep-packet inspection to implement an HTTP/HTTPS proxy filter that ensures virtual machines communicate only with specific web servers. It also provides users with a secure interface and cues that help them recognize phishing attacks.
@mastersthesis{amol-meng, title = {{Quboid}: A workstation for safer Web interaction}, author = {Amol M. Bhave}, school = {Massachusetts Institute of Technology}, year = 2017, month = sep, }
Developing software that scales on multicore processors is an inexact science dominated by guesswork, measurement, and expensive cycles of redesign and reimplementation. Current approaches are workload-driven and, hence, can reveal scalability bottlenecks only for known workloads and available software and hardware. This paper introduces an interface-driven approach to building scalable software. This approach is based on the scalable commutativity rule, which, informally stated, says that whenever interface operations commute, they can be implemented in a way that scales. We formalize this rule and prove it correct for any machine on which conflict-free operations scale, such as current cache-coherent multicore machines. The rule also enables a better design process for scalable software: programmers can now reason about scalability from the earliest stages of interface definition through software design, implementation, and evaluation.
@article{commutativity:cacm, title = {The Scalable Commutativity Rule: Designing Scalable Software for Multicore Processors}, author = {Austin T. Clements and M. Frans Kaashoek and Nickolai Zeldovich and Robert T. Morris and Eddie Kohler}, journal = {CACM}, volume = 60, number = 8, month = aug, year = 2017, }
Formal verification of software has become a powerful tool for creating software systems and proving their correctness. While such systems provide strong guarantees about their behavior, they frequently exhibit poor performance relative to their unverified counterparts. Verified file systems are not excepted, and their poor performance limits their utility. These limitations, however, are not intrinsic to verification techniques, but are the result of designing for proofs, not performance.
This thesis proposes a design for large files and in-memory caches that are amenable both to a high-performance implementation and proofs of correctness. It then describes their usage in VDFS, a verified high-performance file system with deferred durability guarantees. The evaluation of VDFS’ performance shows that these additions measurably improve performance over previous verified file systems, and make VDFS competitive with unverified file system implementations. This thesis contributes proven implementation techniques for large files and in-memory caches that can be applied to increase performance of verified systems.
@mastersthesis{akonradi-meng, title = {Performance {Optimization} of the {VDFS} {Verified} {File} {System}}, author = {Alex Konradi}, school = {Massachusetts Institute of Technology}, year = 2017, month = jun, }
Over the last decade, systems software verification has become increasingly practical. Many verified systems have been written in the language of a proof assistant, proved correct, and then made runnable using code extraction. However, due to the rigidity of extraction and the overhead of the target languages, the resulting code’s CPU performance can suffer, with limited opportunity for optimization. This thesis contributes CoqGo, a proof-producing compiler from Coq's Gallina language to Go. We created Go′, a stylized semantics of Go that enforces linearity, and implemented proof-producing compilation tactics from Gallina to Go′, plus a straightforward translation from Go′ to Go. Applying a prototype of CoqGo, we compiled a system call in the FSCQ file system, with minimal changes to FSCQ's source code. Taking advantage of the increased control given by CoqGo, we implemented three optimizations, making the compiled system call 19% faster in CPU performance than the extracted version.
@mastersthesis{ziegler-meng, title = {Compiling {Gallina} to {Go} for the {FSCQ} File System}, author = {Daniel Ziegler}, school = {Massachusetts Institute of Technology}, year = 2017, month = jun, }
Implementing distributed systems correctly is difficult. Designing correct distributed systems protocols is challenging because designs must account for concurrent operation and handle network and machine failures. Implementing these protocols is challenging as well: it is difficult to avoid subtle bugs in implementations of complex protocols. Formal verification is a promising approach to ensuring distributed systems are free of bugs, but verification is challenging and time-consuming. Unfortunately, current approaches to mechanically verifying distributed systems in proof assistants using deductive verification do not allow for modular reasoning, which could greatly reduce the effort required to implement verified distributed systems by enabling reuse of code and proofs.
This thesis presents CoqIOA, a framework for reasoning about distributed systems in a compositional way. CoqIOA builds on the theory of input/output automata to support specification, proof, and composition of systems within the proof assistant. The framework's implementation of the theory of IO automata, including refinement, simulation relations, and composition, are all machine-checked in the Coq proof assistant. An evaluation of CoqIOA demonstrates that the framework enables compositional reasoning about distributed systems within the proof assistant.
@mastersthesis{aathalye-meng, title = {{CoqIOA}: A Formalization of {IO} {Automata} in the {Coq} {Proof} {Assistant}}, author = {Anish Athalye}, school = {Massachusetts Institute of Technology}, year = 2017, month = jun, }
It is challenging to simultaneously achieve multicore scalability and high disk throughput in a file system. For example, data structures that are on separate cache lines in memory (e.g., directory entries) are grouped together in a transaction log when the file system writes them to disk. This grouping results in cache line conflicts, thereby limiting scalability.
McoreFS is a novel file system design that decouples the in-memory file system from the on-disk file system using per-core operation logs. This design facilitates the use of highly concurrent data structures for the in-memory representation, which allows commutative operations to proceed without conflicts and hence scale perfectly. McoreFS logs operations in a per-core log so that it can delay propagating updates to the disk representation until an fsync. The fsync call merges the per-core logs and applies the operations to disk. McoreFS uses several techniques to perform the merge correctly while achieving good performance: timestamped linearization points to order updates without introducing cache line conflicts, absorption of logged operations, and dependency tracking across operations.
Experiments with a prototype of McoreFS show that its implementation is conflict-free for 99% of test cases involving commutative operations generated by Commuter, scales well on an 80-core machine, and provides disk performance that matches or exceeds that of Linux ext4.
@mastersthesis{srivatsa-sm, title = {Designing multicore scalable filesystems with durability and crash consistency}, author = {Srivatsa S. Bhat}, school = {Massachusetts Institute of Technology}, year = 2017, month = jun, }
FSCQ is the first file system with a machine-checkable proof that its implementation meets a specification, even in the presence of fail-stop crashes. FSCQ provably avoids bugs that have plagued previous file systems, such as performing disk writes without sufficient barriers or forgetting to zero out directory blocks. If a crash happens at an inopportune time, these bugs can lead to data loss. FSCQ’s theorems prove that, under any sequence of crashes followed by reboots, FSCQ will recover its state correctly without losing data.
To state FSCQ’s theorems, this paper introduces the Crash Hoare logic (CHL), which extends traditional Hoare logic with a crash condition, a recovery procedure, and logical address spaces for specifying disk states at different abstraction levels. CHL also reduces the proof effort for developers through proof automation. Using CHL, we developed, specified, and proved the correctness of the FSCQ file system. Although FSCQ’s design is relatively simple, experiments with FSCQ as a user-level file system show that it is sufficient to run Unix applications with usable performance. FSCQ’s specifications and proofs required significantly more work than the implementation, but the work was manageable even for a small team of a few researchers.
@article{fscq:cacm, title = {Certifying a File System Using {Crash Hoare Logic}: Correctness in the Presence of Crashes}, author = {Tej Chajed and Haogang Chen and Adam Chlipala and M. Frans Kaashoek and Nickolai Zeldovich and Daniel Ziegler}, journal = {CACM}, volume = 60, number = 4, month = apr, year = 2017, }
Many online services let users query public datasets such as maps, flight prices, or restaurant reviews. Unfortunately, the queries to these services reveal highly sensitive information that can compromise users’ privacy. This paper presents Splinter, a system that protects users’ queries on public data and scales to realistic applications. A user splits her query into multiple parts and sends each part to a different provider that holds a copy of the data. As long as any one of the providers is honest and does not collude with the others, the providers cannot determine the query. Splinter uses and extends a new cryptographic primitive called Function Secret Sharing (FSS) that makes it up to an order of magnitude more efficient than prior systems based on Private Information Retrieval and garbled circuits. We develop protocols extending FSS to new types of queries, such as MAX and TOPK queries. We also provide an optimized implementation of FSS using AES-NI instructions and multicores. Splinter achieves end-to-end latencies below 1.6 seconds for realistic workloads including a Yelp clone, flight search, and map routing.
@inproceedings{splinter:nsdi17, title = {Splinter: Practical Private Queries on Public Data}, author = {Frank Wang and Catherine Yun and Shafi Goldwasser and Vinod Vaikuntanathan and Matei Zaharia}, booktitle = {Proceedings of the 14th {USENIX} {S}ymposium on {N}etworked {S}ystems {D}esign and {I}mplementation ({NSDI} '17)}, year = 2017, month = mar, address = {Boston, MA}, }
Systems software is a good target for verification due to its prevalent usage and its complexity, which can lead to tricky bugs that are hard to test for. One source of complexity in systems software is concurrency, but thus far verification techniques have struggled to enable large-scale verification of concurrent systems. This thesis contributes a verified file system, CIO-FSCQ, with I/O concurrency: if a file system call experiences a miss in the buffer cache and starts a disk I/O, the file system overlaps the I/O with the execution of another file system call.
CIO-FSCQ re-uses the implementation, specifications, and proofs of an existing verified sequential file system, FSCQ, and turns it into an I/O-concurrent file system. This re-use is enabled by CIO-FSCQ's optimistic system calls. An optimistic system call runs sequentially if all the data it needs is in the buffer cache. If some data is not in the cache, CIO-FSCQ issues I/Os to retrieve the data from disk and returns an error code. In the miss case, a system call wrapper reverts any partial changes and yields the processor so that another system call can run in parallel with the I/O. CIO-FSCQ retries the system call later, at which point the data is likely in the buffer cache. A directory-isolation protocol guarantees that FSCQ's specifications and proofs can be re-used even if optimistic system calls are retried. An evaluation of CIO-FSCQ shows that it speeds up a simple file-system workload by overlapping disk I/O with computation, and that the effort of building and verifying CIO-FSCQ is small compared to the effort of verifying FSCQ.
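A minimal sketch of the optimistic-system-call pattern, with placeholder cache and disk interfaces: a call completes sequentially when the buffer cache is warm, and on a miss it issues the I/O, backs out, and is retried once the block has arrived.

// A toy optimistic system call: run sequentially on a cache hit, and
// otherwise issue the I/O, report a miss, and retry later.
package main

import (
	"errors"
	"fmt"
)

var errMiss = errors.New("buffer cache miss")

type fs struct {
	cache   map[int][]byte // buffer cache: block number -> contents
	disk    map[int][]byte // backing store
	pending []int          // block reads issued but not yet completed
}

// readBlock consults only the cache; on a miss it issues the disk read
// and reports the miss so the caller can revert and yield.
func (f *fs) readBlock(n int) ([]byte, error) {
	if b, ok := f.cache[n]; ok {
		return b, nil
	}
	f.pending = append(f.pending, n)
	return nil, errMiss
}

// completeIO models the disk finishing outstanding reads and filling
// the buffer cache; other system calls would run while these I/Os are
// in flight.
func (f *fs) completeIO() {
	for _, n := range f.pending {
		f.cache[n] = f.disk[n]
	}
	f.pending = nil
}

// statSize is an "optimistic" system call: sequential when the cache is
// warm, otherwise it backs out with errMiss and is retried.
func (f *fs) statSize(inodeBlock int) (int, error) {
	b, err := f.readBlock(inodeBlock)
	if err != nil {
		return 0, err // any partial changes would be reverted here
	}
	return len(b), nil
}

func main() {
	f := &fs{cache: map[int][]byte{}, disk: map[int][]byte{7: []byte("inode")}}
	for {
		if n, err := f.statSize(7); err == nil {
			fmt.Println("inode block size:", n)
			return
		}
		f.completeIO() // stand-in for yielding until the I/O finishes
	}
}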
@mastersthesis{tchajed-sm, title = {Verifying an {I/O}-Concurrent File System}, author = {Tej Chajed}, school = {Massachusetts Institute of Technology}, year = 2017, month = feb, }
In principle, the web should provide the perfect stage for user-generated content, allowing users to share their data seamlessly with other users across services and applications. In practice, the web fragments a user’s data over many sites, each exposing only limited APIs for sharing.
This paper describes Oort, a new cloud storage system that organizes data primarily by user rather than by application or web site. Oort allows users to choose which web software to use with their data and which other users to share it with, while giving applications powerful tools to query that data. Users rent space from providers that cooperate to provide a global, federated, general-purpose storage system. To support large-scale, multi-user applications such as Twitter and e-mail, Oort provides global queries that find and combine data from relevant users across all providers.
Oort makes global query execution efficient by recognizing and merging similar queries issued by many users’ application instances, largely eliminating the per-user factor in the global complexity of queries. Our evaluation predicts that an Oort implementation could handle traffic similar to that seen by Twitter using a hundred cooperating Oort servers, and that applications with other sharing patterns, like e-mail, can also be executed efficiently.
@techreport{oort:tr16, title = {{Oort}: User-Centric Cloud Storage with Global Queries}, author = {Tej Chajed and Jon Gjengset and M. Frans Kaashoek and James Mickens and Robert Morris and Nickolai Zeldovich}, institution = {{MIT} Computer Science and Artificial Intelligence Laboratory}, number = {MIT-CSAIL-TR-2016-015}, year = 2016, month = dec, }
Centralized datacenter schedulers can make high-quality placement decisions when scheduling tasks in a cluster. Today, however, high-quality placements come at the cost of high latency at scale, which degrades response time for interactive tasks and reduces cluster utilization.
This paper describes Firmament, a centralized scheduler that scales to over ten thousand machines at sub-second placement latency even though it continuously reschedules all tasks via a min-cost max-flow (MCMF) optimization. Firmament achieves low latency by using multiple MCMF algorithms, by solving the problem incrementally, and via problem-specific optimizations.
Experiments with a Google workload trace from a 12,500-machine cluster show that Firmament improves placement latency by 20× over Quincy [22], a prior centralized scheduler using the same MCMF optimization. Moreover, even though Firmament is centralized, it matches the placement latency of distributed schedulers for workloads of short tasks. Finally, Firmament exceeds the placement quality of four widely-used centralized and distributed schedulers on a real-world cluster, and hence improves batch task response time by 6x.
@inproceedings{firmament:osdi16, title = {Firmament: fast, centralized cluster scheduling at scale}, author = {Ionel Gog and Malte Schwarzkopf and Adam Gleave and Robert N. M. Watson and Steven Hand}, pages = {99--115}, booktitle = {Proceedings of the 12th {USENIX} {S}ymposium on {O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} '16)}, year = 2016, month = nov, address = {Savannah, Georgia}, }
Alpenhorn is the first system for initiating an encrypted connection between two users that provides strong privacy and forward secrecy guarantees for metadata (i.e., information about which users connected to each other) and that does not require out-of-band communication other than knowing the other user's Alpenhorn username (email address). This resolves a significant shortcoming in all prior works on private messaging, which assume an out-of-band key distribution mechanism.
Alpenhorn's design builds on three ideas. First, Alpenhorn provides each user with an address book of friends that the user can call to establish a connection. Second, when a user adds a friend for the first time, Alpenhorn ensures the adversary does not learn the friend's identity, by using identity-based encryption in a novel way to privately determine the friend's public key. Finally, when calling a friend, Alpenhorn ensures forward secrecy of metadata by storing pairwise shared secrets in friends' address books, and evolving them over time, using a new keywheel construction. Alpenhorn relies on a number of servers, but operates in an anytrust model, requiring just one of the servers to be honest.
We implemented a prototype of Alpenhorn, and integrated it into the Vuvuzela private messaging system (which did not previously provide privacy or forward secrecy of metadata when initiating conversations). Experimental results show that Alpenhorn can scale to many users, supporting 10 million users on three Alpenhorn servers with an average call latency of 150 seconds and a client bandwidth overhead of 3.7 KB/sec.
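A minimal sketch of the keywheel idea: two friends who share a secret both advance it one-way (here by hashing in the round number) as rounds pass, so each round has its own key and compromising the current state does not reveal keys for earlier rounds. Alpenhorn's actual derivation and how the keys are consumed are not shown.

// A toy keywheel: a pairwise secret that both friends roll forward
// one-way each round, discarding old states for forward secrecy.
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// advance moves the pairwise secret forward by one round (one-way).
func advance(secret [32]byte, round uint64) [32]byte {
	var buf [8]byte
	binary.BigEndian.PutUint64(buf[:], round)
	return sha256.Sum256(append(secret[:], buf[:]...))
}

func main() {
	shared := sha256.Sum256([]byte("secret established when the friend was added"))
	// Both friends independently roll the wheel forward each round and
	// discard old states.
	for round := uint64(1); round <= 3; round++ {
		shared = advance(shared, round)
		fmt.Printf("round %d key: %x...\n", round, shared[:4])
	}
}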
@inproceedings{alpenhorn:osdi16, title = {Alpenhorn: Bootstrapping Secure Communication without Leaking Metadata}, author = {David Lazar and Nickolai Zeldovich}, pages = {571--586}, booktitle = {Proceedings of the 12th {USENIX} {S}ymposium on {O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} '16)}, year = 2016, month = nov, address = {Savannah, Georgia}, }
File systems are a cornerstone for storing and retrieving permanent data, yet they are complex enough to have bugs that might cause data loss, especially in the face of system crashes.
FSCQ is the first file system that (1) provides a precise specification for a core subset of the POSIX file-system APIs, including fsync and fdatasync, which allow applications to achieve high I/O performance and crash safety, and (2) provides a machine-checked proof that its I/O-efficient implementation meets this precise specification. FSCQ’s proofs avoid crash-safety bugs that have plagued file systems, such as forgetting to insert a disk-write barrier between writing the data from the log and writing the log’s commit block. FSCQ’s specification also allows applications to prove their own crash safety, avoiding application-level bugs such as forgetting to invoke fsync on both the file and the containing directory. As a result, applications on FSCQ can provide strong guarantees: they will not lose data under any sequence of crashes.
To state FSCQ’s theorems, FSCQ introduces the Crash Hoare Logic (CHL), which extends traditional Hoare logic with a crash condition, a recovery procedure, and logical address spaces for specifying disk states at different abstraction levels. CHL also reduces the proof effort for developers through proof automation. Using CHL, the thesis developed, specified, and proved the correctness of the FSCQ file system. FSCQ introduces a metadata-prefix specification that captures the properties of fsync and fdatasync, based on Linux ext4’s behavior. FSCQ also introduces disk sequences and disk relations to help formalize the metadata-prefix specification. The evaluation shows that FSCQ enables end-to-end verification of application crash safety, and that FSCQ’s optimizations achieve I/O performance on par with that of Linux ext4.
@phdthesis{hchen-phd, title = {Certifying a Crash-safe File System}, author = {Haogang Chen}, school = {Massachusetts Institute of Technology}, year = 2016, month = sep, }
As more and more software is written every day, more and more bugs are introduced along with it. Formal verification is a way of using mathematical methods to prove that a program has no bugs. However, if formal verification is to see widespread use, it must be able to compete with unverified software in performance. Unfortunately, many of the optimizations that we take for granted in unverified software depend on assumptions that are difficult to verify. One such optimization is the use of data checksums in logging systems to improve I/O efficiency while still ensuring data integrity after a crash.
This thesis explores a novel method of modeling the probabilistic guarantees of a hash function. This method is then applied to the logging system underlying RapidFSCQ, a certified crash-safe filesystem, to support formally verified checksums. An evaluation of RapidFSCQ shows that it enables end-to-end verification of application and filesystem crash safety, and that RapidFSCQ's optimizations, including checksumming, achieve I/O performance on par with Linux ext4. Thus, this thesis contributes a formal model of hash function behavior with practical application to certified computer systems.
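The optimization at issue can be sketched as follows: instead of issuing a write barrier between the log payload and the commit block, the commit block carries a hash of the payload, and recovery replays the log only if the recomputed hash matches, so a torn write leaves the commit unverifiable and the transaction is discarded. The sketch below (using OpenSSL's SHA-256) illustrates only the idea; RapidFSCQ's verified implementation, and its formal treatment of the residual collision probability, are in the thesis.

/* Illustrative checksum-based log commit (not RapidFSCQ's code).
 * Requires OpenSSL; compile with -lcrypto. */
#include <openssl/sha.h>
#include <string.h>

struct commit_block {
    unsigned int nbytes;                        /* length of logged payload */
    unsigned char digest[SHA256_DIGEST_LENGTH]; /* hash over the payload */
};

/* Prepare a commit block for a payload; the caller then writes the payload
 * and the commit block to disk together, with no barrier in between. */
static void log_commit(const unsigned char *payload, unsigned int nbytes,
                       struct commit_block *cb) {
    cb->nbytes = nbytes;
    SHA256(payload, nbytes, cb->digest);
}

/* Recovery: replay the log only if the on-disk payload matches the
 * commit block's checksum. */
static int log_valid(const unsigned char *payload,
                     const struct commit_block *cb) {
    unsigned char d[SHA256_DIGEST_LENGTH];
    SHA256(payload, cb->nbytes, d);
    return memcmp(d, cb->digest, sizeof(d)) == 0;
}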
@mastersthesis{stephanie-meng, title = {Certifying Checksum-Based Logging in the {RapidFSCQ} Crash-Safe Filesystem}, author = {Stephanie Wang}, school = {Massachusetts Institute of Technology}, year = 2016, month = jun, }
This paper studies undefined behavior arising in systems programming languages such as C/C++. Undefined behavior bugs lead to unpredictable and subtle systems behavior, and their effects can be further amplified by compiler optimizations. Undefined behavior bugs are present in many systems, including the Linux kernel and the Postgres database. The consequences range from incorrect functionality to missing security checks.
This paper proposes a formal and practical approach that finds undefined behavior bugs by identifying “unstable code”: code that optimizing compilers may discard because its execution would involve undefined behavior. Using this approach, we introduce a new static checker called Stack that precisely identifies undefined behavior bugs. Applying Stack to widely used systems has uncovered 161 new bugs that have been confirmed and fixed by developers.
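A well-known pattern of unstable code is a bounds check written as a pointer-overflow test: because pointer arithmetic that overflows is undefined in C, an optimizing compiler may conclude the test can never be true and delete it. The fragment below is an illustrative example in that style, not code taken from the paper.

/* The programmer intends to reject requests where buf + len wraps past the
 * end of the address space, but forming an out-of-bounds pointer is
 * undefined behavior, so a compiler may assume it cannot happen and remove
 * the first check entirely. */
#include <stddef.h>

int access_ok(char *buf, char *buf_end, size_t len) {
    if (buf + len < buf)        /* unstable: may be optimized away */
        return 0;
    if (buf + len > buf_end)
        return 0;
    return 1;
}

/* A defined-behavior rewrite compares lengths instead of forming an
 * out-of-range pointer. */
int access_ok_fixed(char *buf, char *buf_end, size_t len) {
    return len <= (size_t)(buf_end - buf);
}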
@article{kstack:cacm, title = {A Differential Approach to Undefined Behavior Detection}, author = {Xi Wang and Nickolai Zeldovich and M. Frans Kaashoek and Armando Solar-Lezama}, journal = {Communications of the {ACM}}, volume = 59, number = 3, month = mar, year = 2016, }
Modern web services rob users of low-level control over cloud storage — a user's single logical data set is scattered across multiple storage silos whose access controls are set by web services, not users. The consequence is that users lack the ultimate authority to determine how their data is shared with other web services.
In this paper, we introduce Sieve, a new platform which selectively (and securely) exposes user data to web services. Sieve has a user-centric storage model: each user uploads encrypted data to a single cloud store, and by default, only the user knows the decryption keys. Given this storage model, Sieve defines an infrastructure to support rich, legacy web applications. Using attribute-based encryption, Sieve allows users to define intuitively understandable access policies that are cryptographically enforceable. Using key homomorphism, Sieve can reencrypt user data on storage providers in situ, revoking decryption keys from web services without revealing new keys to the storage provider. Using secret sharing and two-factor authentication, Sieve protects cryptographic secrets against the loss of user devices like smartphones and laptops. The result is that users can enjoy rich, legacy web applications, while benefiting from cryptographically strong controls over which data a web service can access.
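Of these mechanisms, the in-situ re-encryption is perhaps the least familiar. The toy below illustrates only the underlying key-homomorphism idea: a ciphertext masked under one key can be converted to a ciphertext under another key using just a “delta” key, so the storage provider never sees the plaintext or either full key. The group, parameters, and masking scheme here are purely illustrative; Sieve's actual construction uses key-homomorphic primitives over proper cryptographic groups.

/* Toy illustration of key-homomorphic re-keying (not Sieve's scheme). */
#include <stdint.h>
#include <stdio.h>

#define P 2147483647ULL   /* toy group modulus: the prime 2^31 - 1 */
#define G 7ULL            /* toy base element */
#define Q (P - 1)         /* exponents are reduced modulo the group order */

static uint64_t modpow(uint64_t b, uint64_t e, uint64_t m) {
    uint64_t r = 1;
    b %= m;
    while (e) {
        if (e & 1) r = r * b % m;
        b = b * b % m;
        e >>= 1;
    }
    return r;
}

int main(void) {
    uint64_t m = 123456789;              /* "plaintext" group element */
    uint64_t k = 271828, knew = 314159;  /* old and new user keys */

    uint64_t c = m * modpow(G, k, P) % P;       /* user encrypts under k */
    uint64_t delta = (Q + knew - k) % Q;        /* re-key token k' - k */
    uint64_t c2 = c * modpow(G, delta, P) % P;  /* provider re-encrypts in situ */

    /* Decrypting c2 under the new key recovers m (divide by g^k'). */
    uint64_t m2 = c2 * modpow(modpow(G, knew, P), P - 2, P) % P;
    printf("%llu %llu\n", (unsigned long long)m, (unsigned long long)m2);
    return 0;
}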
@inproceedings{sieve:nsdi16, title = {Sieve: Cryptographically Enforced Access Control for User Data in Untrusted Clouds}, author = {Frank Wang and James Mickens and Nickolai Zeldovich and Vinod Vaikuntanathan}, booktitle = {Proceedings of the 13th {USENIX} {S}ymposium on {N}etworked {S}ystems {D}esign and {I}mplementation ({NSDI} '16)}, year = 2016, month = mar, address = {Santa Clara, CA}, }
Private messaging over the Internet has proven challenging to implement, because even if message data is encrypted, it is difficult to hide metadata about who is communicating in the face of traffic analysis. Systems that offer strong privacy guarantees, such as Dissent, scale to only several thousand clients, because they use techniques with superlinear cost in the number of clients (e.g., each client broadcasts their message to all other clients). On the other hand, scalable systems, such as Tor, do not protect against traffic analysis, making them ineffective in an era of pervasive network monitoring.
Vuvuzela is a new scalable messaging system that offers strong privacy guarantees, hiding both message data and metadata. Vuvuzela is secure against adversaries that observe and tamper with all network traffic, and that control all nodes except for one server. Vuvuzela's key insight is to minimize the number of variables observable by an attacker, and to use differential privacy techniques to add noise to all observable variables in a way that provably hides information about which users are communicating. Vuvuzela has a linear cost in the number of clients, and experiments show that it can achieve a throughput of 68,000 messages per second for 1 million users with a 37-second end-to-end latency on commodity servers.
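The differential-privacy step can be pictured as adding Laplace-distributed noise to each observable count (for example, the number of dead-drop accesses revealed in a round) before it becomes visible; in Vuvuzela the noise takes the form of dummy accesses generated by the servers, and the distributions and parameters differ from this illustrative sketch.

/* Illustrative only: add Laplace(0, b) noise to an observable count so that
 * any one user's participation changes the released value's distribution
 * only slightly.  Compile with -lm. */
#include <math.h>
#include <stdlib.h>

/* Draw one sample from Laplace(0, b) by inverse-CDF sampling. */
static double laplace(double b) {
    /* uniform in (-0.5, 0.5), strictly excluding the endpoints */
    double u = ((double)rand() + 0.5) / ((double)RAND_MAX + 1.0) - 0.5;
    return -b * copysign(1.0, u) * log(1.0 - 2.0 * fabs(u));
}

/* Release a noised version of a true count; the scale b is chosen from the
 * desired privacy budget. */
static long noisy_count(long true_count, double b) {
    return true_count + (long)llround(laplace(b));
}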