Building system software is a notoriously complex and arduous endeavor.
Developing tools and methodologies for practical system software engineering
has long been an active area of research. This thesis explores system software
development through the lens of a declarative, data-centric programming
language that can succinctly express high-level system specifications and be
directly compiled to executable code. By unifying specification and
implementation, our approach avoids the common problem of implementations
diverging from specifications over time. In addition, we show that using a
declarative language often results in drastic reductions in code size (100× and
more) relative to procedural languages like Java and C++. We demonstrate these
advantages by implementing a host of functionalities at various levels of the
system hierarchy, including network protocols, query optimizers, and scheduling
policies. In addition to providing a compact and optimized implementation, we
demonstrate that our declarative implementations often map very naturally to
traditional specifications: in many cases they are line-by-line translations of
published pseudcode.
We started this work with the hypothesis that declarative languages --
originally developed for the purposes of data management and querying -- could
be fruitfully adapted to the specification and implementation of core system
infrastructure. A similar argument had been made for networking protocols a
few years earlier [61]. However, our goals were quite different: we wanted to
explore a broader range of algorithms and functionalities (dynamic programming,
scheduling, program rewriting, and system auditing) that were part of complex,
real-world software systems. We identified two existing system components --
query optimizers in a DBMS and task schedulers in a cloud computing system --
that we felt would be better specified via a declarative language. Given our
interest in delivering real-world software, a key challenge was identifying the
right system boundary that would permit meaningful declarative implementations
to coexist within existing imperative system architectures. We found that
relations were a natural boundary for maintaining the ongoing system state on
which the imperative and declarative code was based, and provided an elegant
way to model system architectures.
This thesis explores the boundaries of declarative systems via two projects.
We begin with Evita Raced; an extensible compiler for the Overlog language used
in our declarative networking system, P2. Evita Raced is a metacompiler -- an
Overlog compiler written in Overlog -- that integrates seamlessly with the P2
dataflow architecture. We first describe the minimalist design of Evita Raced,
including its extensibility interfaces and its reuse of the P2 data model and
runtime engine. We then demonstrate that a declarative language like Overlog
is well-suited to expressing traditional and novel query optimizations as well
as other program manipulations, in a compact and natural fashion. Following
Evita Raced, we describe the initial work in BOOM Analytics, which began as a
large-scale experiment at building "cloud" software in a declarative language.
Specifically, we used the Overlog language to implement a "Big Data" analytics
stack that is API-compatible with the Hadoop MapReduce architecture and
provides comparable performance. We extended our declarative version of Hadoop
with complex distributed features that remain absent in the stock Hadoop Java
implementation, including alternative scheduling policies, online aggregation,
continuous queries, and unique monitoring and debugging facilities. We present
quantitative and anecdotal results from our experience, providing concrete
evidence that both data-centric design and declarative languages can
substantially simplify systems programming.