Scientific workflows contain an increasing number of interacting
applications, often with significant disparities between the formats
of data produced and consumed by different applications. This mismatch
can degrade performance, as data retrieval requires multiple read
operations (often to a remote storage system) in order to convert the
data. In recent years, with the large increase in the amount of data
and computational power available, there is growing demand for
applications to support data access in situ, or close to the
simulation, to provide application steering, analytics, and
visualization.
Although some parallel filesystems and middleware
libraries attempt to identify access patterns and optimize data
retrieval, they frequently fail when the patterns are complex. It is
evident that more knowledge of the structure of the datasets at the
storage-system level would provide many opportunities for further
performance improvements.
For most developers of scientific applications, storing the
application data, and its particular on-disk format, is not an
essential part of the application. Although they acknowledge the
importance of I/O performance, their expertise lies mostly in
numerical simulations and the particular models their applications
implement. Most of their effort is spent ensuring that the application
produces correct numerical results. Ideally, they would like a library
call that reads a subset of the data from storage (no matter what its
format is) and places it in the data structures the simulation defines
in memory. But because the data must also be analyzed, visualized, and
accessible from third-party tools, scientists are forced to learn more
about data formats than they would like.
In this dissertation we investigate multiple techniques for utilizing
dataset descriptions to improve performance and overall data
availability for HPC applications. We introduce a declarative data
description language that can be used to define a complete dataset
as well as parts of it. These descriptions are used to generate
transformation rules that allow data to be converted between different
physical layouts on storage and in memory.
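To make the idea of description-driven transformation rules concrete, the sketch below shows (in Python, not the actual DRepl syntax) how a single declarative field list can drive conversion between two common physical layouts: array-of-structs, as a consumer might hold records in memory, and struct-of-arrays, as a storage system might lay them out. The dataset and field names are illustrative assumptions, not taken from DRepl.

```python
# Hypothetical sketch (not DRepl syntax): one declarative description
# of the dataset's fields drives both directions of the transform.
DATASET = {"fields": ["temp", "pres"], "count": 4}

def to_soa(aos):
    """Transform array-of-structs (consumer view) into
    struct-of-arrays (storage view), using only the field list."""
    return {f: [rec[f] for rec in aos] for f in DATASET["fields"]}

def to_aos(soa):
    """Inverse transform: struct-of-arrays back to array-of-structs."""
    n = len(next(iter(soa.values())))
    return [{f: soa[f][i] for f in DATASET["fields"]} for i in range(n)]

aos = [{"temp": float(i), "pres": 100.0 + i} for i in range(DATASET["count"])]
soa = to_soa(aos)
assert to_aos(soa) == aos   # the round trip preserves the data
```

The point of the sketch is that neither transform hard-codes the layout of the other side; both are derived from the shared description, which is what lets producers and consumers diverge safely.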
First, we define the DRepl dataset description language and use it to
implement divergent data views and replicas as POSIX files. We
evaluate the performance of this approach and demonstrate its
advantages, both in its transparency to applications and in the
combined performance when the simulation runs alongside analytics
and/or visualization code that reads the data in a different format.
DRepl decouples the data producers and consumers and the data layouts
they use from the way the data is stored on the storage system.
DRepl shows up to a 2x improvement in cumulative performance when data
is accessed through optimized replicas.
Second, we extend the previous approach to the parallel environment
used in HPC. Instead of using POSIX files, the new method allows data
to be accessed in larger chunks (fragments) in the way it will be laid
out in memory. The developers can define what data structures they
have in the process' memory and the overall format of the dataset on
storage, and the runtime will automatically take care of transforming
the data between the two. Both the formats in memory and on disk are
described with the DRepl language. Replacing byte-array reads with
operations that use descriptions of the data structure provides better
opportunities for the storage system to optimize access to the
persistent data. The integration of this technique in Ceph
demonstrates the potential advantages of this approach. The
experiments show performance improvements of up to 5x for writes and
10x for reads, compared to collective MPI I/O.
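The following sketch illustrates, under assumed layouts, the kind of mapping such a runtime can perform: the process declares the fragment it holds in memory (a 2D block) and the global row-major layout on storage, and the contiguous storage ranges covering the fragment are computed from the descriptions, so the fragment can be gathered in a few large transfers rather than many small reads. The shapes and function names are hypothetical.

```python
# Illustrative sketch: map a declared in-memory fragment (a 2D block)
# onto contiguous element ranges in a row-major global storage layout.
GLOBAL = (4, 6)            # global dataset shape: (rows, cols)
FRAGMENT = (1, 2, 3, 3)    # block: start_row, start_col, n_rows, n_cols

def fragment_ranges(global_shape, frag):
    """Return one contiguous (start, end) element range per fragment
    row, computed purely from the two layout descriptions."""
    _, ncols = global_shape
    r0, c0, nr, nc = frag
    return [(r * ncols + c0, r * ncols + c0 + nc) for r in range(r0, r0 + nr)]

storage = list(range(GLOBAL[0] * GLOBAL[1]))   # stand-in for stored data
ranges = fragment_ranges(GLOBAL, FRAGMENT)
in_memory = [storage[lo:hi] for lo, hi in ranges]   # gather the fragment
```

Because both layouts are described declaratively, the same range computation can run inside the storage system (as in the Ceph integration), which is what opens the door to server-side optimization.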
Third, we explore future directions for extending the DRepl
language to support more complex datasets. The additions would allow
scientists to use different resolutions for different parts of a
multi-dimensional space, and to define how to transform the data
between resolutions. The changes would also allow completely abstract
definitions of datasets, not only for continua but also for
primitive types like real and integer numbers. The fragments of the
dataset that are present in memory or on disk would have concrete
types compatible with the abstract types used in the dataset.
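As a minimal, hypothetical illustration of an inter-resolution transform of the kind such a language extension might let scientists declare, the sketch below coarsens a fine-resolution 1D fragment by averaging adjacent pairs of samples; the specific rule (pairwise averaging) is an assumption for illustration only.

```python
# Hypothetical inter-resolution rule: halve the resolution of a 1D
# fragment by averaging each adjacent pair of samples.
def coarsen(fine):
    """Return the fragment at half resolution (pairwise averages)."""
    assert len(fine) % 2 == 0, "fragment length must be even"
    return [(fine[i] + fine[i + 1]) / 2 for i in range(0, len(fine), 2)]

fine = [1.0, 3.0, 5.0, 7.0]
coarse = coarsen(fine)   # -> [2.0, 6.0]
```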
Finally, we lay the foundations for extending this functionality to
the most complex data structures used in scientific applications --
unstructured meshes.