Large distributed storage systems, such as the High Performance Computing (HPC) systems used by national and international laboratories, must deliver performance for demanding scientific workloads whose scale, complexity, and dynamism pose challenges. Ideally, data would be placed and network routes set to optimize performance, but those same challenges inhibit adaptive performance tuning at scale.
If data is not placed or routed with potential future access patterns in mind, the system risks having popular data stored on a slow file system, causing a steep drop in performance when access demand shifts to that data. Determining when and how to move data in anticipation of future demand spikes can prevent these steep drops, which we refer to as performance bottlenecks. We can leverage the fact that workloads shift over time, creating access patterns that can be used to forecast demand: past accesses reveal important information about future accesses.
An existing solution for determining where to move data is to use access logs to build system performance models that predict how the system reacts to fluctuations in demand. The drawback is that the search space of possible models is massive: the more locations where data can be stored, the larger the search space becomes. Additionally, applying a model to a system can erode the performance benefits of dynamically moving data; if the model is not set up correctly, its overhead may overshadow its benefit on the target system. Ideally, models created by system engineers place and route data to optimize performance, but workload shifts can be hard to predict using traditional methods such as heuristics.
This dissertation applies and enhances online learning to predict and accelerate the performance of data movements and accesses in large-scale storage systems. We focus on large scientific systems and networks such as CERN’s EOS, Pacific Northwest National Laboratory’s (PNNL) BlueSky system, and the Caltech Tier 2 system. These systems allow us to stress-test our models with real data and to collect information about how accesses are distributed and executed on real systems. From the collected information, we create models based on real workloads. We also reduce the training overhead of our models by identifying, and removing, features that may not add information to a model during training. We developed a methodology for selecting features over time that is resilient to the data collection noise introduced as new data arrives from the system.
Access logs created by system engineers often track hundreds of metrics describing a system's daily operation, and we must sift through these metrics to discover the ones most relevant to the modeling task at hand. To explore different feature selection techniques, we developed WinnowML, a system for automatically determining the most relevant feature subset for a specific modeling methodology when modeling system performance over time. We use WinnowML to determine which combination of existing techniques yields a selected feature subset that remains stable over time while keeping prediction error low. Using the resulting list of features, system analysts can determine which features to use when modeling an aspect of their system over time. WinnowML lowered the resulting mean absolute error by 13.6\% on average compared to the closest-performing approaches, such as L1-regularization and Principal Component Analysis (PCA).
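The core idea of selecting features that stay relevant over time can be illustrated with a minimal sketch. This is not WinnowML itself: it assumes a hypothetical ranking step (absolute correlation with the target per sliding window) in place of the combinations of techniques WinnowML explores, and scores each feature by how consistently it is selected across windows.

```python
import numpy as np

def select_stable_features(X, y, window=50, top_k=3, min_stability=0.8):
    """Keep features that are repeatedly selected across time windows.

    In each window, rank features by |correlation| with the target and
    keep the top_k; a feature's stability is the fraction of windows in
    which it was selected. (Illustrative sketch, not WinnowML itself.)
    """
    n_samples, n_features = X.shape
    counts = np.zeros(n_features)
    n_windows = 0
    for start in range(0, n_samples - window + 1, window):
        Xw, yw = X[start:start + window], y[start:start + window]
        corrs = np.array([abs(np.corrcoef(Xw[:, j], yw)[0, 1])
                          for j in range(n_features)])
        counts[np.argsort(corrs)[-top_k:]] += 1  # top_k features this window
        n_windows += 1
    stability = counts / n_windows
    return [j for j in range(n_features) if stability[j] >= min_stability]

# Synthetic example: features 0 and 1 drive the target; 2-4 are noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)
print(select_stable_features(X, y))
```

A noisy feature may top the ranking in one window, but rarely in most of them, so the stability threshold filters it out; this is the sense in which the selected subset resists collection noise.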
To optimize the placement of data, we developed Geomancy, a tool that models the placement of data within a distributed storage system and reacts to drops in performance. To optimize network routing decisions, we developed Diolkos, a tool that dynamically reroutes data flows in response to drops in performance. Using a combination of machine learning techniques suited to temporal modeling, Geomancy and Diolkos determine when and where a bottleneck may occur due to changing workloads and suggest changes to layout and routing decisions to mitigate or prevent it.
Using WinnowML to determine which features to use when training Geomancy, the data layouts that Geomancy predicted benefited storage systems by avoiding potential bottlenecks and increasing overall I/O throughput by 11\% to 30\%, freeing up resources that concurrently running workloads could use. We then tackled data transfer overhead in scientific networks. Several existing techniques reroute data within scientific networks to improve performance by leveraging latent parallel data transfer capabilities, but they require a central authority, which may add computational and network overhead to the system. We propose a decentralized rerouting technique that operates at the switch level. When we applied Diolkos to the Caltech Tier 2 network, our most accurate model, a dense model with one hidden layer trained on all the ports, increased switch throughput by up to 49\% compared to the best-performing heuristic approach, an exponentially weighted moving average, and by up to 28\% compared to a traditional controller.
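The heuristic baseline named above, the exponentially weighted moving average, can be sketched in a few lines. This is a minimal illustration of the general technique, not Diolkos or the actual baseline implementation; the per-port throughput samples are hypothetical.

```python
def ewma(series, alpha=0.3):
    """Exponentially weighted moving average: each new observation is
    blended with the running estimate, so recent samples dominate the
    forecast. A simple per-port throughput predictor (illustrative)."""
    est = series[0]
    out = [est]
    for x in series[1:]:
        est = alpha * x + (1 - alpha) * est
        out.append(est)
    return out

# Hypothetical throughput trace for one switch port (MB/s); the last
# EWMA value serves as the forecast for the next interval.
samples = [100, 120, 80, 110, 300]
print(ewma(samples)[-1])  # prints ~161.218
```

Because the estimate decays old samples geometrically, a sudden spike (300 MB/s above) moves the forecast only partway toward the new level, which is exactly why a learned model can outperform it when workloads shift abruptly.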