- Dong, Bin;
- Wang, Teng;
- Tang, Houjun;
- Koziol, Quincey;
- Wu, Kesheng;
- Byna, Suren
- Editor(s): Abe, Naoki;
- Liu, Huan;
- Pu, Calton;
- Hu, Xiaohua;
- Ahmed, Nesreen K;
- Qiao, Mu;
- Song, Yang;
- Kossmann, Donald;
- Liu, Bing;
- Lee, Kisung;
- Tang, Jiliang;
- He, Jingrui;
- Saltz, Jeffrey S
Scientific data analysis typically involves reading massive amounts of data generated by simulations, experiments, and observations. Reading such large volumes of data from disk-based file systems is often slow because of the mechanical components in the disks. Recent supercomputing systems add non-volatile storage layers in a hierarchy to bridge the performance gap between fast main memory and slow disk-based storage. Software libraries that manage this hierarchy need not only to read data efficiently but also to reduce user involvement in cross-layer data movement. Furthermore, because scientific data is often organized in array-based data structures, these libraries need to support array data access patterns in hierarchical storage management. Existing software typically manages storage layers individually, requiring significant manual effort to move data among them. In this paper, we introduce array caching in hierarchical storage (ARCHIE), a new approach that accelerates array data analysis in a seamless fashion. ARCHIE evaluates array access patterns and prefetches data with array semantics between storage layers. Our evaluation on a production supercomputing system shows that ARCHIE outperforms state-of-the-art file systems, namely Lustre and DataWarp, by up to 5.8× when scientific analysis applications access data.