Column Cache: Buffer Cache for Columnar Storage on HDFS
Abstract
Columnar storage is a common data source for data analytics in distributed computing frameworks. For portability and scalability, columnar storage is built on top of existing distributed file systems with columnar data representations such as Parquet, RCFile, and ORC. However, these representations fail to utilize high-level information (e.g., columnar formats) for low-level disk buffer management in operating systems. As a result, data analytics workloads suffer from redundant memory buffers with expensive garbage collection, unnecessary disk readahead, and cache pollution in the operating system buffer cache. We propose column cache, which unifies and restructures the buffers and caches of multiple software layers from columnar storage to operating systems. Column cache leverages high-level information such as file formats and query plans to enable adaptive disk reads and cache eviction policies. We have developed a column cache prototype for Apache Parquet and observed that it reduced redundant resource utilization in Apache Spark. Specifically, with our prototype, Spark showed a maximum speedup of 1.28x in TPC-DS workloads while increasing Linux page cache size by 18%, reducing total disk reads by 43%, and reducing garbage collection time in the Java virtual machine by 76%.
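To make the idea of format-aware disk reads concrete, the following minimal Python sketch plans reads from column-chunk byte ranges and the set of columns a query projects, rather than relying on sequential readahead. It is an illustration under stated assumptions, not the prototype described here: the `ColumnChunk` record, the `plan_reads` function, the `max_gap` parameter, and the example footer layout are all hypothetical names and values introduced for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnChunk:
    """Byte range of one column chunk, as a file footer might record it (hypothetical)."""
    column: str
    offset: int
    length: int


def plan_reads(chunks, projected_columns, max_gap=0):
    """Return coalesced (offset, length) ranges covering only the projected columns.

    Chunks whose gap to the previous range is at most max_gap bytes are merged,
    so adjacent column chunks are fetched with a single read.
    """
    wanted = sorted(
        (c for c in chunks if c.column in projected_columns),
        key=lambda c: c.offset,
    )
    ranges = []
    for chunk in wanted:
        end = chunk.offset + chunk.length
        if ranges and chunk.offset - (ranges[-1][0] + ranges[-1][1]) <= max_gap:
            start, length = ranges[-1]
            ranges[-1] = (start, max(start + length, end) - start)
        else:
            ranges.append((chunk.offset, chunk.length))
    return ranges


# Hypothetical layout: two row groups with three column chunks each.
footer = [
    ColumnChunk("id", 0, 100),
    ColumnChunk("price", 100, 400),
    ColumnChunk("comment", 500, 4000),
    ColumnChunk("id", 4500, 100),
    ColumnChunk("price", 4600, 400),
    ColumnChunk("comment", 5000, 4000),
]

# A query that projects only "id" and "price" can skip the large "comment" chunks.
reads = plan_reads(footer, {"id", "price"})
bytes_read = sum(length for _, length in reads)
```

In this made-up layout the plan touches two ranges totaling 1000 of the file's 9000 bytes. A plain sequential readahead that knows nothing about column boundaries would tend to pull the unused `comment` bytes into the buffer cache as well.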