Publication
Big Data 2018
Conference paper
Column Cache: Buffer Cache for Columnar Storage on HDFS
Abstract
Columnar storage is a data source for data analytics in distributed computing frameworks. For portability and scalability, columnar storage is built on top of existing distributed file systems with columnar data representations such as Parquet, RCFile, and ORC. However, these representations fail to utilize high-level information (e.g., columnar formats) for low-level disk buffer management in operating systems. As a result, data analytics workloads suffer from redundant memory buffers with expensive garbage collection, unnecessary disk readahead, and cache pollution in the operating system buffer cache.

We propose column cache, which unifies and restructures the buffers and caches of multiple software layers, from columnar storage down to the operating system. Column cache leverages high-level information such as file formats and query plans to enable adaptive disk reads and cache eviction policies. We have developed a column cache prototype for Apache Parquet and observed that it reduced redundant resource utilization in Apache Spark. Specifically, with our prototype, Spark showed a maximum speedup of 1.28x in TPC-DS workloads while increasing Linux page cache size by 18%, reducing total disk reads by 43%, and reducing garbage collection time in the Java virtual machine by 76%.
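To make the idea of format-aware disk reads concrete, the following is a minimal sketch and not the paper's implementation. It assumes a hypothetical `ColumnChunk` record (column name, byte offset, length), similar in spirit to what a columnar file footer describes, and a hypothetical `plan_reads` helper that returns only the byte ranges needed by the columns a query projects, merging ranges that are adjacent or separated by at most `max_gap` bytes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnChunk:
    """Hypothetical footer entry: where one column chunk lives in the file."""
    column: str
    offset: int  # byte offset of the chunk within the file
    length: int  # chunk size in bytes


def plan_reads(chunks, projected, max_gap=0):
    """Return merged (offset, length) ranges covering only projected columns.

    Chunks of columns the query does not touch are skipped unless they fall
    inside a gap of at most `max_gap` bytes between two wanted chunks.
    """
    wanted = sorted(
        (c for c in chunks if c.column in projected),
        key=lambda c: c.offset,
    )
    ranges = []
    for c in wanted:
        if ranges and c.offset - (ranges[-1][0] + ranges[-1][1]) <= max_gap:
            # Close enough to the previous range: extend it instead of
            # issuing a separate read.
            start, length = ranges[-1]
            end = max(start + length, c.offset + c.length)
            ranges[-1] = (start, end - start)
        else:
            ranges.append((c.offset, c.length))
    return ranges


# Illustrative layout with four column chunks stored back to back.
chunks = [
    ColumnChunk("a", 0, 100),
    ColumnChunk("b", 100, 200),
    ColumnChunk("c", 300, 50),
    ColumnChunk("d", 350, 50),
]

# A query that projects only a, c, and d never needs the bytes of b.
print(plan_reads(chunks, {"a", "c", "d"}))
```

In this sketch, `max_gap` stands in for the trade-off between issuing fewer, larger reads and pulling unneeded bytes into the cache; the value a real system would use depends on the storage device and workload.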