Effectiveness assessment of solid-state drive used in big data services
Abstract
Big data poses challenges to the technologies required to process data of high volume, velocity, variety, and veracity. Among these challenges, the storage and computing resources required by big data analytics are usually enormous, and as a result big data capabilities are often provisioned in the cloud and delivered in the form of Web-based services. The solid-state drive (SSD) is now widely used as a basic hardware component in cloud infrastructure for big data services. For example, Amazon Web Services (AWS) offers EC2 instances with SSD storage, and its key-value data store, DynamoDB, is backed by SSD for superior performance. Compared to the hard disk drive (HDD), the SSD is superior in both access latency and bandwidth. In the foreseeable future, SSDs will be readily available on commodity servers, though their capacity will be neither large enough nor cost-effective enough to accommodate big data on their own. Therefore, it is essential to investigate how to efficiently leverage SSD as one layer in a storage hierarchy alongside HDD. In this paper, we investigate the effectiveness of using SSD in three workloads, namely standalone Hadoop MapReduce jobs, Hive jobs, and HBase queries. Firstly, we devise an approach to enable the Hadoop Distributed File System (HDFS) to support an SSD-HDD storage hierarchy. Secondly, we investigate the I/O involved in different phases of Hadoop jobs and design different schemes to place data discriminatively in the aforementioned storage hierarchy. Afterward, the effectiveness of these schemes is evaluated with respect to job run time. Finally, we summarize best practices of data placement for the examined workloads in an SSD-HDD storage hierarchy.
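For context, the sketch below illustrates how discriminative SSD/HDD placement of the kind described above can be expressed with the heterogeneous storage support that stock HDFS ships (storage types since Hadoop 2.6, the FileSystem-level policy API since roughly Hadoop 2.8). It is a minimal illustration of the general mechanism, not the paper's own approach; the policy names (ALL_SSD, ONE_SSD, HOT) are stock HDFS storage policies, while the directory paths are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: place data discriminatively across an SSD-HDD hierarchy
// using HDFS heterogeneous storage. Assumes each DataNode volume is tagged
// with its storage type in hdfs-site.xml, e.g.
//   dfs.datanode.data.dir = [SSD]/mnt/ssd/dn,[DISK]/mnt/hdd/dn
public class SsdHddPlacement {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // Hot data (e.g., frequently read intermediate tables): all replicas on SSD.
    fs.setStoragePolicy(new Path("/warehouse/hot"), "ALL_SSD");

    // Warm data: one replica on SSD for fast reads, the rest on HDD.
    fs.setStoragePolicy(new Path("/warehouse/warm"), "ONE_SSD");

    // Bulk data: all replicas on HDD ("HOT" is the default DISK-only policy).
    fs.setStoragePolicy(new Path("/warehouse/bulk"), "HOT");
  }
}
```

The same policies can also be set from the command line with `hdfs storagepolicies -setStoragePolicy -path <path> -policy <name>`, which avoids writing any code when the placement decision is static.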