Stark: Optimizing In-Memory Computing for Dynamic Dataset Collections
Abstract
Emerging distributed in-memory computing frameworks, such as Apache Spark, can process a huge amount of cached data within seconds. This remarkably high efficiency requires the system to well balance data across tasks and ensure data locality. However, it is challenging to satisfy these requirements for applications that operate on a collection of dynamically loaded and evicted datasets. The dynamics may lead to time-varying data volume and distribution, which would frequently invoke expensive data re-partition and transfer operations, resulting in high overhead and large delay. To address this problem, we present Stark, a system specifically designed for optimizing in-memory computing on dynamic dataset collections. Stark enforces data locality for transformations spanning multiple datasets (e.g., join and cogroup) to avoid unnecessary data replications and shuffles. Moreover, to accommodate fluctuating data volume and skeweddata distribution, Stark delivers elasticity into partitions to balance task execution time andreduce job makespan. Finally, Stark achieves bounded failure recovery latency byoptimizing the data checkpointing strategy. Evaluations on a 50-server cluster show that Stark reduces the job makespan by 4X and improves system throughput by 6X compared to Spark.