Jim Challenger, Paul Dantzig, et al.
ACM/IEEE SC 1998
Storage and memory systems for modern data analytics are heavily layered: shared persistent data, cached data, and non-shared execution data are managed in separate systems, namely a distributed file system such as HDFS, an in-memory file system such as Alluxio, and a computation framework such as Spark. Such layering introduces significant performance and management costs. In this paper, we propose a single system called Pangea that manages all data, both intermediate and long-lived, along with its buffering and caching, page replacement, data placement optimization, and failure recovery, in one monolithic distributed storage system without any layering. We present a detailed performance evaluation of Pangea and show that its performance compares favorably with several widely used layered systems such as Spark.
Sugato Bagchi, Eugene Hung, et al.
ValueTools 2006
Arun Iyengar
IEEE TKDE
William Conner, Arun Iyengar, et al.
WWW 2009