Fast, Transparent, and High-Fidelity Memoization Cache-Keys for Computational Workflows
Abstract
Computational workflows are important methods for automating complex data-generation and analysis pipelines. Workflows are composed of sub-graphs that perform specific tasks. Certain sub-graphs may appear in multiple workflows. This implies that the same task, with the same input, may execute multiple times thereby wasting computational resources. Additionally, hybrid cloud environments spanning across on-premises and public cloud deployments are increasingly popular. Identical tasks may not only execute multiple times but also on multiple heterogeneous environments. Memoization, a technique to reduce the number of redundant task executions, can tackle this problem. There is a significant body of work for the memoization of workflows. However, prior studies have only been able to satisfy at most two of the three desired features of workflow memoization: transparency (leveraged automatically), performance, and fidelity (identify desired redundant tasks). This paper first introduces a taxonomy for understanding memoization. Next, we describe a novel algorithm that uses a walk in the workflow graph to transparently generate fast and high-fidelity memoization cache-keys, overcoming limitations in prior approaches. We evaluate our scheme using three industry motivated quantum chemistry workflows and two geo-distributed OpenShift clusters. Our experimental evaluation results show the implemented scheme is fast (O(ms)) and can remove nontrivial sources of false positives and negatives that other methods may generate. Finally, we illustrate the proposed workflow memoization scheme can reduce the duration of our benchmarks by up to 10.55x nearly matching the theoretical maximum speedup of memoization for these workloads (10.69x).