Publication
BIBM 2018
Conference paper
Optimization of Genomics Analysis Pipeline for Scalable Performance in a Cloud Environment
Abstract
Cost-effective and scalable analysis of the human genome is crucial for the democratization of precision medicine. The new version of the Genome Analysis Toolkit (GATK4), an industry-standard end-to-end tool for variant discovery in next-generation sequencing (NGS) data, introduces Apache Spark support to improve scaling for both local multithreading and cluster-wide parallelization, and to facilitate deployment on cloud infrastructure. In this paper, we evaluate the performance and scalability of GATK4-Spark running on a next-generation cloud platform. After identifying bottlenecks and scaling challenges, we optimize the software stack with an optimized JVM, enhancements to Spark, and targeted configuration tuning, which in turn enables more effective use of the underlying computing resources. We demonstrate the effectiveness of these optimization techniques on a reference single nucleotide polymorphism (SNP) pipeline, achieving a computation time of one hour or less for whole human genome analysis.