Scalable in-memory paradigm for genomics data processing
Abstract
Disk storage and access incur substantial latency when processing genomics datasets. To accelerate downstream data processing, the data need to reside closer to the processor, in fast-access memory devices. Historically, this was difficult to achieve at large scale because of the cost and architectural constraints of dynamic random-access memory (DRAM) technology. Rapid improvements in memory technologies and computer architectures now allow cost-effective solutions for processing large amounts of data in significantly less time. The in-memory paradigm takes advantage of these new architectural designs, in which a large memory pool can be created within a cluster or a cloud, either by means of distributed in-memory databases or directly by aggregating individual nodes into a shared-memory system. An in-memory paradigm can minimize the latency associated with traditional HPC and cloud-based bioinformatics workflows, in which tools are chained sequentially, the output of one tool serves as the input to the next, and each stage performs a significant amount of disk-based I/O on secondary storage. We designed this study to investigate whether in-memory technologies, in particular scalable in-memory databases, can be used to accelerate genomics data processing.