Dataflow representation of data analyses: Toward a platform for collaborative data science
Abstract
Data science plays an increasingly important role in solving today's scientific and social challenges. To promote progress toward a cure for multiple sclerosis, the Accelerated Cure Project has created an open repository of biological and survey data on patients with multiple sclerosis. Similar large-scale repositories are being created in other domains. As the open, data-driven model of science proliferates, the research community faces a growing need for a cloud platform for collaborative data science. Such a platform should facilitate collaboration between domain experts and data scientists and possess artificial intelligence capabilities for organizing, recommending, and manipulating data analyses. In this paper, we present some foundational technologies motivated by this vision. Our system automatically extracts a high-level dataflow graph from a data analysis. This graph describes how data flows through an analysis pipeline, including which statistical methods are used and how they fit together. The system requires no special annotations from the data analyst and consumes analyses written in Python using standard tools, such as Scikit-learn and Statsmodels. In this paper, we explain how our system works and how it fits into our larger vision for a collaborative data science platform.