Publication
ACE 2019
Conference paper

Managing Data Traceability in the Data Lifecycle for Deep Learning Applied to Seismic Data


Abstract

Identifying geological features in seismic images is a typical activity in the discovery of oil reservoirs. Geoscientists spend hours recognizing structures such as salt bodies, faults, and other geological features in the subsurface. Automating this activity is of high interest to academia and the oil and gas (O&G) industry. Deep Learning (DL) has become a powerful AI technique for helping geoscientists accelerate their tasks. However, training DL classifiers to identify textures in seismic images requires a non-trivial data lifecycle that comprises preprocessing, filtering, and analysis of large geological datasets, all the way from raw files (e.g., SEG-Y files) to the generation of trained DL models for geological texture classification. Because of this complexity, the data lifecycle must be decomposed into smaller parts, each potentially addressed by a different team in a collaboration of geoscientists, computer scientists, statisticians, and others. Each team has a workflow that automates data-intensive tasks and stores data. Although this increases the productivity of each team, it introduces a significant problem when the data must be analyzed in an integrated way across the distributed data stores. This paper applies scientific workflow provenance techniques to the data lifecycle of a DL-based classifier of geological textures in seismic images. The applied techniques are based on (i) dataflow modeling, considering the data dependencies across dataflows that use different data stores; (ii) workflow code instrumentation to capture data provenance; and (iii) creation of a provenance database enriched with application-specific data to provide data traceability throughout the entire data lifecycle. We validated our methodology on the data lifecycle for training the DL classifier, which is composed of five workflows that process over 5 TB of geological data stored in five distributed stores. The workflows (i) clean and filter raw seismic and horizon data files; (ii) create geospatial indexes and add geoscientists’ annotations for seismic files; (iii) do the same for horizon files; (iv) generate training and validation input datasets for the DL classifier; and (v) train the classifier on a High-Performance Computing machine. We show that the applied techniques helped the multidisciplinary teams to find, analyze, and understand the data processed in the workflows in an integrated way.
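
To make the three techniques in the abstract more concrete, the following is a minimal sketch of how instrumented workflow code might record provenance in a single database and how a trained model could then be traced back to the raw files across data stores. It is not the authors' implementation: the schema, the function names (`create_provenance_db`, `record_task`, `trace_back`), the workflow and dataset names, and the store labels are all hypothetical, and an in-memory SQLite database stands in for whatever provenance store the paper actually uses.

```python
import sqlite3


def create_provenance_db():
    """Create an in-memory provenance database (hypothetical schema)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE task_run (id INTEGER PRIMARY KEY, workflow TEXT, task TEXT)"
    )
    db.execute(
        "CREATE TABLE usage (task_id INTEGER, dataset TEXT, role TEXT, store TEXT)"
    )
    return db


def record_task(db, workflow, task, inputs, outputs):
    """Instrumentation hook: log one task run with the data it read and wrote.

    `inputs` and `outputs` map a dataset name to the store that holds it.
    """
    cur = db.execute(
        "INSERT INTO task_run (workflow, task) VALUES (?, ?)", (workflow, task)
    )
    task_id = cur.lastrowid
    rows = [(task_id, name, "input", store) for name, store in inputs.items()]
    rows += [(task_id, name, "output", store) for name, store in outputs.items()]
    db.executemany("INSERT INTO usage VALUES (?, ?, ?, ?)", rows)
    return task_id


def trace_back(db, dataset):
    """Return every upstream (dataset, store) pair that `dataset` derives from."""
    query = """
        WITH RECURSIVE upstream(dataset) AS (
            SELECT ?
            UNION
            SELECT i.dataset
            FROM upstream AS u
            JOIN usage AS o ON o.dataset = u.dataset AND o.role = 'output'
            JOIN usage AS i ON i.task_id = o.task_id AND i.role = 'input'
        )
        SELECT DISTINCT us.dataset, us.store
        FROM upstream AS up
        JOIN usage AS us ON us.dataset = up.dataset
        WHERE up.dataset != ?
        ORDER BY us.dataset
    """
    return db.execute(query, (dataset, dataset)).fetchall()


if __name__ == "__main__":
    db = create_provenance_db()
    # Hypothetical chain of workflows, each writing to a different store.
    record_task(db, "clean_filter", "filter_segy",
                {"raw.sgy": "file_system"}, {"clean.sgy": "file_system"})
    record_task(db, "index_seismic", "build_index",
                {"clean.sgy": "file_system"}, {"seismic_index": "document_db"})
    record_task(db, "generate_datasets", "make_tiles",
                {"seismic_index": "document_db"}, {"train_set": "hdf5_files"})
    record_task(db, "train", "fit_classifier",
                {"train_set": "hdf5_files"}, {"model_v1": "model_store"})

    for name, store in trace_back(db, "model_v1"):
        print(f"{name} (stored in {store})")
```

The design choice illustrated here is that each workflow keeps writing to its own store, while only lightweight records of what was read and written go to the shared provenance database. A single recursive query over those records can then follow data dependencies across workflow boundaries.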