Publication
Data + AI Summit 2021
Talk
Enabling Vectorized Engine in Apache Spark
Abstract
This talk explains how to enable a vectorized engine in Apache Spark to accelerate Spark programs. Vectorization is an exciting approach to maximizing performance, one used by Delta Lake and other commercial databases. Apache Spark, however, does not yet use vectorization, because vector instructions are not easy to use from the current Java language. First, this talk reviews the Vector API in Java 16, which makes vector instructions easier to use from Java. Then, it discusses three possible approaches to vectorizing the Apache Spark engine with the Vector API: 1) replace external libraries such as the BLAS library, 2) use vectorized runtime routines such as a sort routine, and 3) have Catalyst generate vectorized Java code from a given SQL query. Finally, the talk shares analysis and performance results for these approaches.
Takeaways of this talk:
1. An overview of the Vector API for vectorizing Java programs
2. Multiple approaches to using a vectorized engine in Apache Spark
3. Analysis and performance results for these vectorization approaches
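As a rough illustration of what vectorizing a Java loop involves, the sketch below contrasts a scalar loop with the "main loop plus scalar tail" shape that vectorized code typically takes. It is written in plain Java so that it runs without the incubating jdk.incubator.vector module, and it is not taken from the talk: the class name, the method names, and the fixed lane count of 8 are assumptions made for illustration. With the actual Vector API, the inner per-lane loop would be replaced by a single lane-wise operation on loaded vectors.

```java
import java.util.Arrays;

public class VectorLoopSketch {
    // Hypothetical lane count, fixed here for illustration. With the real
    // Vector API the lane count would be taken from a vector species
    // rather than hard-coded.
    static final int LANES = 8;

    // Plain scalar loop: one element per iteration.
    static void addScalar(float[] a, float[] b, float[] out) {
        for (int i = 0; i < a.length; i++) {
            out[i] = a[i] + b[i];
        }
    }

    // Same computation restructured into chunks of LANES elements,
    // followed by a scalar tail for the leftover elements.
    static void addChunked(float[] a, float[] b, float[] out) {
        int i = 0;
        int upper = a.length - (a.length % LANES); // largest multiple of LANES
        for (; i < upper; i += LANES) {
            // In vectorized code, this inner loop would be one lane-wise
            // add over two loaded vectors, followed by one store.
            for (int lane = 0; lane < LANES; lane++) {
                out[i + lane] = a[i + lane] + b[i + lane];
            }
        }
        for (; i < a.length; i++) { // scalar tail
            out[i] = a[i] + b[i];
        }
    }

    public static void main(String[] args) {
        float[] a = new float[19];
        float[] b = new float[19];
        for (int i = 0; i < a.length; i++) {
            a[i] = i;
            b[i] = 2 * i;
        }
        float[] scalar = new float[19];
        float[] chunked = new float[19];
        addScalar(a, b, scalar);
        addChunked(a, b, chunked);
        // Both versions should produce identical results.
        System.out.println(Arrays.equals(scalar, chunked));
    }
}
```

The array length of 19 is chosen so that both the chunked main loop (two chunks of 8) and the scalar tail (3 elements) are exercised.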