Multimodal Deep Learning with Boosted Trees for Edge Inference
Abstract
We present a method for combining and optimizing knowledge from neural network and gradient boosted tree models for inference on edge devices. This matters in multimodal settings that pair image data with sensor or time series data. The proposed approach retains the learning capabilities of these models and jointly distills three sources of knowledge: the tree structure, approximated by an embedding layer; the internal representations of a CNN; and the aggregated outputs of the heterogeneous teacher models. The resulting multimodal network outperforms unimodal and standard multimodal training approaches. It is also smaller and consumes less memory during inference than the alternative networks, making it well suited to applications at the edge.
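
As a rough illustration of what a joint distillation objective of this kind might look like, the following Python sketch combines three terms: one matching the student's softened outputs to the aggregated teacher outputs, one matching its image features to the CNN teacher's internal representation, and one matching its tabular features to the embedding that approximates the tree structure. The function names, the loss weights, the temperature, the plain averaging of teacher outputs, and the choice of KL divergence and mean squared error are all assumptions made for illustration. They are not taken from the method itself.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def joint_distillation_loss(
    student_logits,      # student class scores
    student_img_feat,    # student features on the image branch
    student_tree_feat,   # student features on the sensor/time series branch
    cnn_teacher_logits,  # class scores from the CNN teacher
    tree_teacher_probs,  # class probabilities from the boosted tree teacher
    cnn_teacher_feat,    # internal representation of the CNN teacher
    tree_embedding,      # embedding approximating the tree structure
    temperature=2.0,     # assumed softening temperature
    w_out=1.0, w_img=0.5, w_tree=0.5,  # assumed term weights
):
    # Aggregate the heterogeneous teacher outputs (assumption: plain average).
    cnn_probs = softmax(cnn_teacher_logits, temperature)
    teacher_probs = [(a + b) / 2 for a, b in zip(cnn_probs, tree_teacher_probs)]

    # Output-level term: student should match the aggregated teacher outputs.
    student_probs = softmax(student_logits, temperature)
    out_term = kl_divergence(teacher_probs, student_probs)

    # Representation-level terms for the two modalities.
    img_term = mse(student_img_feat, cnn_teacher_feat)
    tree_term = mse(student_tree_feat, tree_embedding)

    return w_out * out_term + w_img * img_term + w_tree * tree_term
```

In this sketch the loss is zero when the student reproduces every teacher signal exactly and grows as any of the three signals diverges. In practice the weights would need tuning, and the student and teacher features would need matching dimensions, for example via a projection layer.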