Abstract
Understanding how visual content evokes human emotion is something people do every day, but machines have not yet mastered it. In this work we address the problem of predicting the intended evoked emotion at given points within movie trailers. Movie trailers are carefully curated to elicit distinct and specific emotional responses from viewers, and are therefore well suited to emotion prediction. However, current emotion recognition systems struggle to bridge the 'affective gap': the difficulty of modeling high-level human emotions with low-level audio and visual features. To address this problem, we propose a mid-level concept feature based on detectable movie-shot concepts that we believe are closely tied to emotions. Examples of these concepts are 'Fight', 'Rock Music', and 'Kiss'. We also create two datasets: the first with shot-level concept annotations for training our concept detectors, and a separate, second dataset with emotion annotations collected throughout the trailers using the two-dimensional arousal-valence model of emotion. We report the performance of our concept detectors, and show that using the output of these detectors as a mid-level representation of movie shots predicts the evoked emotion throughout a trailer more accurately than using low-level features alone.