Conference paper

Semantic indexing of multimedia using audio, text and visual cues


We describe methods for automatic labeling of high-level semantic concepts in documentary-style videos. The emphasis of this paper is on audio processing and on fusing information from multiple modalities. The work is an initial step towards a trainable system that acquires a collection of generic "intermediate" semantic concepts across modalities (such as audio, video, and text) and combines information from these modalities to automatically label a "high-level" concept. Initial results suggest that multi-modal fusion achieves a 12.5% relative improvement over the best unimodal model.
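To make the fusion idea and the "relative improvement" figure concrete, here is a minimal sketch. It assumes a simple late-fusion scheme, a weighted average of per-modality confidence scores; the abstract does not say which fusion method the paper uses, so this is an illustration only. The function names, modality weights, and all numeric values are hypothetical and are not taken from the paper.

```python
# Illustrative sketch only: weighted-average late fusion is an assumption,
# not necessarily the fusion method used in the paper.

def fuse_scores(scores, weights):
    """Combine per-modality confidence scores into one score.

    scores  -- dict mapping modality name to a confidence in [0, 1]
    weights -- dict mapping modality name to a non-negative weight
    """
    total_weight = sum(weights[m] for m in scores)
    return sum(weights[m] * scores[m] for m in scores) / total_weight


def relative_improvement(baseline, improved):
    """Relative gain of `improved` over `baseline`, as a fraction."""
    return (improved - baseline) / baseline


# Hypothetical detector outputs for one video shot.
scores = {"audio": 0.6, "text": 0.4, "visual": 0.8}
weights = {"audio": 1.0, "text": 1.0, "visual": 2.0}
fused = fuse_scores(scores, weights)

# Hypothetical metric values: a best unimodal score of 0.40 and a fused
# score of 0.45 correspond to a 12.5% relative improvement.
gain = relative_improvement(0.40, 0.45)
print(f"fused score: {fused:.2f}, relative improvement: {gain:.1%}")
```

Note that a relative improvement is measured against the baseline value itself, so a 12.5% relative gain over a baseline of 0.40 is an absolute gain of only 0.05.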
