Personalized video summary using visual semantic annotations and automatic speech transcriptions
Abstract
A personalized video summary is dynamically generated in our video personalization and summary system based on user preferences and the usage environment. The three-tier personalization system adopts a server-middleware-client architecture in order to maintain, select, adapt, and deliver rich media content to the user. The server stores the content sources along with their corresponding MPEG-7 metadata descriptions. In this paper, the metadata includes visual semantic annotations and automatic speech transcriptions. Our personalization and summarization engine in the middleware selects the optimal set of desired video segments by matching shot annotations and sentence transcripts against user preferences. The process includes shot-to-sentence alignment, summary segment selection, and user-preference matching and propagation. As a result, the relevant visual shot and audio sentence segments are aggregated and composed into a personalized video summary.
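The pipeline described above can be illustrated with a minimal sketch. This is not the paper's implementation: the data structures (shots with time spans and annotation keywords, sentences with transcript text), the overlap-based alignment rule, and the keyword-count scoring are all simplifying assumptions made for illustration.

```python
def align_sentences_to_shots(shots, sentences):
    """Shot-to-sentence alignment: assign each transcript sentence to
    the shot whose time span overlaps it most (an assumed, simplified
    alignment rule)."""
    aligned = {shot["id"]: [] for shot in shots}
    for sent in sentences:
        best_id, best_overlap = None, 0.0
        for shot in shots:
            overlap = (min(shot["end"], sent["end"]) -
                       max(shot["start"], sent["start"]))
            if overlap > best_overlap:
                best_id, best_overlap = shot["id"], overlap
        if best_id is not None:
            aligned[best_id].append(sent["text"])
    return aligned


def score_segments(shots, aligned, preferences):
    """User-preference matching: score each shot by how many preference
    keywords appear in its visual annotations or aligned sentences."""
    scores = {}
    for shot in shots:
        words = set(shot["annotations"])
        for text in aligned[shot["id"]]:
            words |= set(text.lower().split())
        scores[shot["id"]] = len(words & preferences)
    return scores


def summarize(shots, sentences, preferences, budget):
    """Summary segment selection: greedily keep the highest-scoring
    shots that fit the time budget, then compose them in temporal
    order to form the personalized summary."""
    aligned = align_sentences_to_shots(shots, sentences)
    scores = score_segments(shots, aligned, preferences)
    chosen, used = [], 0.0
    for shot in sorted(shots, key=lambda s: scores[s["id"]], reverse=True):
        length = shot["end"] - shot["start"]
        if used + length <= budget:
            chosen.append(shot)
            used += length
    return sorted(chosen, key=lambda s: s["start"])


# Hypothetical example: three annotated shots, two transcript sentences,
# and a user interested in "goal" and "coach", with a 20-second budget.
shots = [
    {"id": "s1", "start": 0, "end": 10, "annotations": ["goal", "stadium"]},
    {"id": "s2", "start": 10, "end": 20, "annotations": ["crowd"]},
    {"id": "s3", "start": 20, "end": 30, "annotations": ["interview", "coach"]},
]
sentences = [
    {"start": 1, "end": 8, "text": "what a goal by the striker"},
    {"start": 21, "end": 28, "text": "the coach praises the defense"},
]
summary = summarize(shots, sentences, {"goal", "coach", "striker"}, budget=20)
print([s["id"] for s in summary])  # -> ['s1', 's3']
```

A real system would replace the greedy budget fill with a constrained optimization over segment utilities and would propagate preference scores through the MPEG-7 description hierarchy, but the three stages (align, match, select) are the same.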