sDoc: Exploring social wisdom for document enhancement in web mining
Abstract
A web document can be seen as composed of textual content and social metadata of various forms (e.g., anchor text, search queries, and social annotations), both of which are valuable indicators of the document's semantic content. However, owing to the open nature of the web, both streams of data suffer from serious noise and sparseness, which have become major obstacles to the success of many web mining applications. Previous work has shown that the content of a web document can be enhanced by integrating anchor text and search queries. In this paper, we study the problem of exploring emergent social annotation for document enhancement and propose a novel reinforcement framework that generates a "social representation" of a document. Unlike prior work, our framework enhances textual content and social annotation simultaneously by exploiting the mutual reinforcement relationship between them. Two convergent models, the social content model and the social annotation model, are derived symmetrically from the framework to represent the enhanced textual content and the enhanced social annotation, respectively. We refer to the enhanced document as a Social Document, or sDoc, because it embeds complementary viewpoints from many web authors and many web visitors. In this sense, the document's semantics are enhanced by exploring social wisdom. We build the framework on a large Del.icio.us dataset and evaluate it on three typical web mining applications: annotation, classification, and retrieval. Experimental results demonstrate that the social representation of web documents significantly boosts the performance of these applications. Copyright 2009 ACM.