A structure-sensitive framework for text categorization
Abstract
This paper presents a framework called Structure Sensitive CATegorization (SSCAT) that exploits document structure for improved categorization. The framework has two parts. (1) Documents often have layout structure, in which logically coherent text is grouped into fields using a mark-up language. We use a log-linear model that associates one or more features with each field. The weights of these field features are learnt from training data and quantify the per-class importance of each field feature in determining the document's category. (2) We employ a technique that exploits the parse tree of fields that are phrasal constructs, such as the title, and associates weights with the words in these constructs while boosting the weights of important words, called focus words. These weights are learnt from example instances of phrasal constructs marked with their corresponding focus words. The learning is accomplished by training a classifier that uses linguistic features obtained from the text's parse structure. The weighted words in fields with phrasal constructs are then used to obtain the features for those fields in the overall framework. SSCAT was tested on the supervised categorization of over one million products from Yahoo!'s on-line shopping data. With an accuracy of over 90%, our classifier outperforms Naive Bayes and Support Vector Machines. This not only shows the effectiveness of SSCAT but also strengthens our belief that linguistic features based on natural language structure can improve tasks such as text categorization.
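
As an illustration of the first part, the following is a minimal sketch of how a log-linear model with per-class field weights could turn field features into class probabilities. It is not the SSCAT implementation: the function name `class_probabilities`, the field names, the class names, and all numeric values are hypothetical, and the weights are fixed by hand here, whereas the framework learns them from training data.

```python
import math


def class_probabilities(field_features, weights):
    """Return p(class | document) under a simple log-linear model.

    field_features[c][f]: value of the feature for field f with respect to class c
                          (hypothetical values in this sketch).
    weights[c][f]:        per-class weight of field f (hand-set here; learnt in practice).
    """
    # Linear score per class: weighted sum of that class's field features.
    scores = {
        c: sum(weights[c][f] * value for f, value in feats.items())
        for c, feats in field_features.items()
    }
    # Softmax normalization, shifted by the max score for numerical stability.
    top = max(scores.values())
    exp_scores = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exp_scores.values())
    return {c: e / z for c, e in exp_scores.items()}


# Assumed example: one product document with two fields and two candidate classes.
field_features = {
    "cameras": {"title": 0.9, "description": 0.2},
    "phones": {"title": 0.1, "description": 0.6},
}
weights = {
    "cameras": {"title": 2.0, "description": 0.5},
    "phones": {"title": 1.5, "description": 1.0},
}

probs = class_probabilities(field_features, weights)
predicted = max(probs, key=probs.get)
print(predicted, probs)
```

In this sketch a large weight on the `title` field for a class means that evidence from the title counts more toward that class than the same amount of evidence from the description, which is the sense in which the weights quantify per-class field importance.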