CLIR for informal content in Arabic forum posts
Abstract
The field of Cross-Language Information Retrieval (CLIR) addresses the problem of finding documents in one language that are relevant to a question posed in a different language. Retrieving answers to questions written using formal vocabulary from collections of informal documents, such as those found in many types of social media, is a largely unexplored subfield of CLIR. Because formal and informal content are often intermingled, CLIR systems that excel at finding formal content may tend to select formal over informal content. To measure this effect, a test collection annotated for both relevance and informality is needed. This paper describes the development of a small test collection for this task, with questions posed in formal English and documents consisting of intermixed formal and informal Arabic. Experiments with this collection show that dialect classification can help to recognize informal content, thus improving precision. At the same time, the results indicate that neither dialect-tuned morphological analysis nor a lightweight CLIR approach that minimizes propagation of translation errors yet yields a reliable improvement in recall for informal content when compared to a straightforward document translation architecture.