A methodology and tool suite for evaluation of accuracy of interoperating statistical natural language processing engines
Abstract
Evaluating the accuracy of natural language processing (NLP) engines plays an important role in their development and improvement. Such evaluation usually takes place at the level of a single engine; for example, there are established evaluation methods for tasks such as speech recognition, machine translation, and story boundary detection. Many real-world applications, however, require combinations of these functions, which has become practical now that individual NLP engines attain sufficient accuracy to be combined for complex tasks. It is not evident, though, how the accuracy of the output of such aggregates of engines should be evaluated. We present an evaluation methodology to address this problem. The key contribution of our work is an extensible methodology that narrows down the possible combinations of machine outputs and ground truths to be compared at various stages in an aggregate of interoperating engines. We also describe two example evaluation modules that we developed following this methodology. Copyright © 2008 ISCA.