Word N-gram probability estimation from a Japanese raw corpus
Abstract
Statistical language modeling plays an important role in state-of-the-art speech recognizers. The most widely used language model (LM) is the word n-gram model, which is based on the frequencies of words and word sequences in a corpus. In various Asian languages, however, words are not delimited by whitespace, so sentences must be annotated with word boundary information to prepare a statistically reliable large corpus. In this paper, we propose a method for building an LM directly from a raw corpus. In this method, sentences in the raw corpus are regarded as sentences annotated with stochastic word boundary information. In the experiments, we compared the predictive power of an LM built only from a segmented corpus with that of an LM built from the segmented corpus and a raw corpus. The results showed that using the raw corpus with our method reduced the perplexity by 42.9%.
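To make the idea of stochastic word boundary information concrete, the following is a minimal sketch of how expected word frequencies might be accumulated from an unsegmented sentence. It is an illustration under stated assumptions, not the estimator defined in the paper: the abstract gives no formula, so the sketch assumes that each position between two characters carries an independent probability of being a word boundary, and that the sentence start and end are boundaries with probability 1. The function name `expected_unigram_counts` and the parameters `boundary_probs` and `max_len` are hypothetical.

```python
from collections import defaultdict


def expected_unigram_counts(chars, boundary_probs, max_len=4):
    """Expected word counts for one unsegmented sentence.

    chars: the sentence as a string of n characters.
    boundary_probs: n - 1 probabilities; boundary_probs[i] is the assumed
        probability of a word boundary between chars[i] and chars[i + 1].
    max_len: longest candidate word considered, in characters.

    Assumption (not from the source): boundaries are independent, and the
    sentence start and end are boundaries with probability 1.
    """
    n = len(chars)
    if len(boundary_probs) != n - 1:
        raise ValueError("need exactly len(chars) - 1 boundary probabilities")

    # p[i] is the probability of a boundary immediately before chars[i];
    # p[0] and p[n] are the sentence start and end.
    p = [1.0] + list(boundary_probs) + [1.0]

    counts = defaultdict(float)
    for start in range(n):
        no_inner_boundary = 1.0
        for end in range(start + 1, min(n, start + max_len) + 1):
            if end > start + 1:
                # Extending the candidate by one character adds one more
                # inner position that must NOT be a boundary.
                no_inner_boundary *= 1.0 - p[end - 1]
            # The substring is a word if both edges are boundaries and no
            # inner position is one.
            counts[chars[start:end]] += p[start] * no_inner_boundary * p[end]
    return dict(counts)
```

With deterministic probabilities (all 0 or 1) this reduces to ordinary counting over a segmented sentence. In general, the expected counts sum to the expected number of words in the sentence, which is 1 plus the sum of the boundary probabilities, provided `max_len` is at least the sentence length.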