A Hierarchical Decision Approach to Large-Vocabulary Discrete Utterance Recognition
Abstract
Very short response time is a critical requirement for automatic discrete utterance recognition. The real-time vocabulary size of most of today's commercially available recognizers is limited to several hundreds of utterances, primarily due to the fact that detailed acoustic matching involves considerable computation. The method presented here offers an economical solution to the real-time large-vocabulary recognition problem by carrying out recognition in two stages. In the initial stage, the incoming utterance is linearly matched against the entire vocabulary using only two features-utterance duration and either two or three average spectra for each utterance. While the number of prototypes matched is large, the time required per match is substantially reduced. During this initial stage, a preset number of best-match prototypes is determined for each unknown input. In the second stage, matching is performed for the best-match list based upon more detailed features (e.g., 10-ms log-power spectra), using more elaborate matching methodology, e.g., dynamic programming. Evaluation experiments were conducted using the 2000 most frequent words in an office-correspondence corpus and three normal adult-male talkers. It was observed that first-stage best-match lists of 30–50 items included the “correct” words between 99.0 and 99.5 percent of the time. Using DP on 10-ms spectral samples for the second stage, recognition accuracy ranged from 86.5 to 94.5 percent. A match-limiter, when used with a 50–64-word, commercially available recognizer for the second stage, makes near-real-time large-vocabulary recognition feasible. © 1983 IEEE