Computation of Acoustic Likelihoods
Consider a set of T observation vectors (or feature vectors) and a set of Gaussian Mixture Models. The computation of acoustic likelihoods consists in computing the log-likelihood of each pair of observation vector and GMM. Considering that a typical medium-sized speech recognition system may contain on the order of 600 000 Gaussians, this is a highly intensive computational task.
However, during a Viterbi beam search, only acoustic likelihoods for Gaussians associated with states having survived the pruning process are needed. This considerably reduces the computational burden. Yet, even with this approach, the computation time of acoustic likelihoods remains an important part of the overall process.
The Speech Recognition Network
Traditional speech recognition systems such as HTK are constructed using weighted automata. In speech recognition, the recognition network has many levels of representation. For example, possible sentences are represented by sequences of words that are themselves represented by sequences of phonemes. In the context of automata, these different representations are implemented using the substitution operation. For example, in the graph of words, a transition for a given word w is substituted by a subgraph representing its phonetic sequence. The major disadvantage of this approach is that a change in the network (for example, the addition of a new level of representation) implies that the program performing the search in the recognition network also has to be updated.
The composition operation allows FSTs to model many levels of representations in a normalized way. Therefore, the recognizer can work on different recognition networks (with different levels of representation) without updating the program itself.
Transducers Combination
The transducer HCDG is constructed using the composition operation. However, in the case of a large vocabulary system, the intermediate results grow very rapidly and there is not enough memory to perform the composition. The problem is solved by using the determinization operation since in the case of transducers used in speech recognition, the determinization considerably decreases the number of states and transitions by eliminating the number of redundant paths.
Therefore, the creation of HCDG proceeds in several steps. The transducer DG is obtained by the composition D • G and it has to be determinized. Recall that transducer D maps sequences of phonemes to words. The presence of homophones makes transducer DG non-determinizable since an unbounded delay is introduced. In particular, the presence of homophones comes from the fact that two or more different words have the same phoneme sequence.
Mapping Recognition FST States to Heuristic States
Recall that the A* search uses the heuristic cost given by the function h(qr, t), where qr is a recognition FST state. In essence, this function performs a lookup in the Viterbi treillis computed on the heuristic. Thus, we need to know which state (qh, t) in the heuristic is equivalent to (qr, t). A mapping between states of the heuristic and those of the recognition FST must thus be discovered.
Both the heuristic and recognition FST map a sequence of distributions to a sequence of words. Since both FSTs are built with the same HMM, the same context dependency rules and the same list of words, they both translate the same sequence of distributions to a sequence of words. This characteristic can be used for building the mapping between states, considering that a sequence of states representing a sequence of distributions in the recognition FST should be equivalent to a sequence of states representing the same sequence of distributions in the heuristic FST. The word sequences produced by both FSTs can be ignored since they are useless for establishing the mapping.
Effect of the Lookahead on Accuracy and Computation Time
This experiment explores the effect of different lookahead values (between 10 and 200) on both the accuracy and computation time. The A* search computation time shows a similar behavior, with a small drop at the beginning followed by an increase with the lookahead duration. The cause of this phenomenon is that the error, which is the difference between the heuristic approximation and the real cost, increases with the length of the lookahead. This reduces the discriminative power of the heuristic and, consequently, increases the complexity of the search.
The heuristic computation time increases linearly with the length of the lookahead. That is to be expected since the amount of work for each frame does not depend on the input audio and is thus constant for a given heuristic WFS.
|
Table des matières
INTRODUCTION
CHAPTER 1 BACKGROUND
1.1 Speech Recognition
1.1.1 Feature Extraction
1.1.1.1 Pre-Emphasis
1.1.1.2 Windowing
1.1.1.3 Fourier Transform
1.1.1.4 Filterbank Analysis
1.1.1.5 Delta and Acceleration Coefficients
1.1.2 Language Model
1.1.2.1 Smoothing of N-Gram Models
1.1.2.2 Witten-Bell Discounting
1.1.2.3 Good-Turing Discounting
1.1.2.4 Backoff N-Gram Model
1.1.2.5 Evaluation of Language Models
1.1.3 Acoustic Model
1.1.3.1 Hidden Markov Model
1.1.3.2 Evaluation Problem
1.1.3.3 Decoding Problem
1.1.3.4 Learning Problem
1.1.3.5 Preliminary Defintions
1.1.3.6 Baum-Welch Algorithm
1.1.4 Evaluation
1.2 Weighted Finite State Transducers
1.2.1 Automata
1.2.2 Weighted Automata
1.2.3 Epsilon Transitions
1.2.4 Determinism
1.2.5 Finite-State Transducers
1.2.6 String-To-String Transducers
1.2.7 Weighted String-To-String Transducers
1.2.8 Epsilon Symbols in String-To-String Transducers
1.2.9 Sequential Transducers
1.2.10 Operations on Transducers
1.2.10.1 Reverse
1.2.10.2 Composition
1.2.10.3 Determinization
1.2.10.4 Other Operations
1.3 Parallel Architectures
1.3.1 Multicore Processors
1.3.2 Graphic Processor Units
1.3.2.1 Introduction to CUDA
1.3.3 Performance Evaluation
1.4 Summary
CHAPTER 2 ACOUSTIC LIKELIHOOD COMPUTATIONS
2.1 Computation of Acoustic Likelihoods
2.2 Computation of Acoustic Likelihoods on Multicore CPUs
2.3 Computation of Acoustic Likelihoods on GPUs
2.3.1 Reduction Algorithm
2.3.2 Kernel for Acoustic Computation
2.3.3 Consecutive Frame Computation
2.4 Results
2.5 Summary
CHAPTER 3 SEARCHING THE RECOGNITION NETWORK
3.1 The Speech Recognition Network
3.1.1 Speech Recognition Transducers
3.1.1.1 Transducer H
3.1.1.2 Transducer C
3.1.1.3 Transducer D
3.1.1.4 Transducer G
3.1.1.5 Phonological Rules
3.1.2 Transducers Combination
3.2 Viterbi Algorithm
3.3 A* Algorithm
3.3.1 Unigram Language Model Heuristic
3.3.2 Mapping Recognition FST States to Heuristic States
3.3.3 Block Processing
3.3.4 Heuristic Decoding Parallelization
3.3.5 Consecutive Block Computing
3.3.6 Computing Heuristic Costs on GPUs
3.4 Real-Time Transcription
3.4.1 A* Search Real-Time Transcription
3.5 Results
3.5.1 Effect of the Lookahead on Accuracy and Computation Time
3.5.2 Parallelization of Heuristic Computation
3.6 Summary
CHAPTER 4 RESULTS
4.1 Putting It All Together
4.2 Experimental Setup
4.3 Comparison with the Classical Viterbi Beam Search
4.4 Using a GPU and a Multi-Core Processor
4.5 Using a Non-Admissible Heuristic
4.6 Summary
CHAPTER 5 ANOTHER APPLICATION OF GPUS : COPY DETECTION
5.1 Detection Process
5.1.1 Fingerprint Matching
5.1.2 Copy Detector
5.1.3 Energy-Difference Fingerprint
5.1.4 Nearest-Neighbor Fingerprint
5.1.5 Nearest-Neighbor Kernel
5.1.6 Nearest-Neighbor Feature Search
5.1.7 Combining Both Fingerprints
5.2 Applications of Copy Detection
5.2.1 Detection of Illegal Audio Copy
5.2.2 Advertisement Detection
5.2.3 Film Edition
5.3 Summary
CONCLUSION
BIBLIOGRAPHY
Télécharger le rapport complet