Trajectory modelling with limited speech data
Abstract
State-of-the-art automatic speech recognition (ASR) systems are built using hundreds or even thousands of hours of speech data. Even then, high recognition accuracy is achievable only by carefully constraining the recognition domain. This reliance on large speech corpora remains a major challenge when building ASR systems for resource-constrained languages. The need for large corpora is partially due to the substantial variation observed in different spoken realisations of the same text, but co-articulation also plays a significant role. When building an ASR system, it is not sufficient to observe a large number of samples of each acoustic unit during training; it is necessary to observe sufficient samples appearing in similar contexts to those found in the test data. To obtain a better understanding of co-articulation effects, we analysed the behaviour of phones in context, using trajectory models. We developed a new model that captures the feature trajectories of acoustic unit transitions directly, and developed a way of representing the characteristic changes between different units. We found it beneficial to model these characteristic changes at the spectral rather than cepstral level, by extracting features directly from the filter bank. Applying auto-regressive moving-average (ARMA) filtering to smooth spectral energies before constructing cepstral features also improved the accuracy of trajectories. We experimented with different approaches to identify transition model alignments and selected techniques that allowed us to locate the characteristic changes between units with the required accuracy. We developed a new compact representation of speech units in context, estimating model parameters using the trajectory models. These models function at a sub-transitional level, enabling the construction of units that occur in unseen and rare contexts. Applying this technique, it was possible to create synthetic samples of triphone contexts by first constructing diphone transitions and concatenating these to form synthetic trajectories. We found that better acoustic models (producing higher likelihoods on unseen test data) could be developed by augmenting existing data with synthetic samples. When the samples were used to augment the training data in an end-to-end ASR system, promising results were obtained. A useful side effect is that the synthetic samples provide a new mechanism to improve cluster selection for unseen or rare phones during state-tying.
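The ARMA smoothing step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the filter form or its order, so the code below assumes the simple ARMA smoother commonly used in feature post-processing, where each output frame averages the previous `order` smoothed frames with the current and next `order` input frames. The function name `arma_smooth`, the parameter `order`, and the choice to leave edge frames unchanged are all illustrative assumptions, not details taken from the work itself.

```python
import numpy as np


def arma_smooth(energies, order=2):
    """Smooth filter-bank energies along time with an assumed ARMA filter.

    energies: array of shape (num_frames, num_bands), e.g. log filter-bank
              energies (hypothetical input layout).
    order:    number of past outputs and future inputs averaged (assumed).
    """
    x = np.asarray(energies, dtype=float)
    y = x.copy()  # edge frames are left unsmoothed in this sketch
    num_frames = x.shape[0]
    for t in range(order, num_frames - order):
        past_outputs = y[t - order:t].sum(axis=0)        # auto-regressive part
        window_inputs = x[t:t + order + 1].sum(axis=0)   # moving-average part
        y[t] = (past_outputs + window_inputs) / (2 * order + 1)
    return y


# Small demonstration on synthetic data: a slow trend plus frame-level noise.
rng = np.random.default_rng(0)
trend = np.linspace(0.0, 1.0, 50)[:, None] * np.ones((1, 4))
noisy = trend + 0.1 * rng.standard_normal((50, 4))
smoothed = arma_smooth(noisy, order=2)
```

In a pipeline of the kind the abstract describes, the smoothed energies would then be passed on to cepstral feature construction; that stage is not shown here.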