Automatic speech recognition of poor quality audio using generative adversarial networks
Abstract
In this study, we investigate the use of generative adversarial networks (GANs) to improve
speech recognition performance on poor-quality audio obtained from a real-world source.
A GAN is developed to transform acoustic features of noisy audio prior to downstream
acoustic modelling. The system utilises a baseline acoustic model trained on good quality
data to improve the performance on mismatched data. This is achieved without requiring
manual creation of parallel datasets. The practical relevance of the GAN is realised when
a strong commercial-grade speech recognition system, which has already been optimised
for a given set of conditions, is required to decode new mismatched data. The GAN can
then act as a front-end to the existing system.
We compare the GAN-based front-end to multi-style training (MTR) on three datasets
in a controlled environment. The GAN system achieves performance similar to that of a
comparable MTR system while being much faster to train. The developed GAN is applied to a South African
call centre dataset and achieves consistent improvements over a baseline model. The
GAN-based front-end therefore provides a practical approach to improving automatic
speech recognition (ASR) systems in mismatched environments.