Effective automatic speech recognition data collection for under–resourced languages
De Vries, Nicolaas Johannes
MetadataShow full item record
As building transcribed speech corpora for under–resourced languages plays a pivotal role in developing automatic speech recognition (ASR) technologies for such languages, a key step in developing these technologies is the effective collection of ASR data, consisting of transcribed audio and associated meta data. The problem is that no suitable tool currently exists for effectively collecting ASR data for such languages. The specific context and requirements for effectively collecting ASR data for under resourced languages, render all currently known solutions unsuitable for such a task. Such requirements include portability, Internet independence and an open–source code–base. This work documents the development of such a tool, called Woefzela, from the determination of the requirements necessary for effective data collection in this context, to the verification and validation of its functionality. The study demonstrates the effectiveness of using smartphones without any Internet connectivity for ASR data collection for under–resourced languages. It introduces a semireal–time quality control philosophy which increases the amount of usable ASR data collected from speakers. Woefzela was developed for the Android Operating System, and is freely available for use on Android smartphones, with its source code also being made available. A total of more than 790 hours of ASR data for the eleven official languages of South Africa have been successfully collected with Woefzela. As part of this study a benchmark for the performance of a new National Centre for Human Language Technology (NCHLT) English corpus was established.
- Engineering