Effective automatic speech recognition data collection for under-resourced languages

De Vries, Nicolaas Johannes

Files

DeVries_NJ.pdf (1.03 MB)

Date

2011

Authors

De Vries, Nicolaas Johannes

Supervisors

Davel, M.H.

Publisher

North-West University

Abstract

As building transcribed speech corpora for under-resourced languages plays a pivotal role in developing automatic speech recognition (ASR) technologies for such languages, a key step in developing these technologies is the effective collection of ASR data, consisting of transcribed audio and associated metadata. The problem is that no suitable tool currently exists for effectively collecting ASR data for such languages. The specific context and requirements for effectively collecting ASR data for under-resourced languages render all currently known solutions unsuitable for such a task. These requirements include portability, Internet independence and an open-source code base. This work documents the development of such a tool, called Woefzela, from the determination of the requirements necessary for effective data collection in this context to the verification and validation of its functionality. The study demonstrates the effectiveness of using smartphones without any Internet connectivity for ASR data collection for under-resourced languages. It introduces a semi-real-time quality control philosophy which increases the amount of usable ASR data collected from speakers. Woefzela was developed for the Android operating system and is freely available for use on Android smartphones, with its source code also being made available. More than 790 hours of ASR data for the eleven official languages of South Africa have been successfully collected with Woefzela. As part of this study, a benchmark for the performance of a new National Centre for Human Language Technology (NCHLT) English corpus was established.

Description

Thesis (M.Ing. (Electrical Engineering))--North-West University, Potchefstroom Campus, 2012.

Keywords

Under-resourced languages, New languages, Speech resources, ASR corpora, Automatic speech recognition, Developing world, Speech data collection, Spoken language resources, Android, NCHLT

URI

http://hdl.handle.net/10394/7354

Collections

Engineering
