
dc.contributor.advisor: Davel, M.H.
dc.contributor.author: De Vries, Nicolaas Johannes [en_US]
dc.date.accessioned: 2012-09-10T16:24:52Z
dc.date.available: 2012-09-10T16:24:52Z
dc.date.issued: 2011 [en_US]
dc.identifier.uri: http://hdl.handle.net/10394/7354
dc.description: Thesis (M.Ing. (Electrical Engineering))--North-West University, Potchefstroom Campus, 2012.
dc.description.abstract: Building transcribed speech corpora for under-resourced languages plays a pivotal role in developing automatic speech recognition (ASR) technologies for such languages, so a key step in developing these technologies is the effective collection of ASR data, consisting of transcribed audio and associated metadata. The problem is that no suitable tool currently exists for effectively collecting ASR data for such languages: the specific context and requirements of the task, which include portability, Internet independence and an open-source code base, render all currently known solutions unsuitable. This work documents the development of such a tool, called Woefzela, from the determination of the requirements for effective data collection in this context to the verification and validation of its functionality. The study demonstrates the effectiveness of using smartphones without any Internet connectivity for ASR data collection for under-resourced languages. It introduces a semi-real-time quality control philosophy which increases the amount of usable ASR data collected from speakers. Woefzela was developed for the Android operating system and is freely available for use on Android smartphones, with its source code also made available. More than 790 hours of ASR data for the eleven official languages of South Africa have been successfully collected with Woefzela. As part of this study, a benchmark for the performance of a new National Centre for Human Language Technology (NCHLT) English corpus was established. [en_US]
dc.publisher: North-West University
dc.subject: Under-resourced languages [en_US]
dc.subject: New languages [en_US]
dc.subject: Speech resources [en_US]
dc.subject: ASR corpora [en_US]
dc.subject: Automatic speech recognition [en_US]
dc.subject: Developing world [en_US]
dc.subject: Speech data collection [en_US]
dc.subject: Spoken language resources [en_US]
dc.subject: Android [en_US]
dc.subject: NCHLT [en_US]
dc.title: Effective automatic speech recognition data collection for under-resourced languages [en]
dc.type: Thesis [en_US]
dc.description.thesistype: Masters [en_US]

