
    Effective automatic speech recognition data collection for under–resourced languages

    View/Open
    DeVries_NJ.pdf (1.031Mb)
    Date
    2011
    Author
    De Vries, Nicolaas Johannes
    Abstract
    Building transcribed speech corpora for under–resourced languages plays a pivotal role in developing automatic speech recognition (ASR) technologies for such languages, so a key step in developing these technologies is the effective collection of ASR data, consisting of transcribed audio and associated metadata. The problem is that no suitable tool currently exists for effectively collecting ASR data for such languages. The specific context and requirements for effectively collecting ASR data for under–resourced languages render all currently known solutions unsuitable for such a task. These requirements include portability, Internet independence and an open–source code base. This work documents the development of such a tool, called Woefzela, from the determination of the requirements for effective data collection in this context to the verification and validation of its functionality. The study demonstrates the effectiveness of using smartphones without any Internet connectivity for ASR data collection for under–resourced languages. It introduces a semi–real–time quality control philosophy that increases the amount of usable ASR data collected from speakers. Woefzela was developed for the Android operating system and is freely available for use on Android smartphones, with its source code also made available. More than 790 hours of ASR data for the eleven official languages of South Africa have been successfully collected with Woefzela. As part of this study, a benchmark for the performance of a new National Centre for Human Language Technology (NCHLT) English corpus was established.
    URI
    http://hdl.handle.net/10394/7354
    Collections
    • Engineering [1424]

    Copyright © North-West University