dc.contributor.advisor: Van Huyssteen, Gerhard B
dc.contributor.advisor: Barnard, E.
dc.contributor.author: Puttkammer, Martin Johannes
dc.date.accessioned: 2015-02-20T13:31:25Z
dc.date.available: 2015-02-20T13:31:25Z
dc.date.issued: 2014
dc.identifier.uri: http://hdl.handle.net/10394/13408
dc.description: PhD (Linguistics and Literary Theory), North-West University, Potchefstroom Campus, 2014 [en_US]
dc.description.abstract: The development of linguistic data, especially annotated corpora, is imperative for the human language technology enablement of any language. The annotation process is, however, often time-consuming and expensive. As such, various projects make use of several strategies to expedite the development of human language technology resources. For resource-scarce languages – those with limited resources, finances and expertise – the efficiency of these strategies has not been conclusively established. This study investigates the efficiency of some of these strategies in the development of resources for resource-scarce languages, in order to provide recommendations for future projects facing decisions regarding which strategies they should implement. For all experiments, Afrikaans is used as an example of a resource-scarce language. Two tasks, viz. lemmatisation of text data and orthographic transcription of audio data, are evaluated in terms of quality and in terms of the time required to perform the task. The main focus of the study is on the skill level of the annotators, software environments which aim to improve the quality and time needed to perform annotations, and whether it is beneficial to annotate more data, or to increase the quality of the data. We outline and conduct systematic experiments on each of the three focus areas in order to determine the efficiency of each. First, we investigated the influence of a respondent’s skill level on data annotation by using untrained, sourced respondents for annotation of linguistic data for Afrikaans. We compared data annotated by experts, novices and laymen. From the results it was evident that the experts outperformed the non-experts on both tasks, and that the differences in performance were statistically significant.
Next, we investigated the effect of software environments on data annotation to determine the benefits of using tailor-made software as opposed to general-purpose or domain-specific software. The comparison showed that, for these two specific projects, it was beneficial in terms of time and quality to use tailor-made software rather than domain-specific or general-purpose software. However, in the context of linguistic annotation of data for resource-scarce languages, the additional time needed to develop tailor-made software is not justified by the savings in annotation time. Finally, we compared systems trained with data of varying levels of quality and quantity, to determine the impact of quality versus quantity on the performance of systems. When comparing systems trained with gold standard data to systems trained with more data containing a low level of errors, the systems trained with the erroneous data were statistically significantly better. Thus, we conclude that it is more beneficial to focus on the quantity rather than on the quality of training data. Based on the results and analyses of the experiments, we offer some recommendations regarding which of the methods should be implemented in practice. For a project aiming to develop gold standard data, the highest quality annotations can be obtained by using experts to double-blind annotate data in tailor-made software (if provided for in the budget or if the development time can be justified by the savings in annotation time). For a project that aims to develop a core technology, experts or trained novices should be used to single-annotate data in tailor-made software (if provided for in the budget or if the development time can be justified by the savings in annotation time). [en_US]
dc.language.iso: en [en_US]
dc.subject: Afrikaans [en_US]
dc.subject: Automatic speech recognition [en_US]
dc.subject: Lemmatisation [en_US]
dc.subject: Resource-scarce languages [en_US]
dc.subject: Human language technology [en_US]
dc.subject: Resource development [en_US]
dc.title: Efficient development of human language technology resources for resource-scarce languages [en]
dc.type: Thesis [en_US]
dc.description.thesistype: Doctoral [en_US]
dc.contributor.researchID: 10215484 - Van Huyssteen, Gerhardus Beukes (Supervisor)


