Efficient development of human language technology resources for resource-scarce languages

Puttkammer, Martin Johannes

dc.contributor.advisor	Van Huyssteen, Gerhard B
dc.contributor.advisor	Barnard, E.
dc.contributor.author	Puttkammer, Martin Johannes
dc.date.accessioned	2015-02-20T13:31:25Z
dc.date.available	2015-02-20T13:31:25Z
dc.date.issued	2014
dc.identifier.uri	http://hdl.handle.net/10394/13408
dc.description	PhD (Linguistics and Literary Theory), North-West University, Potchefstroom Campus, 2014	en_US
dc.description.abstract	The development of linguistic data, especially annotated corpora, is imperative for the human language technology enablement of any language. The annotation process is, however, often time-consuming and expensive. As such, various projects make use of several strategies to expedite the development of human language technology resources. For resource-scarce languages – those with limited resources, finances and expertise – the efficiency of these strategies has not been conclusively established. This study investigates the efficiency of some of these strategies in the development of resources for resource-scarce languages, in order to provide recommendations for future projects facing decisions regarding which strategies they should implement. For all experiments, Afrikaans is used as an example of a resource-scarce language. Two tasks, viz. lemmatisation of text data and orthographic transcription of audio data, are evaluated in terms of quality and in terms of the time required to perform the task. The study focuses on three areas: the skill level of the annotators, software environments that aim to improve annotation quality and reduce the time needed to perform annotations, and whether it is more beneficial to annotate more data or to increase the quality of the data. We outline and conduct systematic experiments on each of the three focus areas in order to determine the efficiency of each. First, we investigated the influence of a respondent’s skill level on data annotation by using untrained, sourced respondents for annotation of linguistic data for Afrikaans. We compared data annotated by experts, novices and laymen. From the results it was evident that the experts outperformed the non-experts on both tasks, and that the differences in performance were statistically significant. 
Next, we investigated the effect of software environments on data annotation to determine the benefits of using tailor-made software as opposed to general-purpose or domain-specific software. The comparison showed that, for these two specific projects, it was beneficial in terms of time and quality to use tailor-made software rather than domain-specific or general-purpose software. However, in the context of linguistic annotation of data for resource-scarce languages, the additional time needed to develop tailor-made software is not justified by the savings in annotation time. Finally, we compared systems trained with data of varying levels of quality and quantity, to determine the impact of quality versus quantity on the performance of systems. When comparing systems trained with gold standard data to systems trained with more data containing a low level of errors, the systems trained with the erroneous data were statistically significantly better. Thus, we conclude that it is more beneficial to focus on the quantity rather than on the quality of training data. Based on the results and analyses of the experiments, we offer some recommendations regarding which of the methods should be implemented in practice. For a project aiming to develop gold standard data, the highest quality annotations can be obtained by using experts to double-blind annotate data in tailor-made software (if provided for in the budget or if the development time can be justified by the savings in annotation time). For a project that aims to develop a core technology, experts or trained novices should be used to single-annotate data in tailor-made software (if provided for in the budget or if the development time can be justified by the savings in annotation time).	en_US
dc.language.iso	en	en_US
dc.subject	Afrikaans	en_US
dc.subject	Automatic speech recognition	en_US
dc.subject	Lemmatisation	en_US
dc.subject	Resource-scarce languages	en_US
dc.subject	Human language technology	en_US
dc.subject	Resource development	en_US
dc.title	Efficient development of human language technology resources for resource-scarce languages	en
dc.type	Thesis	en_US
dc.description.thesistype	Doctoral	en_US
dc.contributor.researchID	10215484 - Van Huyssteen, Gerhardus Beukes (Supervisor)

Files in this item

Name: Puttkammer_MJ.pdf
Size: 3.236Mb
Format: PDF

This item appears in the following Collection(s)

Humanities
