Automatic lemmatisation for Afrikaans

Groenewald, Hendrik Johannes

Date

2006

Abstract

A lemmatiser is an important component of various human language technology applications for any language. A rule-based lemmatiser for Afrikaans already exists, but it produces disappointingly low accuracy figures. The performance of the current lemmatiser serves as motivation for developing another lemmatiser based on an approach other than language-specific rules. The alternative method of lemmatiser construction investigated in this study is memory-based learning. Thus, in this research project we develop an automatic lemmatiser for Afrikaans called Lia, "Lemma-identifiseerder vir Afrikaans" ('Lemmatiser for Afrikaans'). In order to construct Lia, the following research objectives are set: i) to define the classes for Afrikaans lemmatisation, ii) to determine the influence of data size and various feature options on the performance of Lia, and iii) to automatically determine the algorithm and parameter settings that deliver the best performance in terms of linguistic accuracy, execution time and memory usage. In order to achieve the first objective, we investigate the processes of inflection and derivation in Afrikaans, since automatic lemmatisation requires a clear distinction between inflection and derivation. We proceed to define the inflectional categories for Afrikaans, which represent a number of affixes that should be removed from word-forms during lemmatisation. The classes for automatic lemmatisation in Afrikaans are derived from these affixes. It is subsequently shown that accuracy, as well as memory usage and execution time, increases as the amount of training data is increased, and that the various feature options have a significant effect on the performance of Lia.
The algorithmic parameters and data representation that deliver the best results are determined by the use of PSearch, a programme that implements Wrapped Progressive Sampling in order to determine a set of possibly optimal algorithmic parameters for each of the TiMBL classification algorithms. Evaluation indicates that an accuracy figure of 92,8% is obtained when training Lia with the best-performing parameters for the IB1 algorithm on feature-aligned data with 20 features. This result indicates that memory-based learning is indeed more suitable than rule-based methods for Afrikaans lemmatiser construction.
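
The general idea of memory-based lemmatisation described in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation and does not use TiMBL: it is a plain nearest-neighbour classifier with an unweighted overlap metric, offered as a simplified stand-in for the IB1 approach. The right-aligned layout of the 20 character features, the class labels ("suffix:e", "suffix:s", "none"), the function names and the six training words are all assumptions made for illustration.

```python
from collections import Counter

N_FEATURES = 20  # the abstract reports 20 features; this right-aligned layout is assumed
PAD = "_"        # hypothetical padding symbol for unused feature positions


def features(word, n=N_FEATURES):
    """Right-align the word's characters in a fixed-width window, padding on the left."""
    chars = list(word[-n:])
    return [PAD] * (n - len(chars)) + chars


def overlap_distance(a, b):
    """Count the positions at which two feature vectors differ (unweighted overlap)."""
    return sum(x != y for x, y in zip(a, b))


def classify(word, memory, k=1):
    """Return the majority class among the k stored instances nearest to the word."""
    target = features(word)
    ranked = sorted(memory, key=lambda inst: overlap_distance(target, features(inst[0])))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def apply_class(word, label):
    """Strip the affix named by a class label; 'none' leaves the word unchanged."""
    kind, _, affix = label.partition(":")
    if kind == "suffix" and word.endswith(affix):
        return word[: len(word) - len(affix)]
    if kind == "prefix" and word.startswith(affix):
        return word[len(affix):]
    return word


# Toy instance base (assumed examples, not data from the thesis).
TRAINING = [
    ("honde", "suffix:e"),
    ("boeke", "suffix:e"),
    ("tafels", "suffix:s"),
    ("appels", "suffix:s"),
    ("huis", "none"),
    ("werk", "none"),
]


def lemmatise(word, memory=TRAINING):
    """Predict a class for the word from stored instances, then apply it."""
    return apply_class(word, classify(word, memory))


if __name__ == "__main__":
    for w in ["perde", "lepels", "boek"]:
        print(w, "->", lemmatise(w))
```

The lemmatiser never consults hand-written rules: an unseen word-form inherits the class of its most similar stored instance, so coverage grows with the amount of training data rather than with the number of rules. Real systems such as TiMBL additionally weight features and tune parameters such as k, which this sketch leaves out.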

URI

http://hdl.handle.net/10394/131

Collections

Engineering