Automatic lemmatisation for Afrikaans

Groenewald, Hendrik Johannes

Files

groenewald_hendrikj.pdf (4.11 MB)

Date

2006

Authors

Groenewald, Hendrik Johannes

Supervisors

Helberg, Albertus S.J.
Van Huyssteen, Gerhard B.
Van den Bosch, A.

Publisher

North-West University

Abstract

A lemmatiser is an important component of various human language technology applications for any language. A rule-based lemmatiser for Afrikaans already exists, but it produces disappointingly low accuracy figures. The performance of the current lemmatiser serves as motivation for developing another lemmatiser based on an approach other than language-specific rules. The alternative method of lemmatiser construction investigated in this study is memory-based learning. Thus, in this research project we develop an automatic lemmatiser for Afrikaans called Lia (Lemma-identifiseerder vir Afrikaans, 'Lemmatiser for Afrikaans'). In order to construct Lia, the following research objectives are set: i) to define the classes for Afrikaans lemmatisation, ii) to determine the influence of data size and various feature options on the performance of Lia, and iii) to automatically determine the algorithm and parameter settings that deliver the best performance in terms of linguistic accuracy, execution time and memory usage. In order to achieve the first objective, we investigate the processes of inflection and derivation in Afrikaans, since automatic lemmatisation requires a clear distinction between inflection and derivation. We proceed to define the inflectional categories for Afrikaans, which represent a number of affixes that should be removed from word-forms during lemmatisation. The classes for automatic lemmatisation in Afrikaans are derived from these affixes. It is subsequently shown that accuracy, as well as memory usage and execution time, increases as the amount of training data is increased, and that the various feature options have a significant effect on the performance of Lia.
The algorithmic parameters and data representation that deliver the best results are determined by the use of PSearch, a programme that implements Wrapped Progressive Sampling in order to determine a set of possibly optimal algorithmic parameters for each of the TiMBL classification algorithms. Evaluation indicates that an accuracy figure of 92.8% is obtained when training Lia with the best-performing parameters for the IB1 algorithm on feature-aligned data with 20 features. This result indicates that memory-based learning is indeed more suitable than rule-based methods for Afrikaans lemmatiser construction.
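
As a rough illustration of the memory-based approach described in the abstract, the following Python sketch stores training word-forms as fixed-width character features and lemmatises a new word-form by copying the class of its nearest stored neighbour. It is a sketch under stated assumptions, not the thesis's implementation: it does not call TiMBL, the right-aligned padding scheme, the unweighted overlap distance, the (prefix, suffix) class encoding and the toy word list are all assumptions made for illustration, and only the figure of 20 features is taken from the abstract.

```python
from collections import Counter

N_FEATURES = 20  # feature count reported in the abstract
PAD = "_"        # padding symbol (assumption)


def features(word, n=N_FEATURES):
    """Right-align the last n characters of a word-form in a window of n slots."""
    chars = list(word[-n:])
    return [PAD] * (n - len(chars)) + chars


def derive_class(word, lemma):
    """Hypothetical class encoding: the (prefix, suffix) pair to strip.

    Only handles cases where the lemma occurs inside the word-form.
    """
    idx = word.find(lemma)
    if idx < 0:
        raise ValueError(f"{lemma!r} is not contained in {word!r}")
    return word[:idx], word[idx + len(lemma):]


def train(pairs):
    """Memory-based 'training': store every instance unchanged."""
    return [(features(w), derive_class(w, l)) for w, l in pairs]


def overlap_distance(a, b):
    """Count the feature positions at which two instances differ."""
    return sum(x != y for x, y in zip(a, b))


def classify(word, memory):
    """Majority class among the stored instances at the smallest distance."""
    target = features(word)
    scored = [(overlap_distance(target, f), c) for f, c in memory]
    best = min(d for d, _ in scored)
    votes = Counter(c for d, c in scored if d == best)
    return votes.most_common(1)[0][0]


def lemmatise(word, memory):
    """Strip the predicted affixes if they are actually present."""
    prefix, suffix = classify(word, memory)
    if prefix and word.startswith(prefix):
        word = word[len(prefix):]
    if suffix and word.endswith(suffix):
        word = word[:-len(suffix)]
    return word


# Toy (word-form, lemma) pairs, invented for this example.
TOY_PAIRS = [
    ("gewerk", "werk"), ("gespeel", "speel"), ("boeke", "boek"),
    ("huise", "huis"), ("honde", "hond"), ("werk", "werk"),
    ("boek", "boek"), ("mooiste", "mooi"), ("grootste", "groot"),
]
memory = train(TOY_PAIRS)

if __name__ == "__main__":
    for form in ("gewerk", "stoele"):
        print(form, "->", lemmatise(form, memory))
```

Because every training instance is kept in memory and compared against at classification time, a sketch like this also makes it plausible why memory usage and execution time grow with the amount of training data, as the abstract reports.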

Description

Thesis (M.Ing. (Computer and Electronic Engineering))--North-West University, Potchefstroom Campus, 2007.

Keywords

Lemmatisation, Machine learning, Memory-based learning, Human language technology, Natural language processing, Computer engineering, TiMBL, Afrikaans, Morphology

URI

http://hdl.handle.net/10394/131

Collections

Engineering
