
dc.contributor.advisor: Helberg, Albertus S.J.
dc.contributor.advisor: Van Huyssteen, Gerhard B
dc.contributor.advisor: Van den Bosch, A.
dc.contributor.author: Groenewald, Hendrik Johannes
dc.date.accessioned: 2008-11-28T11:35:23Z
dc.date.available: 2008-11-28T11:35:23Z
dc.date.issued: 2006
dc.identifier.uri: http://hdl.handle.net/10394/131
dc.description: Thesis (M.Ing. (Computer and Electronical Engineering))--North-West University, Potchefstroom Campus, 2007.
dc.description.abstract: A lemmatiser is an important component of various human language technology applications for any language. At present, a rule-based lemmatiser for Afrikaans already exists, but this lemmatiser produces disappointingly low accuracy figures. The performance of the current lemmatiser serves as motivation for developing another lemmatiser based on an approach other than language-specific rules. The alternative method of lemmatiser construction investigated in this study is memory-based learning. Thus, in this research project we develop an automatic lemmatiser for Afrikaans called Lia ("Lemma-identifiseerder vir Afrikaans", 'Lemmatiser for Afrikaans'). In order to construct Lia, the following research objectives are set: i) to define the classes for Afrikaans lemmatisation, ii) to determine the influence of data size and various feature options on the performance of Lia, iii) to automatically determine the algorithm and parameter settings that deliver the best performance in terms of linguistic accuracy, execution time and memory usage. In order to achieve the first objective, we investigate the processes of inflection and derivation in Afrikaans, since automatic lemmatisation requires a clear distinction between inflection and derivation. We proceed to define the inflectional categories for Afrikaans, which represent a number of affixes that should be removed from word-forms during lemmatisation. The classes for automatic lemmatisation in Afrikaans are derived from these affixes. It is subsequently shown that accuracy, memory usage and execution time all increase as the amount of training data is increased, and that the various feature options have a significant effect on the performance of Lia.
The algorithmic parameters and data representation that deliver the best results are determined by the use of PSearch, a programme that implements Wrapped Progressive Sampling in order to determine a set of possibly optimal algorithmic parameters for each of the TiMBL classification algorithms. Evaluation indicates that an accuracy figure of 92,8% is obtained when training Lia with the best performing parameters for the IB1 algorithm on feature-aligned data with 20 features. This result indicates that memory-based learning is indeed more suitable than rule-based methods for Afrikaans lemmatiser construction.
dc.publisher: North-West University
dc.subject: Lemmatisation [en]
dc.subject: Machine learning [en]
dc.subject: Memory-based learning [en]
dc.subject: Human language technology [en]
dc.subject: Natural language processing [en]
dc.subject: Computer engineering [en]
dc.subject: TIMBL [en]
dc.subject: Afrikaans [en]
dc.subject: Morphology [en]
dc.title: Automatic lemmatisation for Afrikaans [en]
dc.type: Thesis [en]
dc.description.thesistype: Masters

