    Automatic lemmatisation for Afrikaans

    View/Open
    groenewald_hendrikj.pdf (4.108Mb)
    Date
    2006
    Author
    Groenewald, Hendrik Johannes
    Abstract
    A lemmatiser is an important component of various human language technology applications for any language. A rule-based lemmatiser for Afrikaans already exists, but it produces disappointingly low accuracy figures. Its performance serves as motivation for developing another lemmatiser based on an approach other than language-specific rules. The alternative method of lemmatiser construction investigated in this study is memory-based learning. Thus, in this research project we develop an automatic lemmatiser for Afrikaans called Lia ("Lemma-identifiseerder vir Afrikaans", 'Lemmatiser for Afrikaans'). In order to construct Lia, the following research objectives are set: i) to define the classes for Afrikaans lemmatisation, ii) to determine the influence of data size and various feature options on the performance of Lia, and iii) to automatically determine the algorithm and parameter settings that deliver the best performance in terms of linguistic accuracy, execution time and memory usage. In order to achieve the first objective, we investigate the processes of inflection and derivation in Afrikaans, since automatic lemmatisation requires a clear distinction between inflection and derivation. We proceed to define the inflectional categories for Afrikaans, which represent a number of affixes that should be removed from word-forms during lemmatisation. The classes for automatic lemmatisation in Afrikaans are derived from these affixes. It is subsequently shown that accuracy, as well as memory usage and execution time, increases as the amount of training data is increased, and that the various feature options have a significant effect on the performance of Lia. The algorithmic parameters and data representation that deliver the best results are determined by the use of PSearch, a programme that implements Wrapped Progressive Sampling in order to determine a set of possibly optimal algorithmic parameters for each of the TiMBL classification algorithms. Evaluation indicates that an accuracy figure of 92,8% is obtained when training Lia with the best performing parameters for the IB1 algorithm on feature-aligned data with 20 features. This result indicates that memory-based learning is indeed more suitable than rule-based methods for Afrikaans lemmatiser construction.
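The memory-based approach summarised in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation and does not use TiMBL: it assumes a right-aligned, left-padded encoding of each word-form into 20 character features, a hypothetical class label of the form (suffix to strip, string to add), and a simplified nearest-neighbour lookup with an overlap distance in the spirit of IB1. All function and class names, the padding symbol and the toy training pairs are assumptions made for illustration only.

```python
from collections import Counter

N_FEATURES = 20  # the abstract reports 20 features; this alignment scheme is an assumption
PAD = "_"        # hypothetical padding symbol


def encode(word, n=N_FEATURES):
    """Right-align the last n characters of the word in n slots, padding on the left."""
    chars = list(word[-n:])
    return tuple([PAD] * (n - len(chars)) + chars)


def make_class(word, lemma):
    """Hypothetical class label: (suffix to strip from the word, string to add back)."""
    i = 0
    while i < min(len(word), len(lemma)) and word[i] == lemma[i]:
        i += 1
    return (word[i:], lemma[i:])


def apply_class(word, label):
    """Apply a (strip, add) label; leave the word unchanged if the label does not fit."""
    strip, add = label
    if strip and not word.endswith(strip):
        return word
    stem = word[: len(word) - len(strip)] if strip else word
    return stem + add


def overlap_distance(a, b):
    """Count the feature positions at which two instances differ."""
    return sum(x != y for x, y in zip(a, b))


class MemoryBasedLemmatiser:
    """Stores every training instance and classifies by its k nearest neighbours."""

    def __init__(self, k=1):
        self.k = k
        self.memory = []  # list of (feature tuple, class label)

    def train(self, pairs):
        for word, lemma in pairs:
            self.memory.append((encode(word), make_class(word, lemma)))

    def lemmatise(self, word):
        if not self.memory:
            return word
        target = encode(word)
        ranked = sorted(self.memory, key=lambda item: overlap_distance(target, item[0]))
        votes = Counter(label for _, label in ranked[: self.k])
        return apply_class(word, votes.most_common(1)[0][0])


# Toy training data (illustrative only, not taken from the thesis)
TRAINING = [
    ("honde", "hond"), ("boeke", "boek"), ("huise", "huis"),
    ("katte", "kat"), ("tafels", "tafel"), ("stoele", "stoel"),
]

lia_sketch = MemoryBasedLemmatiser(k=1)
lia_sketch.train(TRAINING)
print(lia_sketch.lemmatise("perde"))  # prints "perd"
```

Because every training instance is kept in memory and compared against each query, both memory usage and lookup time in this sketch grow with the size of the training data, which is consistent with the trend the abstract reports.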
    URI
    http://hdl.handle.net/10394/131
    Collections
    • Engineering [1424]

    Copyright © North-West University