    Evaluation of the performance of a machine learning lemmatiser for isiXhosa

    Mzamo_L_2015.pdf (2.018Mb)
    Date
    2015
    Author
    Mzamo, Lulamile
    Abstract
    Human language resources (HLRs) and applications currently available in South Africa are of a very basic nature, and lemmatisation is among the most basic of them. South African languages, except for English, are considered underdeveloped when it comes to HLRs. This thesis details the development of a lemmatiser for one such language, isiXhosa. The previous benchmark in isiXhosa lemmatisation, which achieved 79.28%, was a rule-based lemmatiser implemented for the development of isiXhosa lemmatisation data; that data was used in this study. IsiXhosa is one of the official South African languages of the Bantu language family that are classified as "resource scarce languages". With 8.1 million mother-tongue speakers, it is the second largest language in South Africa, after isiZulu. IsiXhosa is closely related to languages such as isiZulu, Siswati and isiNdebele, so work done on it could readily be bootstrapped to these languages. A lexicalised probabilistic graphical lemmatiser, the IsiXhosa Graphical Lemmatiser (XGL), was investigated, designed, implemented and evaluated against two benchmark lemmatisers, the CST Lemmatiser and the LemmaGen lemmatiser. The investigation towards the XGL involved five objectives. The first objective was to establish good characteristics for an automatic lemmatiser for morphologically complex languages. This was achieved by reviewing existing research on the lemmatisation of morphologically complex languages. To establish the most appropriate lemmas for isiXhosa in the context of natural language processing, the morphology of isiXhosa was studied and appropriate lemmas for each word category were identified. Exploring the training data addressed the objective of establishing good data features for an isiXhosa lemmatiser. The objective of designing an isiXhosa lemmatisation model was realised through the implementation of XGL.
    The last objective, the evaluation of an isiXhosa lemmatisation model, was achieved by training and testing XGL and comparing it to the two benchmark lemmatisers, the CST Lemmatiser and the LemmaGen lemmatiser. XGL achieved a higher accuracy than the selected benchmark lemmatisers, with an accuracy rate of 83.19%.
    URI
    http://hdl.handle.net/10394/20448
    Collections
    • Engineering [1424]

    Copyright © North-West University