Evaluation of the performance of a machine learning lemmatiser for isiXhosa
Human language resources (HLRs) and applications currently available in South Africa are of a very basic nature, with lemmatisation being one of the basics. South African languages, except for English, are considered underdeveloped when it comes to HLRs. The work detailed in this thesis is the development of a lemmatiser for one such language, namely isiXhosa. The previous benchmark in isiXhosa lemmatisation, which achieved 79.28%, was a rule-based lemmatiser implemented for the development of isiXhosa lemmatisation data. That data was used in this study. IsiXhosa is one of the official South African languages in the Bantu language family that are classified as "resource-scarce languages". With 8.1 million mother-tongue speakers, it is the second-largest language in South Africa, after isiZulu. IsiXhosa is closely related to languages such as isiZulu, Siswati and isiNdebele, and the work done on it could readily be bootstrapped to these languages. A lexicalised probabilistic graphical lemmatiser, the IsiXhosa Graphical Lemmatiser (XGL), was investigated, designed, implemented and evaluated against two benchmark lemmatisers, the CST Lemmatiser and the LemmaGen lemmatiser. The investigation towards the XGL involved five objectives. The first objective was to establish good characteristics for an automatic lemmatiser for morphologically complex languages. This was achieved by reviewing existing research on the lemmatisation of morphologically complex languages. To establish the most appropriate lemmas for isiXhosa in the context of natural language processing, the morphology of isiXhosa was studied and appropriate lemmas for each word category were identified. The objective of establishing good data features for an isiXhosa lemmatiser was met by exploring the training data. The objective of designing an isiXhosa lemmatisation model was realised through the implementation of XGL.
The last objective, the evaluation of an isiXhosa lemmatisation model, was achieved by training and testing XGL and comparing it to the two benchmark lemmatisers, the CST Lemmatiser and the LemmaGen lemmatiser. XGL achieved the highest accuracy of the three lemmatisers, with an accuracy rate of 83.19%.
- Engineering