    Morphological segmentation of isiXhosa using unsupervised machine learning

    View/Open
    Mzamo L.pdf (3.606Mb)
    Date
    2021
    Author
    Mzamo, L.
    Abstract
    This work advances the use of unsupervised machine learning in the morphological segmentation of Nguni languages, evaluated on isiXhosa. It researches, extends, implements and evaluates unsupervised machine learning techniques in the morphological segmentation of Nguni languages, specifically isiXhosa. IsiXhosa, one of the eleven official South African languages, is a Nguni language with 9.7 million mother-tongue speakers (17% of the South African population). It is an agglutinating, synthetic language characterised by highly inflected words. Its nouns are characterised by a class system based on their prefixes. Its verbs are characterised by a concatenation of numerous prefixes and suffixes, and by their concordial agreement with the subject noun or object noun. Two segmenters were developed using state-of-the-art techniques in morphological segmentation: the IsiXhosa Branching Entropy Segmenter (XBES), using branching entropy; and the IsiXhosa Heuristic Maximum Likelihood Segmenter (XHMLS), using probabilistic language models. XBES matched the benchmark (Morfessor-Baseline) on accuracy, with 77.4 ± 0.32% against 77.2 ± 0.10%, and outperformed it on F1 score, with 58.0 ± 0.10% against 48.9 ± 0.75%. The accuracy result was achieved with the z-score normalised variation of branching entropy (NⴭVBE) mode trained on 11-grams from a 1.5 million word corpus. The F1 score was achieved with the unnormalised variation of branching entropy (VBE) mode trained on 9-grams from a 1.5 million word corpus. The study established that the character-level n-gram length that encapsulates the predictability of isiXhosa words is ten (10), which corresponds to using 9-grams. XHMLS achieved an accuracy of 75.2 ± 0.18%, which did not meet the accuracy of the benchmark, but it outperformed both XBES and Morfessor-Baseline on F1 score with 59.4 ± 0.20%. The best-performing mode of XHMLS was the independent affixes (XHMLS-IA) language model, confirming that morphemic dependencies in isiXhosa are stronger within the affixes and weaker among the affixes, and that the roots in the language permeate many of the word categories.
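To make the branching entropy idea in the abstract concrete, here is a minimal Python sketch. It is not the XBES implementation from the thesis. It assumes the standard definition of branching entropy, H(c) = -Σ P(x | c) log2 P(x | c) over the characters x that follow a context c, and it proposes a morpheme boundary wherever that entropy rises from one position to the next, which is one simple reading of "variation of branching entropy". The function names, the `^` start-of-word marker, the `threshold` parameter and the four-word toy list are all hypothetical. The sketch omits z-score normalisation and any backward (right-to-left) model, and it uses a much shorter context than the 9-grams and 11-grams reported above.

```python
import math
from collections import Counter, defaultdict


def train_successor_counts(words, order):
    """Count which character follows each context of 1..order characters."""
    counts = defaultdict(Counter)
    for word in words:
        padded = "^" + word  # "^" is a hypothetical start-of-word marker
        for i in range(1, len(padded)):
            for k in range(1, order + 1):
                if i - k < 0:
                    break
                counts[padded[i - k:i]][padded[i]] += 1
    return counts


def branching_entropy(counts, context):
    """Entropy (in bits) of the next character given `context`."""
    successors = counts.get(context)
    if not successors:
        return 0.0  # unseen context: no back-off in this sketch
    total = sum(successors.values())
    return -sum((c / total) * math.log2(c / total) for c in successors.values())


def segment(word, counts, order, threshold=0.0):
    """Split `word` wherever branching entropy rises by more than `threshold`."""
    padded = "^" + word
    # h[p] is the entropy of the character that follows word[:p],
    # conditioned on at most `order` preceding characters.
    h = [
        branching_entropy(counts, padded[max(0, p + 1 - order):p + 1])
        for p in range(len(word))
    ]
    pieces, start = [], 0
    for p in range(1, len(word)):
        if h[p] - h[p - 1] > threshold:
            pieces.append(word[start:p])
            start = p
    pieces.append(word[start:])
    return pieces


# Toy word list for illustration only; it is not the thesis corpus.
toy_words = ["ndihamba", "ndibona", "ndifunda", "ndilala"]
toy_counts = train_successor_counts(toy_words, order=3)
print(segment("ndihamba", toy_counts, order=3))  # ['ndi', 'hamba']
```

In the toy list every word starts with `ndi` and then diverges, so the next character is fully predictable inside `ndi` (entropy 0) and maximally uncertain right after it (entropy 2 bits over four equally likely successors). That jump is what triggers the single boundary. On a real corpus the threshold and the context length would need tuning, and unseen contexts would need a back-off strategy.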
    URI
    https://orcid.org/0000-0002-8867-7416
    http://hdl.handle.net/10394/38406
    Collections
    • Engineering [1424]

    Copyright © North-West University