PLAGIARISM DECLARATION

I ______________________________________________________________________________ (full name and surname and student number) hereby declare that this assignment / paper / project / portfolio is my own work. I further declare that:

1. the text and bibliography reflect the sources I have consulted, and
2. where I have made reproductions of any literary or graphic work(s) from someone else, I have obtained the necessary prior written approval of the relevant author(s)/publisher(s)/creator(s) of such works and/or, where applicable, from the Dramatic, Artistic and Literary Rights Organisation (DALRO).
3. sections with no source referrals are my own ideas, arguments and/or conclusions.

Signature: ____________________________
Student number: ____________________________
Date: __________________________

Sthembiso Nkosentsha Mkhwanazi
19 March 2025

Acknowledgements

I begin by thanking God for His grace in my life, which has enabled me not only to will but also to do this work (Philippians 2:13). Throughout it all, I remain deeply grateful for the incredible people I’ve had the privilege to meet, serendipitously, many of whom have contributed significantly to this study in ways that words may never fully capture. I will, however, attempt to acknowledge a few by name.

First and foremost, I wish to express my sincere gratitude to Dr Laurette Marais, my main supervisor, for her support throughout this journey. Her guidance, constructive feedback, and steady encouragement have been invaluable in the development and completion of this dissertation, which I’m glad to present after a long and challenging process.

My heartfelt thanks also go to Prof. Roelien Goede, who kindly assumed the role of co-supervisor for this study. Her expert advice, supportive feedback, and readiness to assist despite her demanding schedule have been invaluable throughout this journey.
I am deeply grateful to my family for their enduring love and unwavering support, especially during moments of doubt, even when they did not fully understand the “why” behind this path. To my mother and my siblings Xolile, Sandile, and Akhona, I just wanna say, “ngiyabonga ngakho konke”.

To my colleagues at the Council for Scientific and Industrial Research (CSIR), thank you for your collaboration, encouragement, and shared pursuit of ‘Touching lives through innovation’, and for being EPIC. Your support has made this process even more meaningful.

I also wish to acknowledge the intellectually stimulating environments of the seminars where this work was preliminarily presented, including DHASA, AI Expo Africa, IndabaX South Africa, and Hundzula Retreat. These platforms not only allowed me to share my work but also provided valuable discussions and feedback that refined this study.

Finally, I extend my deepest gratitude to the CSIR as an organisation for financially sponsoring this study, and to the National Integrated Cyber Infrastructure System (NICIS) and its Centre for High Performance Computing (CHPC) for providing the infrastructure that enabled the technical execution of this research.

Abstract

IsiZulu, one of South Africa’s most widely spoken languages, is classified as a low-resource language, especially regarding digital tools. As part of the Nguni family, isiZulu exhibits complex morphology and conjunctive orthography. These features result in data sparsity, as a single root or stem may appear in numerous morphological variants, complicating language modelling. This underscores the importance of morphological segmentation, a natural language processing (NLP) task that decomposes words into their smallest meaningful units (morphemes). Rule-based methods yield high accuracy in low-resource contexts but typically lack robustness and are costly to develop.
Conversely, machine learning approaches require large, high-quality datasets, often unavailable for low-resource languages. To address these challenges, this study employs a hybrid approach using a rule-based system, the isiZulu Resource Grammar (ZRG), to generate synthetic datasets with varying segmentation granularities. These datasets underwent data augmentation through syntactic tree manipulation, significantly increasing their size and diversity. Subsequently, these data were used to train supervised machine learning models for morphological segmentation: Conditional Random Fields (CRF), Long Short-Term Memory (LSTM), and Transformer-based models. The effectiveness of these models was assessed intrinsically, using precision, recall, F1 score, BLEU, and chrF, and extrinsically, by evaluating their impact on Neural Machine Translation (NMT) quality for isiZulu-English translation. Intrinsic evaluation showed that the Transformer model consistently outperformed the CRF and LSTM models, achieving segmentation accuracy above 0.9 across all metrics and granularity styles. Additionally, the hybrid approach demonstrated superior robustness, effectively handling out-of-vocabulary (OOV) words and performing segmentation 30 times faster than ZRG alone. Extrinsic evaluation confirmed that segmentation improved translation quality, with Segmenter Two achieving the highest BLEU score (0.235), representing a 25.0% improvement over the unsegmented baseline (0.188). These findings highlight the effectiveness of integrating rule-based and machine learning approaches for morphological segmentation, offering a scalable solution for processing low-resource languages with complex morphologies such as isiZulu in NLP applications.

Keywords: Agglutinative Languages; Morphological Segmentation; isiZulu; Supervised Segmenter Learning; Rule-Based Segmenter.

Contents

1 Introduction
1.1 Introduction to the Study
1.2 Background and Contextualisation
1.3 Problem Statement
1.4 Research Questions, Objectives and Hypotheses
1.4.1 Research Questions
1.4.2 Research Objectives
1.5 Research Methodology
1.5.1 Hypotheses
1.6 Ethical Considerations
1.6.1 Data Collection and Use
1.6.2 Privacy and Cultural Sensitivity
1.6.3 Transparency and Reproducibility
1.7 Dissertation Structure
1.8 Chapter Summary

2 Research Methodology
2.1 Introduction
2.2 The Concept of Research
2.3 Research Paradigms
2.3.1 Ontological Assumptions
2.3.2 Epistemological Assumptions
2.3.3 Axiological Assumptions
2.3.4 Methodology
2.4 Research Philosophy
2.4.1 Positivism
2.4.2 Interpretivism
2.4.3 Pragmatism
2.4.4 Critical Social Theory
2.4.5 Critical Realism
2.5 Positioning the Present Study
2.6 Hypothetico-Deductive Methodology
2.6.1 Application in this Research
2.7 Chapter Summary

3 Morphological Segmentation
3.1 Introduction
3.2 Background and Linguistic Characteristics of isiZulu
3.2.1 Background and History of isiZulu
3.2.2 Linguistic Characteristics of isiZulu
3.2.3 Morphological Characteristics: Overview
3.3 Morphological Segmentation: Overview
3.3.1 Introduction to Morphological Segmentation
3.3.2 Canonical Segmentation
3.3.3 Surface Segmentation
3.4 Approaches to Morphological Segmentation
3.4.1 Rule-Based Approaches
3.4.2 Statistical Machine Learning Approaches
3.4.3 Deep Learning Approaches
3.5 Morphological Segmentation Metrics
3.5.1 Intrinsic evaluation
3.5.2 Extrinsic Evaluation
3.6 Chapter Summary
4 Data Preparation
4.1 Ukwabelana Dataset
4.1.1 Dataset Composition
4.1.2 Development and Annotation Process
4.1.3 Limitations and Challenges
4.2 National Centre for Human Language Technology Text Corpora
4.2.1 Dataset Composition
4.2.2 Development and Annotation Process
4.2.3 Limitations and Challenges
4.3 Reflection on ZRG’s Limitations and Advantages
4.3.1 Limitations of the ZRG
4.3.2 Strengths of the ZRG
4.4 Data Acquisition
4.4.1 Data Pre-Processing
4.4.2 Data Organisation
4.5 Data Parsing
4.5.1 Dataset Preparation and Batching
4.5.2 Parallel Processing with Docker
4.5.3 Runtime Parsing and Output Management
4.5.4 Error Handling and Output Consolidation
4.5.5 Parsing Output Example
4.5.6 Processing Trees
4.5.7 Example of Application
4.6 Data Augmentation
4.6.1 Workflow for Data Augmentation
4.6.2 Augmentation Techniques
4.7 ZRG Linearisation
4.7.1 Different Linearisation Strategies
4.7.2 Application of Linearisation Strategies
4.8 Final Dataset Output
4.9 Chapter Summary

5 Design of Models
5.1 Introduction
5.2 Models Selection
5.2.1 Criteria for Model Selection
5.2.2 Models Considered
5.3 Models Design and Training
5.3.1 Transformer Models
5.3.2 Long Short-Term Memory Models
5.3.3 Conditional Random Field Models
5.4 Chapter Summary

6 Experiment Evaluation
6.1 Model Performance Analysis: Training and Validation
6.1.1 Training Loss
6.1.2 Validation Loss
6.2 Intrinsic Evaluation
6.2.1 Model Selection for Downstream Evaluation
6.2.2 Data Augmentation Impact
6.2.3 Investigating the Robustness of the Transformer Models
6.2.4 Investigating Efficiency of the Transformer Model
6.3 Extrinsic Evaluation
6.3.1 Neural Machine Translation Model
6.3.2 Neural Machine Learning Evaluation
6.4 Chapter Summary

7 Findings, Conclusion, and Future Research
7.1 Introduction
7.2 Summary of the Chapters
7.3 Findings Regarding Objectives and Research Questions
7.3.1 Summary of Achievements in Relation to the Objectives
7.4 Evaluating the Hypotheses
7.4.1 First Hypothesis: Performance of the Hybrid System
7.4.2 Second Hypothesis
7.5 Answering the Research Questions
7.5.1 Research Question 1
7.5.2 Research Question 2
7.6 Challenges and Limitations
7.6.1 Challenges
7.6.2 Limitations
7.7 Conclusion

A Appendix
A.1 Models Hyper-Parameters
A.1.1 Hyper-Parameters for Unaugmented Models
A.1.2 Hyper-Parameters for English-to-IsiZulu Translation Models

List of Figures

2.1 Summary of the research methodology.
3.1 Distribution of isiZulu Speakers Across South African Provinces, Based on Statistics 2022 Data.
3.2 Official South African Southern Bantu languages Guthrie group.
3.3 A simple diagram of the encoder-decoder.
4.1 Parse abstract syntax tree generated by ZRG with default segmentation.
4.2 The first found subtrees from the bigger tree.
4.3 The second found subtrees from the bigger tree.
4.4 The abstract syntax tree for “amantombazane amahle amathathu ahlala ekhaya nomama wabo”.
4.5 The abstract syntax tree for “ubaba kade enezimoto ezimbili ezimbi ezishayelwa abafana bakhe abadlala ibhola”.
4.6 The abstract syntax tree for “abafana bakabhuti banezinja ezinenkani”.
5.1 An adapted vanilla Transformer architecture.
6.1 Transformer: training loss over epochs for different segmenters.
6.2 LSTM: training loss over epochs for different segmenters.
6.3 Transformer: validation loss over epochs for different segmenters.
6.4 LSTM: validation loss over epochs for different segmenters.
6.5 Training loss across epochs: translation models with varying segmentation strategies.
6.6 Validation loss across epochs: translation models with varying segmentation strategies.
6.7 Validation loss across epochs for the Unsegmented model.

List of Tables

3.1 Distribution of languages spoken in South Africa.
3.2 Noun classes in isiZulu showing singular and plural examples.
3.3 Contingency table: confusion matrix.
4.1 Manipulation of terms and resulting sentences.
4.2 Comparison of linearisation strategies for the sentence “abafana bakabhuti banezinja ezinenkani”.
5.1 Summary of data splitting for training, validation, and testing.
5.2 Transformer model architecture parameters.
5.3 Hyper-parameter search space configuration.
5.4 Final model configurations for segmenters across different linearisation strategies.
5.5 LSTM model default configuration.
5.6 Hyper-parameter search space.
5.7 LSTM optimal hyper-parameters for Segmenters One, Two, and Three.
5.8 CRF model base parameters.
5.9 Grid search parameter ranges for GridSearch.
5.10 Optimal CRF parameters across different segmentation schemes.
6.1 Morphological segmentation examples: input words with target and Segmenter One predicted segmentations.
6.2 Model performance comparison across different segmentation approaches (BLEU and chrF scores).
6.3 Comparison of precision, recall, and F1 scores across different architectures and segmentation styles.
6.4 Comparison of key metrics scores between Transformer models trained on unaugmented data and augmented data.
6.5 Segmentation results from different segmenters.
6.6 Gold standard segmentations and failure analysis.
6.7 Evaluation metrics for different segmenters.
6.8 Execution times for segmenters using ZRG and Transformer models.
6.9 Optimal hyper-parameters for translation models across different segmentation approaches.
6.10 IsiZulu–English translation metrics for segmentation models (test set).
6.11 Translation quality metrics for different segmentation models in English-isiZulu translation.
6.12 Translation scores for segmentation models on the updated FLORES dataset (isiZulu–English).
6.13 Translation quality metrics scores for different segmentation styles using the updated FLORES dataset (English-isiZulu).
A.1 Model configurations for the unaugmented segmentation models.
A.2 Model configurations for English-to-IsiZulu translation with and without segmentation.

Table of abbreviations

A table containing the list of abbreviations used throughout the text.
ZRG     isiZulu Resource Grammar
NMT     Neural Machine Translation
NLLB    No Language Left Behind
FLORES  Few-shot Learning for Zero-shot Evaluation in Translation
CRF     Conditional Random Fields
LSTM    Long Short-Term Memory
GRU     Gated Recurrent Units
BLEU    Bilingual Evaluation Understudy
chrF    Character F-score
MEMM    Maximum Entropy Markov Model
OOV     Out-of-Vocabulary
POS     Part-of-Speech
GF      Grammatical Framework
AST     Abstract Syntax Tree
ML      Machine Learning
HPO     Hyper-Parameter Optimisation
GPU     Graphics Processing Unit
PGF     Portable Grammar Format
AI      Artificial Intelligence
NLP     Natural Language Processing
RNN     Recurrent Neural Network
FFN     Feed-Forward Network
Lin A   Fine-grained Segmentation Strategy
Lin B   Moderate Segmentation Strategy
Lin C   Coarse-grained Segmentation Strategy

Chapter 1

Introduction

1.1 Introduction to the Study

Morphological segmentation is a Natural Language Processing (NLP) task that has shown much potential to improve the automatic processing of morphologically complex languages such as isiZulu (Creutz, 2006; Creutz & Lagus, 2007; Mager et al., 2022; Tukeyev et al., 2020). Approaches to segmentation can be divided into two broad categories: rule-based systems and data-driven (machine learning) techniques, which can be further divided into unsupervised, supervised, and semi-supervised methods. While fully supervised methods typically achieve high performance, they rely on substantial amounts of annotated data, a resource that is often unavailable for low-resourced languages like isiZulu. As a result, researchers often have to rely on rule-based, unsupervised, or semi-supervised methods, which require fewer annotations. However, these methods present notable limitations, including lower accuracy, scalability challenges, and a heavy dependence on linguistic expertise.
These constraints hinder their integration into NLP pipelines that require high-precision segmentation, emphasising the need for more effective supervised approaches tailored to low-resource languages.

The present study aims to provide insights into the use of an existing rule-based system to generate synthetic data for investigating supervised morphological surface segmentation of isiZulu at different granularity levels. The goal is to leverage the strengths of both rule-based systems and machine learning approaches to overcome the weaknesses of each. Due to the lack of substantial annotated data for morphological segmentation, this study uses the isiZulu Resource Grammar (ZRG), a rule-based grammar developed using the Grammatical Framework (GF) programming language, to generate morphologically surface-segmented isiZulu text with three different segmentation styles. Additionally, the study explores data augmentation as a strategy to expand the dataset further. ZRG performs a deep morphosyntactic analysis of the text, which, while linguistically robust, is computationally slow. Furthermore, as a rule-based system, it is brittle when encountering words outside its predefined lexicon, limiting its generalisability.

To enhance segmentation beyond the limitations of rule-based methods, the research adopts a machine learning-based approach to morphological surface segmentation, leveraging supervised learning techniques. These methods are preferred due to their adaptability and superior performance, especially in real-world applications where flexibility and robustness are critical.

1.2 Background and Contextualisation

IsiZulu is one of South Africa’s 12 official languages. It is a member of the Nguni language group, which forms part of the larger Niger-Congo language family (Mesham et al., 2021).
Despite being widely spoken, isiZulu, along with other Nguni languages (isiXhosa, Siswati, and isiNdebele), is considered a low-resourced language. Indeed, isiZulu is the most widely spoken language in South Africa, with approximately 15 million native speakers, accounting for 24% of the country’s population (StatsSA, 2022b).

One of the notable characteristics of Nguni languages is their agglutinating morphology and conjunctive orthography, in contrast to the disjunctive orthography commonly associated with other South African languages, most notably the Sotho languages. While the Sotho languages share agglutinating morphology with isiZulu and the other Nguni languages, they use disjunctive orthography, where morphemes are written with spaces in between (Bosch & Pretorius, 2002; Taljard & Bosch, 2006).

IsiZulu is considered a low-resourced language due to limited access to linguistic resources such as machine-readable texts. This scarcity of resources has significantly hindered the development of the technological tools necessary for ensuring the long-term digital preservation and vitality of isiZulu and similar languages (Loubser & Puttkammer, 2020). This challenge is particularly pronounced for resource-scarce agglutinative languages, where a single root or stem can combine with numerous morphemes to generate hundreds of unique word forms. For Nguni languages like isiZulu, the use of conjunctive orthography further complicates this issue. The result is data sparsity, as even large corpora may fail to encompass the full range of root and stem variations. This data sparsity creates significant challenges for language modelling approaches that rely on token identities. Consequently, these challenges have motivated the present research into morphological segmentation as a potential solution.
Morphological segmentation refers to the process of decomposing an orthographic word (its written form) into its constituent morphemes, which are the smallest meaning-bearing units of language (Creutz & Lagus, 2007). This technique has been found to be useful in NLP applications such as machine translation (MT) (Mager et al., 2022; Tukeyev et al., 2020) and automatic speech recognition (ASR) (Creutz, 2006), particularly when working with low-resource languages that feature complex morphological structures. This entails addressing the combinatorial explosion inherent in agglutinative morphology, where modifications in morpheme attachment to a root or stem can completely alter a word’s meaning. Such complexity often results in corpora containing limited variations of stems, leading to language models that struggle with out-of-vocabulary (OOV) issues when morphological segmentation is not applied. By breaking words into smaller, meaningful units, morphological segmentation facilitates better generalisation and improves performance in language modelling tasks for morphologically rich, resource-scarce languages.

The effort towards morphological segmentation has been divided into two goals: canonical segmentation and surface segmentation (Cotterell et al., 2016a). Canonical segmentation entails segmenting a word into its underlying morphemes, with the morphemes represented in their canonical or lemmatised forms. The concatenation of these morphemes does not necessarily result in the orthographic word, whereas with surface segmentation, the word is divided into morpheme-based substrings (morphs) that can be combined to form the orthographic word (Cotterell et al., 2016a).

There are two common techniques used for morphological segmentation: rule-based and machine learning-based (ML) approaches (Anand Kumar et al., 2010; Goldsmith, 2001).
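The distinction between the two goals can be made concrete with a short sketch. The English word “achievability”, often used to illustrate canonical segmentation in the literature around Cotterell et al. (2016a), is taken here as an illustrative example; the defining property being checked is that surface morphs, unlike canonical morphemes, concatenate back to the orthographic word.

```python
word = "achievability"

# Surface segmentation: morphs are substrings of the orthographic word.
surface = ["achiev", "abil", "ity"]

# Canonical segmentation: morphemes in their canonical (lemmatised) forms;
# their concatenation need not reproduce the written word.
canonical = ["achieve", "able", "ity"]

def restores_word(segments, word):
    """Check the defining property of a surface segmentation."""
    return "".join(segments) == word

print(restores_word(surface, word))    # True: the morphs rebuild the word
print(restores_word(canonical, word))  # False: "achievableity" != "achievability"
```

The same property is what allows surface segmentations to be undone trivially in a downstream pipeline, whereas canonical forms require an additional mapping back to surface realisations.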
Rule-based segmentation relies on predefined rules based on expert knowledge as well as the morphological structure of the language (Eiselen & Puttkammer, 2014a). Rule-based methods are considered time-consuming and expensive to develop, and since they use rules to segment texts, they tend to be fragile when encountering new words not within the lexicon (Villegas-Ch et al., 2024). Despite these drawbacks, rule-based methods remain useful, especially in languages with well-documented morphological rules, serving as baselines or complementary tools in hybrid approaches (Grönroos et al., 2016; Zbib et al., 2012).

Machine learning-based segmentation, on the other hand, uses algorithms to learn from data and identify morphemes based on stochastic and probabilistic methods (Goldsmith, 2001). Unlike rule-based systems, these techniques do not rely on explicitly defined rules; instead, they are heavily dependent on the availability of sufficient training data, which may compromise their reliability if inadequate. The primary machine learning methods are supervised, unsupervised, semi-supervised, and reinforcement learning (Murphy, 2012a), with the first three being the most commonly used in morphological segmentation.

The dataset used in these machine learning-based approaches for morphological segmentation can either be unsegmented or segmented, i.e. annotated with morphological boundaries, before being used to train the model. The former approach corresponds with unsupervised segmentation, where the model learns segmentation patterns directly from unannotated data, while the latter corresponds with supervised segmentation, where the model is trained on data with predefined segmentation (Eskander et al., 2022). The third approach, known as semi-supervised, combines these approaches.
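For supervised training, pre-segmented words are commonly encoded as character-level label sequences, which turns surface segmentation into a sequence-labelling problem (the framing typically used with CRF models). The sketch below uses a BMES scheme (Begin/Middle/End of a multi-character morph, Single for a one-character morph); the scheme and the illustrative segmentation ngi-ya-bonga of “ngiyabonga” are assumptions for exposition, not taken from this study’s annotation format.

```python
def to_bmes(morphs):
    """Encode a surface-segmented word as character-level BMES labels."""
    labels = []
    for m in morphs:
        if len(m) == 1:
            labels.append("S")
        else:
            labels.extend(["B"] + ["M"] * (len(m) - 2) + ["E"])
    return labels

def from_bmes(word, labels):
    """Recover morph boundaries from a BMES label sequence."""
    morphs, current = [], ""
    for ch, lab in zip(word, labels):
        current += ch
        if lab in ("E", "S"):       # a morph ends here
            morphs.append(current)
            current = ""
    if current:                      # tolerate a truncated label sequence
        morphs.append(current)
    return morphs

labels = to_bmes(["ngi", "ya", "bonga"])
print(labels)                        # ['B', 'M', 'E', 'B', 'E', 'B', 'M', 'M', 'M', 'E']
print(from_bmes("ngiyabonga", labels))  # ['ngi', 'ya', 'bonga']
```

Encoding and decoding are exact inverses, so a model that predicts one label per character fully determines a surface segmentation.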
All these techniques require a substantial amount of data to learn segmentation patterns accurately, and their inherent adaptability enables them to perform well even on new words. This flexibility renders them suitable for deployment in real-world applications where not all words are present in the training set. However, obtaining sufficient annotated or unannotated data is a significant challenge, particularly for low-resource languages.

Morphological segmentation plays a critical role in NLP, particularly for morphologically complex and low-resource languages like isiZulu. The choice between surface and canonical morphological segmentation is influenced by several factors, including the specific linguistic characteristics of the language in question, the intended application of the segmentation, and the availability of annotated data (Occhipinti, 2024; Rice et al., 2024). When comparing the two, surface segmentation presents distinct advantages, particularly for language modelling. Canonical segmentation decomposes a word into its underlying morphemes in their canonical (or lemmatised) forms (Cotterell et al., 2016a). While valuable for morphological analysis, the resulting morphemes often do not correspond directly with the actual orthographic forms of the language. This discrepancy can be problematic for language modelling, because the model may struggle to map canonical forms back to their surface realisations. In contrast, surface segmentation divides words into morpheme-based substrings that can be reassembled into the original orthographic word. This method captures the surface forms of words, including the phonological variations that occur when morphemes combine.

A fundamental challenge in surface segmentation is determining the optimal segmentation granularity; that is, the level of detail or fineness with which words are segmented into their morphemic components.
When surface segmentation is used, a word may have multiple valid segmentations depending on the granularity level. However, determining the optimal segmentation granularity for training language models remains an open research question (Ataman & Federico, 2018; Meyer & Buys, 2022; Salesky et al., 2020). If segmentation is too aggressive, language models may have to learn too much compositional semantics to capture meaning accurately, while if it is too conservative, the data sparsity problem may not be sufficiently addressed.

One way that has been explored to address this trade-off is to use methods such as Byte-Pair Encoding (BPE), where the hyper-parameter fixing the granularity must be optimised (Salesky et al., 2020). While BPE and other subword tokenisation techniques attempt to address this issue, there is no consensus on the best granularity level for morphologically rich languages (Salesky et al., 2020).

Most morphological (canonical or surface) segmentation work in the literature follows the unsupervised approach, which is especially common for low-resource languages like isiZulu that lack sufficient annotated data (Bergmanis & Goldwater, 2017; Creutz et al., 2007a; Goldsmith, 2001; Hammarström & Borin, 2011; Mzamo et al., 2019a). This technique is attractive because it can be applied to any language with adequate digital textual data, without the need for expert annotations (Can & Manandhar, 2014; Ruokolainen et al., 2016). However, despite its popularity, the unsupervised approach tends to achieve lower accuracy in morphological segmentation due to the lack of linguistic guidance (Wang et al., 2016b). This limitation reduces its effectiveness in downstream applications, where precise segmentation may have a significant effect on performance.
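Returning to the BPE granularity hyper-parameter mentioned above: the sketch below implements the core greedy merge-learning loop from scratch (a simplified toy, not the BPE variant or data used in this study). The number of merge operations directly controls how coarse the resulting subword units are.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Greedily learn BPE merges; num_merges is the granularity hyper-parameter."""
    vocab = Counter(tuple(w) for w in words)  # each word as a symbol sequence
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)      # most frequent adjacent pair
        merges.append(best)
        merged = Counter()
        for word, freq in vocab.items():
            merged[tuple(apply_merge(word, best))] += freq
        vocab = merged
    return merges

def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of pair with the fused symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def segment(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols

# Toy corpus of isiZulu-like word forms (illustrative only).
corpus = ["ngiyabonga", "ngiyathanda", "uyabonga", "bayathanda"] * 3
fine = segment("ngiyabonga", learn_bpe(corpus, 3))     # few merges: fine-grained
coarse = segment("ngiyabonga", learn_bpe(corpus, 15))  # more merges: coarser
print(fine, coarse)
```

With more merges the same word is cut into fewer, longer units; crucially, both segmentations still concatenate back to the original word, but neither is guaranteed to align with true morpheme boundaries, which is the limitation noted above for morphologically rich languages.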
An alternative to unsupervised morphological segmentation is supervised segmentation, which relies on annotated datasets (Moeng et al., 2022; Pranjić et al., 2024; Puttkammer & Du Toit, 2021). Supervised machine learning approaches are often preferred for their higher accuracy and their ability to incorporate linguistic knowledge (Belth, 2024). Once trained, these models can segment words efficiently and generalise well to unseen data, making them more robust and less brittle than rule-based approaches. However, a major drawback of supervised segmentation is the need for large amounts of annotated data, which is costly and labour-intensive to produce, especially for low-resource languages. To mitigate this, researchers have explored semi-supervised approaches, which leverage a small set of annotated data alongside unlabelled data to improve performance while reducing annotation effort (Ruokolainen et al., 2014; Yusupujiang, 2018).

In contrast, rule-based morphological segmentation relies on predefined linguistic rules crafted by experts to reflect a language's morphological structure (Eiselen & Puttkammer, 2014a; Villegas-Ch et al., 2024). While this approach can yield high accuracy in well-defined linguistic contexts, it has several drawbacks. Rule-based systems are labour-intensive, expensive to develop, and often struggle with adaptability (Dasgupta & Ng, 2007; Villegas-Ch et al., 2024; Westhelle et al., 2022). Their performance also deteriorates when encountering new or unpredictable words, limiting their applicability in dynamic linguistic environments. Scalability is another significant challenge: for languages with complex morphologies, the number of required rules can be overwhelming. Furthermore, morphophonological alternations introduce additional complexities, making it difficult to capture all variations through static rules (Joseph & Anto, 2015; Selvam & Natarajan, 2009).
The choice of segmentation approach depends on the available linguistic resources and the intended applications. Rule-based methods are effective when comprehensive linguistic knowledge is available, whereas unsupervised approaches are preferable when annotated data is scarce. Supervised learning, despite its reliance on annotated data, offers superior accuracy and generalisation when adequate training data is available. In data-sparse settings, semi-supervised techniques provide a viable compromise, leveraging small annotated corpora alongside large raw text corpora for enhanced learning.

To date, supervised surface segmentation for isiZulu remains underexplored, despite its potential to improve downstream NLP tasks. A primary limitation in this regard is the lack of annotated surface-segmented data required for training such models. To address this gap, the present study leverages the ZRG (Marais & Pretorius, 2023), a rule-based system implemented in the GF programming language. The ZRG allows for the morphosyntactic analysis of isiZulu by generating syntax trees through parsing, which involves a deep linguistic analysis of the text. While this process can be slow, it exposes lexical categories (e.g., tense, polarity, and agreement markers) that can be systematically manipulated to generate synthetic data, a process referred to as data augmentation.

Additionally, the ZRG supports a linearisation process, which converts syntax trees into human-readable text. During linearisation, the grammar employs a binding token, facilitating the construction of surface forms and allowing different segmentation granularities to be specified. This method allows for flexibility regarding segmentation strategies, enabling the exploration of optimal strategies based on linguistic approaches to isiZulu subword modelling.
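The role of the binding token can be sketched as follows. In GF, morphemes joined by the BIND token are glued together during linearisation; in plain shell output the token is typically rendered as `&+`. The linearised strings below are hypothetical (not actual ZRG output), but they show how a single binding-token-annotated linearisation can yield both the orthographic word and a surface-segmented form, and how granularity is controlled by where the tokens are placed:

```python
BIND = "&+"  # GF's binding token as rendered in plain linearised output

def to_orthographic(linearised: str) -> str:
    """Glue morphemes across binding tokens to recover the written form."""
    return linearised.replace(f" {BIND} ", "")

def to_surface_segmented(linearised: str, sep: str = "-") -> str:
    """Keep the morpheme boundaries, marking each with a separator."""
    return linearised.replace(f" {BIND} ", sep)

# Hypothetical fine-grained linearisation of "ngiyabonga":
fine = "ngi &+ ya &+ bonga"
# A coarser granularity would simply place fewer binding tokens:
coarse = "ngiya &+ bonga"

print(to_orthographic(fine))         # orthographic word
print(to_surface_segmented(fine))    # fine-grained surface segmentation
print(to_surface_segmented(coarse))  # coarser surface segmentation
```

Because both granularities reassemble into the same orthographic word, the grammar can emit training data at several granularities from the same underlying syntax trees.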
It is important to note that the ZRG, being a rule-based system, suffers from the aforementioned limitations of rule-based approaches: it is less robust when encountering words outside its defined lexicon, which it then fails to parse, so that no syntax tree can be generated. It is also very slow, making it a less preferred strategy for downstream tasks where speed and accuracy are important factors. Moreover, the ZRG was not originally implemented as a surface segmentation tool.

1.3 Problem Statement

To enable the interplay between rule-based and machine learning approaches in investigating the morphological surface segmentation of isiZulu, this work leverages the strengths of the ZRG. The ZRG provides broad coverage of isiZulu, supports flexible segmentation styles, and enables the generation of surface-segmented data at different granularities in a linguistically informed manner. This approach eliminates the need for extensive system engineering or expert curation, which can be costly and time-consuming. However, recognising the limitations of rule-based systems, such as their lack of robustness and slower processing speeds, this study proposes training a supervised machine learning model on the ZRG-generated annotated data. Supervised models are well suited to generalising to unseen words, offering greater robustness and efficiency in segmentation, and therefore serve better as a pipeline for investigating morphological segmentation in downstream tasks where fast and accurate segmentation is crucial.

1.4 Research Questions, Objectives and Hypotheses

The discussion in the previous section leads to the two research questions to be addressed in this study.

1.4.1 Research Questions

This research seeks to address the following questions:

1.
How can a hybrid approach to supervised surface segmentation be pursued that leverages the strengths of a rule-based approach to overcome the weaknesses of a machine learning approach in the context of isiZulu?

2. How can such a hybrid approach be used to obtain insights into supervised surface segmentation strategies for isiZulu?

1.4.2 Research Objectives

In order to answer the research questions, the following primary objective will be pursued:

To investigate supervised surface segmentation of isiZulu text using synthetic data generation.

In order to reach this primary objective, the following sub-objectives have been identified.

• Sub-objective 1: To utilise the ZRG to parse a large corpus of isiZulu text to obtain a treebank.
• Sub-objective 2: To utilise the ZRG to augment the treebank to obtain additional synthetic data.
• Sub-objective 3: To investigate the suitability of different machine learning models for the supervised segmentation task.
• Sub-objective 4: To train segmentation models on different linearisations of the treebanks.
• Sub-objective 5: To evaluate segmentation models intrinsically and extrinsically according to suitable metrics.

1.5 Research Methodology

This study employs the hypothetico-deductive methodology to investigate the morphological surface segmentation of isiZulu text. Central to this methodology is the use of a rule-based GF grammar, the ZRG, to generate a large corpus of surface-segmented isiZulu text across different granularities. This ensures linguistically informed and accurate segmentation.

To further mitigate data sparsity, data augmentation techniques are applied to enhance the diversity of the generated corpus. These datasets are used to train supervised segmentation models, which are subsequently evaluated both intrinsically (using metrics such as F1-score, BLEU, and chrF) and extrinsically through their impact on neural machine translation (NMT) performance.
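As an illustration of the intrinsic side of such an evaluation, a boundary-level F1-score for segmentation can be computed by comparing the boundary positions implied by a predicted segmentation against a gold-standard one. This is a minimal sketch; the metric definitions actually used in this study may differ in detail:

```python
def boundaries(segments):
    """Character offsets at which morpheme boundaries fall (word end excluded)."""
    offsets, pos = set(), 0
    for seg in segments[:-1]:
        pos += len(seg)
        offsets.add(pos)
    return offsets

def boundary_f1(gold, predicted):
    """F1 over boundary positions for a single gold/predicted segmentation pair."""
    g, p = boundaries(gold), boundaries(predicted)
    tp = len(g & p)                            # correctly predicted boundaries
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical gold vs predicted segmentation of "ngiyabonga":
gold = ["ngi", "ya", "bonga"]
predicted = ["ngi", "yabonga"]  # only one of the two boundaries recovered
print(round(boundary_f1(gold, predicted), 3))
```

Here the predicted segmentation recovers one of two gold boundaries with no spurious ones, giving a precision of 1.0, a recall of 0.5, and an F1 of about 0.667.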
The approach integrates comprehensive data preparation, segmentation modelling, and systematic evaluation to address the linguistic and technological challenges inherent to low-resource, morphologically complex languages like isiZulu. By employing this hypothesis-driven methodology, the research ensures a structured and systematic exploration of the role of morphological segmentation in enhancing downstream NLP tasks such as NMT. Detailed discussions of each phase, including corpus generation, segmentation modelling, and evaluation strategies, are presented in subsequent chapters. The hypotheses governing this method are outlined below.

1.5.1 Hypotheses

In line with the hypothetico-deductive methodology, the following hypotheses are investigated.

1. In developing a hybrid approach to supervised surface segmentation of isiZulu, using a machine learning approach as a foundation will ensure robustness and efficiency, while a combination of synthetically generated and automatically annotated data will address the requirement of machine learning approaches for large amounts of training data. The following sub-hypotheses have been developed:

(a) The hybrid approach will exceed the rule-based approach in robustness;
(b) The hybrid approach will exceed the rule-based approach in efficiency; and
(c) The hybrid approach will reach a sufficient[1] level of accuracy in comparison to the rule-based approach.

2. The hybrid approach will result in one or more segmenters that improve performance in a downstream task such as machine translation.

1.6 Ethical Considerations

This study adheres to ethical principles to ensure responsible research practices in the development and evaluation of morphological segmentation. Ethical concerns are particularly relevant during data collection, processing, and model evaluation, given the potential social and linguistic implications of working with a low-resource language.
IsiZulu, like many other low-resource languages, lacks large publicly available datasets that are both diverse and of high quality. This poses ethical challenges during data collection, as the scarcity of digital data resources increases the risk of bias, inaccuracies, and misrepresentations that could negatively impact the digital development of the language. To mitigate these risks, careful and ethically responsible data collection is essential to ensure fair representation, linguistic integrity, and minimal bias in the development of computational models.

1.6.1 Data Collection and Use

The data used in this study was sourced from publicly available corpora and other commonly used datasets in NLP research. However, these datasets may have been originally collected through web scraping or other automated means, making them susceptible to biases present on the internet or in other sources. Consequently, inherent biases from the original data sources could be reflected in the dataset used for training the models in this study. In addition, special care has been taken to ensure that the collection and processing of isiZulu text respect copyright regulations and do not violate any intellectual property rights. The study follows best practices for dataset attribution, and proper citations are provided for all corpora used.

[1] The notion of sufficiency is discussed in detail in Chapter 7.

1.6.2 Privacy and Cultural Sensitivity

Since the datasets primarily consist of linguistic data, they contain minimal personally identifiable information (PII), making privacy risks relatively low. However, ethical considerations go beyond privacy concerns and extend to the linguistic and cultural sensitivity associated with working with indigenous languages such as isiZulu. This research does not seek to alter the natural structure of isiZulu, but rather aims to enhance its digital representation in NLP applications.
This is made possible through the ZRG, which has been developed with a focus on linguistic transparency and integrity (Marais & Pretorius, 2023).

1.6.3 Transparency and Reproducibility

To promote open science and ensure that the findings contribute to the broader NLP research community, this study adheres to transparency and reproducibility standards by:

1. Clearly documenting data sources, pre-processing steps, and model training configurations. All related resources will be made publicly available and accessible in the GitHub repository[2].
2. Providing implementation details to enable reproducibility.
3. Sharing insights into the strengths and limitations of the proposed methods.

By addressing these ethical considerations, this study aims to contribute responsibly to NLP research while upholding principles of fairness, linguistic inclusivity, and social responsibility. Additionally, this research proposal was submitted for ethical approval to North-West University, where it was reviewed and approved by the relevant Research Ethics Committee. The assigned ethics approval number is NWU-01328-23-A9.

[2] https://github.com/Sthesha/supervised-surface-segmentation

1.7 Dissertation Structure

This study is made up of the following chapters:

Chapter 1 – Introduction: This chapter provides an overview of the research, including the problem statement and the research objectives, and summarises the research methodology. It sets out the research context and outlines the dissertation's structure.

Chapter 2 – Research Methodology: This chapter outlines the hypothetico-deductive methodology guiding this study. It establishes the research paradigms and provides a structured methodology for investigating the morphological surface segmentation of isiZulu text using supervised machine learning approaches, in order to gain insight into different segmentation strategies.
Chapter 3 – Morphological Segmentation: This chapter provides an overview of morphological segmentation, covering the linguistic background, rule-based approaches with examples, machine learning techniques, and the metrics and benchmarks used for evaluating segmentation.

Chapter 4 – Data Preparation: This chapter outlines the process of preparing the data for training the supervised surface segmentation models. It includes the generation of surface-segmented data using the rule-based GF grammar, data augmentation techniques, and the creation of a representative dataset for isiZulu.

Chapter 5 – Models Design: This chapter outlines the design and development of the supervised surface segmentation models, detailing the selection of learning algorithms, model architecture, parameter optimisation, and the training process. It also provides the rationale for design choices and their alignment with the research objectives.

Chapter 6 – Experimental Evaluation: This chapter presents an evaluation of the developed supervised segmentation models, focusing on both intrinsic metrics (e.g., F1-score, BLEU, chrF) to assess segmentation quality and extrinsic metrics to evaluate their impact on downstream NMT performance. It includes comparative analyses across segmentation granularities and a discussion of the results in the context of addressing data sparsity in isiZulu.

Chapter 7 – Findings, Conclusion and Future Research: This chapter provides a concise summary of the preceding chapters and the results of the formulated hypotheses and, based on the key findings, addresses the outlined research questions. In addition, it reflects on the challenges and limitations that have been identified and suggests future research directions. Finally, it provides the study's overall conclusions.

1.8 Chapter Summary

This chapter provided an introduction, background, and overview of the study.
These were followed by its research objectives, which focus on addressing the challenges posed by data sparsity and morphological complexity in isiZulu, a low-resource and morphologically rich language. Through a systematic exploration of the research aims, the study emphasises the potential of supervised surface segmentation models to enhance the performance of NLP applications.

The objectives provide a clear roadmap for the study, beginning with the generation of a large, linguistically valid, and representative corpus of surface-segmented isiZulu text using a rule-based grammar, the ZRG. The study also proposed the integration of data augmentation techniques to address data sparsity and the identification of the most effective supervised learning algorithms for this investigation, both of which underline the study's focus on methodological rigour. This methodology is grounded in the hypothetico-deductive approach, which formulates several hypotheses aligned with the study objectives and tests them accordingly. Finally, the inclusion of an extrinsic evaluation objective, in which the impact of segmentation granularity is measured through NMT performance, emphasises the practical relevance and broader implications of this research. By addressing these objectives, this study aims to contribute meaningfully to the field of NLP for low-resource languages, setting the stage for future advancements in morphological segmentation and its applications.

Chapter 2

Research Methodology

2.1 Introduction

The present chapter sets out a methodology for developing a supervised surface segmentation model for isiZulu using synthetic data generated with the ZRG, and for evaluating its effectiveness in addressing the data sparsity issue in a downstream task, specifically a machine translation system.
Building upon the foundational concepts introduced in the previous chapter, this chapter elaborates on the systematic approach adopted to achieve the research objectives. It explores the research philosophies and paradigms that inform the study and outlines the specific methodological framework implemented.

The focus of this study is to investigate the surface segmentation of isiZulu text at different levels of granularity using a supervised machine learning approach developed on synthetic data. By leveraging the hypothetico-deductive method as the guiding approach, this study enables hypothesis formulation and empirical testing, ensuring that the chosen methodologies are well suited to tackle the challenges of agglutinative languages such as isiZulu. By embedding the research within established philosophical and paradigmatic frameworks, this chapter not only establishes a foundation for the methodological approach, but also emphasises the significance of morphological segmentation in addressing data sparsity for low-resource languages with complex morphologies, such as isiZulu, in the context of language modelling.

This chapter is structured as follows: Section 2.2 provides a foundational discussion of research as a concept, leading to Section 2.3 and Section 2.4, which discuss different research paradigms and philosophies. These are presented as assumptions and beliefs that guide a researcher's approach to conducting a study. Section 2.5 positions the present study within the introduced assumptions and beliefs, which also influence the chosen methodology, namely the hypothetico-deductive methodology, which is discussed in Section 2.6. Finally, Section 2.7 concludes the chapter with a summary.

2.2 The Concept of Research

The term "research" has been used in various ways in the literature, encompassing a range of interpretations.
According to the Cambridge Dictionary (n.d.), research is defined as "a detailed study of a subject, especially in order to discover (new) information or reach a (new) understanding". Khaldi (2017) describes research as the systematic acquisition of knowledge through meticulous and structured investigation. Similarly, Dane (1990) emphasises research as a process aimed at asking and answering questions about the world. As a process, research enables the adoption, refinement, or rejection of certain knowledge based on evidence and analysis. The aims of research are extensive, including exploring new phenomena, explaining causes and relationships, evaluating interventions, predicting outcomes, understanding complex systems, and developing practical solutions to real-world problems (Collis & Hussey, 2014; Garg, 2016). In addition, research plays a crucial role in theory development and testing, providing a foundation for scientific progress and application (Whetten, 1989).

According to Goundar (2012), three key conditions must be met when undertaking research. Firstly, the research process must be governed by clearly defined methodologies, such as qualitative or quantitative approaches, and informed by the standards and practices inherent to the researcher's discipline. Secondly, the researcher should employ tools, methods, and techniques that have been rigorously tested for reliability and validity. Reliability pertains to the consistency and repeatability of the chosen methods, while validity ensures that these methods are chosen correctly and accurately measure the intended phenomena. Lastly, the researcher must conduct the research objectively and without bias, striving to eliminate bias or vested interests that may compromise the impartiality and credibility of the outcomes.
Throughout the research process, conclusions should be reached based on the evidence gathered, avoiding the incorporation of personal biases or interests unrelated to the study's objectives (Goundar, 2012).

Having established a foundational understanding of research as a systematic process for acquiring and refining knowledge, it is essential to explore the broader philosophical and paradigmatic frameworks that underpin and guide research practices. These frameworks, including research paradigms and research philosophies, serve as the theoretical lenses through which researchers view the world, formulate questions, and select appropriate methods. By exploring these paradigms and philosophies, one can better understand how they influence the design, execution, and interpretation of this study.

2.3 Research Paradigms

It is important to consider the concept of research paradigms, because these guide the acquisition of knowledge and scientific discoveries through their underlying principles and assumptions (Park et al., 2020). The term "paradigm", often referred to as a worldview, encompasses the fundamental collection of beliefs, theories, philosophical assumptions, and ideas that researchers hold. These paradigms shape the design and execution of their research (Mafuwane et al., 2011). A research paradigm provides a lens through which researchers interpret the world and evaluate the methodological aspects of their study, ultimately influencing their choices of data collection and analysis strategies.

Conducting scientific research requires careful consideration of the research paradigm, as this paradigm establishes specific assumptions about how the world operates. Indeed, Myers (2002) emphasises the importance of adopting an appropriate research paradigm to ensure validity, thus allowing researchers to build on their philosophical stance. Furthermore, Clavier et al.
(2012) highlight that paradigms serve as critical reference points that help others to understand the researcher's underlying assumptions, which have a significant bearing on the study's design and outcomes. These paradigms are commonly categorised into four key domains: ontological, epistemological, axiological, and methodological (Creswell & Creswell, 2017).

2.3.1 Ontological Assumptions

Ontological assumptions refer to the nature of reality, whether it is regarded as objective, subjective, or socially constructed, and encompass the aspects to be investigated or encountered during the research process (Alele & Malau-Aduli, 2023). According to Guba and Lincoln (1994:108), ontological assumptions aim to address the question: What is the form and nature of reality, and what can be known about it? Simply put, these assumptions focus on understanding the fundamental nature of the phenomena being researched.

The nature of reality implies different ontological perspectives, which, in turn, influence the research approach. Assumptions about reality are often categorised along a spectrum, ranging from an objective reality that exists independently of human perception to a subjective reality that is shaped by individual experiences and interpretations (Ahmed, 2008). Ontological assumptions play a critical role in structuring a researcher's thinking about the topic under investigation. Such assumptions are central to guiding the formulation of research questions and determining how those questions are addressed (Kivunja & Kuyini, 2017).

2.3.2 Epistemological Assumptions

Epistemology is a branch of philosophy that focuses on the study of knowledge and beliefs, describing how knowledge about reality is acquired, conceptualised, and applied (Hatch, 2018). According to Guba and Lincoln (1994:108), this assumption addresses the question: What is the nature of the relationship between the knower or would-be knower and what can be known?
The response to this question is often influenced by the ontological assumptions underpinning the research. Epistemology seeks to explore questions such as: How is knowledge produced? What standards distinguish good knowledge from bad knowledge? And how should reality be defined or represented (Hatch, 2018)? Epistemology also pertains to how one can communicate this knowledge to others (Burrell & Morgan, 2019).

2.3.3 Axiological Assumptions

Axiology pertains to values and ethics, reflecting the role of a researcher's personal values and biases in the research process, including actions taken after the research is completed (Saunders, 2009). It influences the entire research process and is critical for ensuring the credibility and integrity of the study. As noted by Pretorius (2024), axiology encourages researchers to reflect on how their own values, beliefs, and biases shape the design, execution, and interpretation of their studies.

Furthermore, axiological assumptions are interconnected with ontological and epistemological assumptions. Understanding the nature of reality (ontology) enables researchers to assess the truth value of knowledge, while determining the knowability of this reality (epistemology) informs the science of truth. Once these foundations have been established, defining the correct science of values (axiology) becomes more straightforward (Engle, 2008).

2.3.4 Methodology

This dimension of the research undertaking is concerned with the general strategy or action plan that guides the choice and use of specific methods in the context of a particular research paradigm (Wahyuni, 2012). It refers to a system of methods, procedures, or principles employed to achieve specific objectives. In research, methodology can be defined as the systematic procedures or strategies that researchers use to describe, explain, and predict phenomena, or to carry out a research project (Aguiar, 2024).
According to Guba and Lincoln (1994), a methodology seeks to determine the steps that an inquirer must take to uncover what they believe can be known.

A research methodology helps to clarify several aspects of a research project, such as why it was conducted, how and why the hypotheses were formulated, and the methods or techniques chosen to address the research problem. This includes details about the data used, how it was collected, and related questions. A closely related concept, often used interchangeably, is research methods, which refer to the specific procedures employed in conducting research (Howell, 2012). As Goundar (2012:45) aptly states, research methodology "refers to more than a simple set of methods; rather it refers to the rationale and the philosophical assumptions that underlie a particular study relative to the scientific method", while research methods constitute the execution phase of both scientific and non-scientific research.

As with other paradigmatic assumptions, the specific methodological choices that a researcher adopts are influenced by their responses to ontological, epistemological, and axiological questions. This interconnectedness ensures that all these paradigms work together, enabling researchers to conduct systematic investigations that produce meaningful and reliable knowledge.

Various methodologies can be employed in different types of research, and the term is generally considered to include research design, data collection, and data analysis (Goundar, 2012). Research methodologies are commonly categorised into two main types: quantitative and qualitative methodologies (Onwuegbuzie & Leech, 2005). These are briefly discussed below.

1.
The quantitative methodology emphasises numerical attributes, relying on objective measurements and statistical analysis to explore the relationships between variables and describe the causes of change (Kornuta & Germaine, 2019). Quantitative research involves gathering numerical data, which is systematically analysed using statistical methods to address specific research questions or hypotheses. Researchers who use quantitative methods often propose hypotheses, collect numerical data, and use statistical evidence to support or refute these hypotheses, enabling broader generalisations based on a scientific approach (Rana et al., 2023).

2. In contrast to quantitative methods, qualitative research methodologies utilise non-numerical data, such as observations, textual analysis, and interviews, to answer open-ended questions such as "how" and "why". This makes them ideal for investigating non-linear phenomena such as experiences, perspectives, and behaviours that may be too complex to capture with quantitative methods (Goundar, 2012; Tenny et al., 2017). Rather than relying on statistical analyses or pre-formulated hypotheses, qualitative studies often involve open-ended inquiry, allowing hypotheses or theories to emerge naturally during the research process (Kornuta & Germaine, 2019). The researcher plays an active role in interpreting data, relying heavily on subjective insights and descriptive observations. While qualitative approaches inherently involve subjective interpretation by the researcher, these interpretations are grounded in a rigorous and structured analysis of the collected data, rather than mere personal opinions or ungrounded assumptions.

Understanding and reflecting upon these beliefs is essential, as they underpin the methodological choices and strategies employed in any research project.
By clarifying these perspectives, researchers can ensure that their study aligns with its philosophical foundations, enhancing its rigour and relevance. Again, these philosophical assumptions are part of the research philosophy, which includes various possibilities.

2.4 Research Philosophy

The term research philosophy is an evolving concept that lacks a unified definition among scholars and is often used interchangeably with research paradigms. According to Žukauskas et al. (2018), a research philosophy represents the system of thought that a researcher adopts to produce new and reliable knowledge about their research object. Similarly, Saunders (2009) describes it as the development of knowledge and its nature, emphasising that it embodies critical assumptions about a researcher's perspective of the world, which, in turn, influences their research strategy and methods. This study focuses on a subset of the research philosophies most widely discussed in the social sciences: positivism, interpretivism, pragmatism, and critical realism, as highlighted by Žukauskas et al. (2018) and Ryan (2018). These were selected due to their relevance to the study's objectives and their widespread application in social science research.

2.4.1 Positivism

The positivist philosophy asserts that the social world can be studied and understood objectively (Žukauskas et al., 2018), much like the natural world. This philosophical stance emphasises the use of scientific methods to produce credible data and facts, which are considered to be independent of human interpretation or bias. Positivist studies primarily rely on quantitative data, utilising experiments and established theories to formulate hypotheses that can be rigorously tested, confirmed, or rejected (Saunders, 2009). A key characteristic of positivist research is its focus on objectivity and detachment, aligning with its axiological assumption.
Researchers who adhere to this philosophy strive to minimise personal values or subjective influences that could impact research outcomes (Irshaidat, 2022). The ontological assumption upheld by positivism asserts the existence of a rigid, objective reality which can be understood through appropriate instruments, data collection, and empirical analysis. Epistemologically, positivism assumes that knowledge about this reality is acquired through rigorous observational strategies, ensuring that it remains objective and unaltered (Ringberg & Reihlen, 2008).

By emphasising empirical evidence and measurable phenomena, positivist approaches seek to arrive at conclusions that are replicable and universally valid. This makes positivism particularly suitable for research that involves systematic observation, controlled experimentation, and hypothesis-driven inquiry. In this study, a positivist approach is adopted to validate claims about the performance of supervised machine learning approaches, when these are trained on a combination of synthetically generated and automatically annotated data, to ascertain whether they enhance the robustness and efficiency of morphological segmentation models for isiZulu by addressing the data sparsity challenge inherent in low-resource, morphologically rich languages.

Having established the foundational understanding of research as a systematic process of acquiring and refining knowledge, it is essential to look at the broader philosophical and paradigmatic frameworks that underpin and guide research practices (Alele & Malau-Aduli, 2023). These frameworks, encompassing research paradigms and philosophies, serve as the theoretical lenses through which researchers view the world, formulate questions, and select appropriate methods.

2.4.2 Interpretivism

Another prominent research philosophy is interpretivism or constructivism, which emerged as a critique of positivism (Yong et al., 2021).
Interpretivism posits that the social world can be understood in multiple ways by different individuals, highlighting the crucial role of researchers in observing and interpreting this world (Ali, 2023). It is grounded in constructivist ontology, which asserts that reality is not discovered but constructed through human interactions and social processes (Goldkuhl, 2012). Researchers conducting interpretivist studies are often required to adopt an empathetic viewpoint, allowing them to better understand the perspectives of the research subjects. This reflects a fundamental epistemological assumption of interpretivism, namely that knowledge is subjective and arises from interactions between the researcher and the participants (Saunders, 2009). The associated axiological perspective acknowledges the researcher's participation in the data generation process. However, it also emphasises the importance of consciously listening to the participants and evaluating their input without contaminating it with personal biases or values, ensuring the reliability of the findings (Luyt et al., 2012). This approach is typically used to study people and their social interactions, contrasting with positivist approaches that often focus on natural sciences and non-human objects such as computers (Saunders, 2009).

2.4.3 Pragmatism

The two approaches discussed above, positivism and interpretivism, represent opposite ends of the research philosophy spectrum, each with clear advantages and limitations. However, many research studies do not fall entirely into one of these categories, as they often involve exploring both rigid and subjective realities.
This calls for a more flexible approach that combines elements from multiple philosophies, which is where pragmatism as a research philosophy comes in.

Pragmatism asserts the possibility of working with diverse assumptions, including ontological and epistemological perspectives from both positivist and interpretivist stances (Saunders, 2009). As noted by Alghamdi and Li (2013), pragmatism is not restricted to any one system of philosophy or reality. Instead, it allows researchers the flexibility to select the methods and techniques most suitable for addressing their research questions, whether these are quantitative or qualitative, and regardless of the type of data being analysed. This philosophical stance emphasises the practical outcomes of research and supports the use of mixed-methods approaches, combining the rigour of scientific inquiry with the depth of interpretive understanding. Pragmatism, therefore, serves as a valuable framework for studies that require methodological diversity and adaptability to address complex research problems effectively.

2.4.4 Critical Social Theory

Critical Social Theory (CST) refers to a broad range of theoretical approaches that critique existing social conditions and power structures with the goal of emancipating humans and the planet from the hostile consequences of modernity (Celikates & Flynn, 2023; Manners, 2020). CST is rooted in the Frankfurt School's "Critical Theory", which emerged in the 1930s as an interdisciplinary research approach combining philosophy and social science with the emancipatory objectives of various social and political movements. CST examines how different dimensions of domination and oppression, such as economic, racial, gendered, and political dimensions, shape society, and seeks not just to understand these dynamics but to challenge and change them (Browne, 2000).
The ontological assumption of CST can be described as historical realism or critical realism, though in a different sense from what is presented by Bhaskar (2013). This means that CST does believe in a reality (particularly social reality) that is "real" and has objective consequences; however, this "reality" is not fixed or immutable, but is shaped by historical and social forces, especially by relations of power and oppression (Scotland, 2012). Guba and Lincoln (1994:109) characterise the ontology of this research philosophy as "historical realism – a reality that is shaped by social, political, cultural, economic, ethnic, and gender values, crystallised over time". In simpler terms, what we take to be "reality" (especially in the social world) is the product of history and power dynamics – for example, racial categories or gender roles are real in their consequences, but are historically constructed rather than being natural phenomena.

In terms of epistemological assumptions, CST is described as transactional or subjectivist, meaning that the inquirer and the object of inquiry are not independent, but interactively linked. As a result, the values of the investigator inevitably influence the findings (Guba & Lincoln, 1994). Within this framework, the researcher's values serve as primary resources driving the research philosophy. Rather than attempting to remain a neutral observer, the researcher explicitly uses their values to uncover and interpret social truths. According to Cohen (as cited in Scotland, 2012:35), "what counts as knowledge is determined by the social and positional power of advocates of that knowledge".

Concerning axiological assumptions, CST places values at its core, explicitly addressing normative questions such as "What is intrinsically worthwhile?" (Scotland, 2012:13). This inherently normative orientation emphasises respect for cultural norms and societal values (Kivunja & Kuyini, 2017; Scotland, 2012).
Central to CST are values of emancipation and social justice, which form the foundational components of its research paradigm. Methodologically, CST favours approaches that promote critical insight, active participation, and social change. Since the ultimate goal of CST is not merely to study society but to transform it, the methodology employed is often described as dialogic (Kivunja & Kuyini, 2017). Ideology critique represents a core methodological practice, enabling critical analysis and interrogation of underlying values and assumptions to expose injustice (Scotland, 2012). Furthermore, CST frequently incorporates action research as a means of actively seeking to change social realities (Dieronitou, 2014).

2.4.5 Critical Realism

Critical realism (CR) is a relatively recent and comprehensive philosophy of science developed by the English philosopher Roy Bhaskar as an alternative to the prevailing philosophical paradigms of positivism, interpretivism, and pragmatism (Lawani, 2021). CR comprises two primary components: transcendental realism and critical naturalism, addressing the philosophy of science and of social science, respectively (Bhaskar, 2013; Bhaskar, 2014). Transcendental realism is focused on the ontology of the natural world, asserting that real mechanisms exist beyond empirical observation and hermeneutic interpretation. Critical naturalism extends this realist ontology to the social realm, suggesting that underlying social structures and mechanisms also exist beyond immediate observation and interpretation (Bhaskar, 2013; Zhang, 2023). Ultimately, critical realism asserts that an objective reality exists (as in positivism), but our knowledge of this reality is inherently partial and shaped by theoretical frameworks. Consequently, critical realism emphasises the necessity of exploring deeper structures and causal mechanisms that operate or exist beyond direct empirical observation.
Similarly to pragmatism, CR advocates for mixed-methods research. However, unlike pragmatism, which views ontological and epistemological assumptions as separable from research methods, CR explicitly associates these methods with its philosophical positions. Specifically, CR maintains a layered ontology consisting of three distinct domains: the real, the actual, and the empirical (Holmén, 2020). The real domain refers to the deep structures, properties, and mechanisms of entities that exist independently of observation. The actual domain encompasses the events generated by these underlying mechanisms, irrespective of whether they are observed. Finally, the empirical domain represents the observable experiences or events perceived by individuals. CR thus rejects simplistic reductions of reality to only what can be directly observed. Instead, it asserts that many critical aspects of social reality, such as social structures and class relations, are real but not directly observable, requiring inference based on their observable effects.

Methodologically, CR advocates for a pluralistic approach to research, employing any methods that effectively identify and explain causal mechanisms. Given that CR ontology is complex, with multiple layers of reality, both quantitative and qualitative methods can be combined to capture different aspects of a phenomenon (Sobh & Perry, 2006). This methodological flexibility allows researchers to investigate not only observable patterns, but also the underlying structures and causal relationships that shape social and natural phenomena.

2.5 Positioning the Present Study

This study is firmly grounded in a positivist philosophical framework, which emphasises objectivity, empirical evidence, and hypothesis-driven inquiry.
The ontological assumption underlying this research is that an objective reality exists, wherein morphological patterns (the structure of words and how morphemes combine) in isiZulu can be systematically measured and analysed. This perspective aligns with the positivist view that reality is independent of human perception and can be understood through structured observation and experimentation.

Epistemologically, the study adopts the stance that valid and reliable knowledge about the relationship between segmentation granularity and downstream task performance can be acquired through empirical investigation. By leveraging comprehensive and repeatable experimentation, the research seeks to uncover insights into the supervised morphological segmentation of isiZulu texts, using a rule-based approach as a pipeline to generate synthetic data of different granularities. This approach ensures that findings are grounded in measurable evidence rather than subjective interpretation.

Axiologically, the study reflects the positivist emphasis on objectivity and detachment. Personal values and biases are consciously minimised to maintain the integrity of the research process and outcomes. Ethical considerations are central, with transparency in data selection, methodological soundness, and the reproducibility of results being consistently prioritised in order to uphold the study's credibility.

2.6 Hypothetico-Deductive Methodology

Since this study adopts the positivist paradigm, the most common methodology for such scientific research is the hypothetico-deductive methodology, an approach regarded as the heart of scientific inquiry (Fosl & Baggini, 2020).
This methodology comprises two primary components: the hypothetico part, where an explanatory hypothesis is formulated to address the research problem, and the deductive part, where testable claims are derived from the hypothesis and empirically examined. The result of this examination is either a substantiated or a falsified hypothesis (Park et al., 2020). According to Fosl and Baggini (2020:47), the principle governing this procedure is to "start with a hypothesis and a set of given conditions, deduce what facts follow from them and then conduct experiments to see if those facts hold and hence whether the hypothesis is false". Sekaran and Bougie (2016) and Tariq (2015) present this methodology as a seven-step process, as detailed below.

1. Identify a broad problem area
The first step in the research process is to identify a general area of interest or concern that serves as the foundation for the research project. This step involves scanning existing literature, identifying gaps, and noting practical challenges in the field.

2. Define the problem statement
To conduct a scientific research study, one needs to have a definite aim or purpose (Sekaran & Bougie, 2016). In this step, a clear and concise problem statement should be articulated, which highlights the issue that the study seeks to address and provides the foundation for the research objectives and questions. Preliminary information is gathered to understand the factors contributing to the problem, narrowing the broad area into a specific problem statement.

3. Develop hypotheses
Based on observations and existing theories, hypotheses are developed to guide the study. These serve as predictions to be examined through empirical investigation (Popper, 2002). In a scientific study, the formulated hypotheses must meet two key criteria:
(a) Testability: The hypothesis must be capable of being empirically tested.
(b) Falsifiability: The hypothesis should be disprovable, because hypotheses cannot be confirmed, but can only be corroborated until contradicted by new findings.

4. Determine measures
This step involves identifying and selecting appropriate tools, instruments, or methods to operationalise (quantify or observe) the variables defined in the research hypothesis. Trochim et al. (2016) note that in this process, validity and reliability are critical considerations to ensure that the measures capture the constructs being studied effectively.

5. Data collection
Once the measures have been established, researchers proceed with gathering data using suitable methodologies such as surveys, experiments, interviews, or observations. Data collection should follow ethical guidelines and adhere to a well-designed protocol to ensure consistency and minimise bias (Fowler, 2014).

6. Data analysis
After collecting the data, researchers analyse it using numerical approaches such as statistical analysis, or qualitative techniques appropriate to the nature of the study. This step involves examining patterns, testing hypotheses, and drawing conclusions based on the data's evidence (Sekaran & Bougie, 2016).

7. Interpretation of data
The final step in this research process involves interpreting the results of the analysis in the context of the original hypotheses and the broader problem area. The findings provide insights into whether the hypothesis is supported or not. At this stage, the researcher has to discuss the implications, address potential limitations, and provide recommendations for future research. In the event that the hypothesis is not supported, the researcher should critically evaluate the reasons behind the outcome and refine the theory for retesting (Sekaran & Bougie, 2016).
2.6.1 Application in this Research

The hypothetico-deductive methodology serves as the guiding framework for this research, in which multiple hypotheses are presented that align with the main objective of the study, which is: To investigate supervised surface segmentation of isiZulu text using synthetic data generation. This is achieved through the five sub-objectives presented in Chapter 1, Section 1.4.2. The following section outlines how each step of the hypothetico-deductive approach is applied in this study:

1. Identifying a broad problem area
As discussed in Chapter 1, isiZulu's complex agglutinative morphology allows a single stem to generate numerous word variations, significantly increasing vocabulary size. This vocabulary explosion poses challenges for language modelling, as many word variations may be absent from training data, leading to out-of-vocabulary (OOV) issues where models struggle to recognise or generate unseen tokens effectively.

One potential solution to mitigate data sparsity is morphological segmentation, which reduces vocabulary size by breaking words into their constituent morphemes. Morphological segmentation can be performed as either canonical or surface segmentation: the former breaks words into their underlying morphemes, while the latter divides them into morpheme-based substrings (Cotterell et al., 2016b). When canonical segmentation is used, it is simpler to determine the "correct" segmentation, since the canonical forms of a language's morphemes are well understood from a linguistic point of view. The same cannot be said of surface segmentation, because the high degree of morphophonological alternation makes it less clear what a suitable segmentation should be and where to place the morpheme boundaries when looking to ensure optimal results in a downstream task.
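The distinction can be illustrated with a small sketch. This is an illustrative example only (not the study's ZRG system), using a well-known English word from the canonical-segmentation literature; the defining property is that surface morphs are substrings of the written form, so they must concatenate back to the original word, whereas canonical morphemes need not.

```python
# Illustrative sketch: surface vs canonical segmentation.
# A surface segmentation splits the written word into substrings (morphs);
# a canonical segmentation restores underlying morpheme forms.

def is_surface_segmentation(word: str, morphs: list[str]) -> bool:
    """True if the morphs are substrings that rejoin to the word form."""
    return "".join(morphs) == word

word = "achievability"
surface = ["achiev", "abil", "ity"]      # substrings of the written form
canonical = ["achieve", "able", "ity"]   # restored canonical morphemes

print(is_surface_segmentation(word, surface))    # True
print(is_surface_segmentation(word, canonical))  # False: "achieveableity" != word
```

The same property is what makes surface segmentation directly usable as a preprocessing step for downstream tasks: the segmented text can be detokenised back to the original by simple concatenation.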
Both canonical and surface segmentation can be achieved through either rule-based or machine learning-based approaches. Rule-based systems rely on expert-crafted linguistic rules, while machine learning-based methods infer patterns from data, which can be either labelled (supervised) or unlabelled (unsupervised). Studies have shown that supervised approaches generally outperform unsupervised methods in morphological segmentation (Belth, 2024; Ruokolainen et al., 2016; Wang et al., 2016b).

Despite the advantages of machine learning-based segmentation, obtaining high-quality annotated data remains a significant challenge, especially for low-resource languages like isiZulu. The lack of pre-existing labelled datasets makes training supervised models particularly difficult, necessitating alternative approaches to data generation.

2. Defining the problem statement
The study narrows its focus to the specific problem of data sparsity caused by isiZulu's morphological complexity. This sparsity negatively impacts the efficiency of language models, particularly in NLP systems, where poor handling of morphological variation results in suboptimal performance. Morphological segmentation has the potential to alleviate this issue; however, the scarcity of annotated data in low-resource languages makes it difficult to investigate morphological segmentation and gain insight into its effects. The present study therefore explores a hybrid approach to morphological surface segmentation: using a rule-based system, the ZRG, to generate synthetic morphologically surface-segmented data of different segmentation granularities, using this data to train supervised machine learning morphological segmenters, and investigating their performance intrinsically and extrinsically.

3.
Developing hypotheses
This study is based on two hypotheses:
(a) In developing a hybrid approach to supervised surface segmentation of isiZulu, using a machine learning approach as a foundation will ensure robustness and efficiency, while a combination of synthetically generated and automatically annotated data will address the requirement of machine learning approaches for large amounts of training data.
i. The hybrid approach will exceed the rule-based approach in robustness;
ii. The hybrid approach will exceed the rule-based approach in efficiency;
iii. The hybrid approach will reach a sufficient level of accuracy in comparison to the rule-based approach.
(b) The hybrid approach will result in one or more segmenters that improve performance in a downstream task, such as machine translation.

4. Determining measures
In determining the appropriate measures for evaluation, this study employs a two-fold approach to assess both the efficiency and effectiveness of the proposed morphological segmentation system. Efficiency is evaluated by examining the ease and speed with which the supervised machine learning model performs segmentation, particularly in comparison to the rule-based approach, which is used as the pipeline for generating data. This involves measuring the processing time of both systems and comparing which performs segmentation faster. Effectiveness, on the other hand, is measured in three key ways. Firstly, how robust are the supervised surface segmentation systems in segmenting new tokens that are not in the vocabulary (tokens that the ZRG could not segment)? The second and third measures of effectiveness involve evaluating the developed system intrinsically and extrinsically.
In the intrinsic evaluation, the system output is directly assessed in terms of predefined standards or criteria that relate to the system's functionalities or objectives (Jones & Galliers, 1995; Resnik & Lin, 2010). In the surface segmentation context, the intrinsic evaluation assesses how close or similar the system-generated morphs (hypothesis) are to pre-generated (constructed) morphs (ground truth). In sequential text problems, this assessment is usually conducted through quantitative n-gram-based metrics such as BLEU (Papineni et al., 2002) and chrF (CHaRacter-level F-score; Popović, 2015), and classification metrics such as precision, recall, and F1-score. Similarly, this work employs these metrics to evaluate how similar the generated morphs are to the pre-generated morphs.

When conducting extrinsic evaluation, the developed system is treated as an enabling technology and its impact on another system, a downstream task, is assessed (Resnik & Lin, 2010). In this case, the impact of segmentation models on a downstream NLP application, Neural Machine Translation (NMT), is examined. The effectiveness of different segmentation granularities is assessed by evaluating the performance of the NMT system when trained on segmented versus unsegmented data. This comparison provides insight into the extent to which morphological segmentation improves translation quality, as measured by BLEU and chrF scores. Through this comprehensive evaluation framework, the study ensures that both the segmentation models and their practical application in the NLP downstream task are well assessed, with a view to gaining useful insights.

5. Data collection
Two datasets are utilised in this research as part of the data collection step:
(a) A dataset for training the morphological segmenters across different levels of granularity.
(b) A parallel isiZulu-English dataset that is filtered and preprocessed.
The isiZulu sentences are segmented using the developed segmenters, with the unsegmented sentences serving as a baseline. This allows for a direct comparison of the effectiveness of the segmentation strategies.

In a broader sense, data collection in this study extends beyond acquiring raw datasets. It involves conducting multiple experiments across different datasets to analyse the impact of morphological segmentation on model performance. This process includes training, validation, and testing, where the data is systematically split to ensure reliable evaluation. Insights are gathered through performance metrics, model robustness, loss trends, and visualisation graphs, which provide a deeper understanding of the model's learning behaviour, token or morpheme predictions, and overall system effectiveness. These experimental observations serve as a critical component of data collection, enabling a comprehensive assessment of the developed system.

6. Data analysis
Statistical and computational methods are employed to analyse the collected data. Intrinsic evaluations are conducted to measure the quality of the segmentation models, while extrinsic evaluations assess the impact of segmentation on NMT performance. Comparative analyses are carried out across different granularity levels to identify trends and draw meaningful conclusions.

7. Interpretation of data
The results are examined to determine whether the hypotheses are supported. Key findings, such as the relationship between segmentation granularity and its impact on translation quality, are analysed. The implications for future research and applications in low-resource language NLP are discussed, addressing the proposed approach's strengths and limitations.
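To make the intrinsic metrics mentioned under steps 4 and 6 concrete, precision, recall, and F1 for a surface segmentation can be computed over predicted morph boundary positions. The sketch below is an illustrative stdlib-only implementation under assumed conventions (the exact scoring scheme used in the study may differ), and the isiZulu-style word and segmentations are hypothetical examples:

```python
# Sketch of boundary-level precision/recall/F1 for surface segmentation.
# A boundary is the character offset at which one morph ends and the next begins.

def boundary_positions(morphs: list[str]) -> set[int]:
    """Character offsets at which a morph boundary is placed."""
    positions, offset = set(), 0
    for morph in morphs[:-1]:  # no boundary after the final morph
        offset += len(morph)
        positions.add(offset)
    return positions

def boundary_f1(hyp: list[str], ref: list[str]) -> tuple[float, float, float]:
    """Precision, recall, and F1 of hypothesis boundaries against a reference."""
    h, r = boundary_positions(hyp), boundary_positions(ref)
    if not h or not r:  # a single-morph word has no boundaries to score
        return 0.0, 0.0, 0.0
    tp = len(h & r)  # boundaries placed in the correct positions
    precision = tp / len(h)
    recall = tp / len(r)
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

# Hypothetical example: reference segmentation "ngi-ya-bonga",
# hypothesis merges the first two morphs into one.
p, r, f = boundary_f1(["ngiya", "bonga"], ["ngi", "ya", "bonga"])
print(round(p, 2), round(r, 2), round(f, 2))  # 1.0 0.5 0.67
```

Here every boundary the hypothesis places is correct (precision 1.0), but it finds only half of the reference boundaries (recall 0.5); scoring over boundaries rather than whole morphs gives partial credit to near-miss segmentations.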
By applying the hypothetico-deductive methodology, this study ensures a systematic and structured approach to exploring how morphological segmentation can alleviate data sparsity challenges in isiZulu. Each step is aligned with the study's objectives, ensuring that the research generates valid, reliable, and actionable insights. Subsequent chapters will provide in-depth discussions of these steps and their execution within the study's context. Figure 2.1 summarises the research methodology that allows for the systematic investigation of the research problem and the structured pursuit of the study's objectives.

Figure 2.1: Summary of the research methodology.

2.7 Chapter Summary

This chapter outlined the research methodology employed to investigate the possibility of using a hybrid of the rule-based ZRG and supervised machine learning approaches to conduct morphological surface segmentation of isiZulu text at different granularity levels, and to gain salient insights through intrinsic and extrinsic evaluation. Grounded in the positivist paradigm, the study adopts the hypothetico-deductive methodology, ensuring a structured and rigorous approach to hypothesis-driven scientific inquiry.

The chapter introduced foundational research philosophies and paradigms, emphasising how ontological, epistemological, axiological, and methodological assumptions inform the research process. These philosophical underpinnings set the stage for the study's methodological choices, particularly its reliance on objectivity, empirical evidence, and systematic observation. The positivist paradigm was identified as the guiding framework, aligning with the study's focus on measurable phenomena and hypothesis testing.
The hypothetico-deductive methodology was presented as the central approach for this research, detailing its steps: identifying a broad problem area, defining a specific problem statement, developing hypotheses, determining measures, collecting and analysing data, and interpreting results. Each step was explicitly contextualised within the study objectives, demonstrating how the methodology supports a comprehensive test of the two formulated hypotheses.

This chapter highlighted key components of the methodology, including the selection of intrinsic and extrinsic evaluation metrics, the collection and preparation of datasets, and the use of statistical and computational methods for data analysis. By ensuring alignment between the research questions, methodology, and objectives, the study provides a pipeline to investigate the hypotheses in a systematic and scientifically governed manner and, hence, to achieve the primary objective.

In summary, this chapter has provided a comprehensive framework for conducting the research, bridging philosophical considerations with practical methodological choices. Embedding the study within a structured and hypothesis-driven framework helps to ensure the reliability and validity of the findings. Subsequent chapters will build on this foundation, delving into the implementation, experimentation, and evaluation processes, ultimately addressing the broader implications of the research for low-resource languages like isiZulu.

Chapter 3
Morphological Segmentation

3.1 Introduction

This chapter sets out to provide a literature review of morphological segmentation and a number of salient concepts associated with it. The review begins with the linguistic background of the Nguni languages, with a particular focus on isiZulu, covering aspects of phonology, phonetics, and morphology. The chapter then examines different strategies and techniques that are currently employed in the literature to conduct morphological segmentation.
These range from rule-based approaches to data-hungry approaches such as machine learning and deep learning techniques. The chapter then looks into different metrics that are commonly used to measure the performance of morphological segmentation, and concludes with the trends and open questions presented within this domain.

3.2 Background and Linguistic Characteristics of isiZulu

3.2.1 Background and History of isiZulu

This section provides a brief background and describes the linguistic features of isiZulu (and its related languages) that require the kind of segmentation described in this study. Zulu or isiZulu (as an endonym) is one of the 12 official languages in South Africa, and it is considered to be the most widely spoken indigenous language in the country. IsiZulu speakers comprise approximately a quarter of the population, with around 15.1 million home language speakers out of the population of 62 million people in South Africa (StatsSA, 2022a). The majority of isiZulu native speakers are concentrated in KwaZulu-Natal, where the language is predominantly spoken in 80% of households, followed by Mpumalanga, where it is the primary language in 27.8% of households, as shown in Figure 3.1. Apart from South Africa, isiZulu is also spoken in other Southern African countries, including Eswatini, Mozambique, and Namibia, although with smaller populations of speakers (Asante & Mazama, 2009).

Figure 3.1: Distribution of isiZulu Speakers Across South African Provinces, Based on Statistics 2022 Data

IsiZulu is part of the Nguni language family, which is