PLAGIARISM DECLARATION

I ______________________________________________________________________________ (full name and surname and student number) hereby declare that this assignment / paper / project / portfolio is my own work. I further declare that:

1. the text and bibliography reflect the sources I have consulted, and
2. where I have made reproductions of any literary or graphic work(s) from someone else, I have obtained the necessary prior written approval of the relevant author(s)/publisher(s)/creator(s) of such works and/or, where applicable, from the Dramatic, Artistic and Literary Rights Organisation (DALRO).
3. sections with no source referrals are my own ideas, arguments and/or conclusions.

Signature: ____________________________
Student number: ____________________________
Date: __________________________

Sthembiso Nkosentsha Mkhwanazi
19 March 2025

Acknowledgements

I begin by thanking God for His grace in my life, which has enabled me not only to will but also to do this work (Philippians 2:13). Throughout it all, I remain deeply grateful for the incredible people I’ve had the privilege to meet, serendipitously, many of whom have contributed significantly to this study in ways that words may never fully capture. I will, however, attempt to acknowledge a few by name.

First and foremost, I wish to express my sincere gratitude to Dr Laurette Marais, my main supervisor, for her support throughout this journey. Her guidance, constructive feedback, and steady encouragement have been invaluable in the development and completion of this dissertation, which I’m glad to present after a long and challenging process.

My heartfelt thanks also go to Prof. Roelien Goede, who kindly assumed the role of co-supervisor for this study. Her expert advice, supportive feedback, and readiness to assist despite her demanding schedule have been invaluable throughout this journey.
I am deeply grateful to my family for their enduring love and unwavering support, especially during moments of doubt, even when they did not fully understand the “why” behind this path. To my mother and my siblings Xolile, Sandile, and Akhona, I just wanna say, “ngiyabonga ngakho konke”.

To my colleagues at the Council for Scientific and Industrial Research (CSIR), thank you for your collaboration, encouragement, and shared pursuit of ‘Touching lives through innovation’, and for being EPIC. Your support has made this process even more meaningful.

I also wish to acknowledge the intellectually stimulating environments of the seminars where this work was preliminarily presented, including DHASA, AI Expo Africa, IndabaX South Africa, and Hundzula Retreat. These platforms not only allowed me to share my work but also provided valuable discussions and feedback that refined this study.

Finally, I extend my deepest gratitude to the CSIR as an organisation for financially sponsoring this study, and to the National Integrated Cyber Infrastructure System (NICIS) and its Centre for High Performance Computing (CHPC) for providing the infrastructure that enabled the technical execution of this research.

Abstract

IsiZulu, one of South Africa’s most widely spoken languages, is classified as a low-resource language, especially regarding digital tools. As part of the Nguni family, isiZulu exhibits complex morphology and conjunctive orthography. These features result in data sparsity, as a single root or stem may appear in numerous morphological variants, complicating language modelling. This underscores the importance of morphological segmentation, a natural language processing (NLP) task that decomposes words into their smallest meaningful units (morphemes). Rule-based methods yield high accuracy in low-resource contexts but typically lack robustness and are costly to develop.
Conversely, machine learning approaches require large, high-quality datasets, often unavailable for low-resource languages. To address these challenges, this study employs a hybrid approach using a rule-based system, the isiZulu Resource Grammar (ZRG), to generate synthetic datasets with varying segmentation granularities. These datasets underwent data augmentation through syntactic tree manipulation, significantly increasing their size and diversity. Subsequently, these data were used to train supervised machine learning models for morphological segmentation: Conditional Random Fields (CRF), Long Short-Term Memory (LSTM), and Transformer-based models. The effectiveness of these models was assessed intrinsically, using precision, recall, F1 score, BLEU, and chrF, and extrinsically, by evaluating their impact on Neural Machine Translation (NMT) quality for isiZulu-English translation. Intrinsic evaluation showed that the Transformer model consistently outperformed the CRF and LSTM models, achieving segmentation accuracy above 0.9 across all metrics and granularity styles. Additionally, the hybrid approach demonstrated superior robustness, effectively handling out-of-vocabulary (OOV) words and performing segmentation 30 times faster than ZRG alone. Extrinsic evaluation confirmed that segmentation improved translation quality, with Segmenter Two achieving the highest BLEU score (0.235), representing a 25.0% improvement over the unsegmented baseline (0.188). These findings highlight the effectiveness of integrating rule-based and machine learning approaches for morphological segmentation, offering a scalable solution for processing low-resource languages with complex morphologies such as isiZulu in NLP applications.

Keywords: Agglutinative Languages; Morphological Segmentation; isiZulu; Supervised Segmenter Learning; Rule-Based Segmenter.

Contents

1 Introduction
1.1 Introduction to the Study
1.2 Background and Contextualisation
1.3 Problem Statement
1.4 Research Questions, Objectives and Hypotheses
1.4.1 Research Questions
1.4.2 Research Objectives
1.5 Research Methodology
1.5.1 Hypotheses
1.6 Ethical Considerations
1.6.1 Data Collection and Use
1.6.2 Privacy and Cultural Sensitivity
1.6.3 Transparency and Reproducibility
1.7 Dissertation Structure
1.8 Chapter Summary

2 Research Methodology
2.1 Introduction
2.2 The Concept of Research
2.3 Research Paradigms
2.3.1 Ontological Assumptions
2.3.2 Epistemological Assumptions
2.3.3 Axiological Assumptions
2.3.4 Methodology
2.4 Research Philosophy
2.4.1 Positivism
2.4.2 Interpretivism
2.4.3 Pragmatism
2.4.4 Critical Social Theory
2.4.5 Critical Realism
2.5 Positioning the Present Study
2.6 Hypothetico-Deductive Methodology
2.6.1 Application in this Research
2.7 Chapter Summary

3 Morphological Segmentation
3.1 Introduction
3.2 Background and Linguistic Characteristics of isiZulu
3.2.1 Background and History of isiZulu
3.2.2 Linguistic Characteristics of isiZulu
3.2.3 Morphological Characteristics: Overview
3.3 Morphological Segmentation: Overview
3.3.1 Introduction to Morphological Segmentation
3.3.2 Canonical Segmentation
3.3.3 Surface Segmentation
3.4 Approaches to Morphological Segmentation
3.4.1 Rule-Based Approaches
3.4.2 Statistical Machine Learning Approaches
3.4.3 Deep Learning Approaches
3.5 Morphological Segmentation Metrics
3.5.1 Intrinsic evaluation
3.5.2 Extrinsic Evaluation
3.6 Chapter Summary
4 Data Preparation
4.1 Ukwabelana Dataset
4.1.1 Dataset Composition
4.1.2 Development and Annotation Process
4.1.3 Limitations and Challenges
4.2 National Centre for Human Language Technology Text Corpora
4.2.1 Dataset Composition
4.2.2 Development and Annotation Process
4.2.3 Limitations and Challenges
4.3 Reflection on ZRG’s Limitations and Advantages
4.3.1 Limitations of the ZRG
4.3.2 Strengths of the ZRG
4.4 Data Acquisition
4.4.1 Data Pre-Processing
4.4.2 Data Organisation
4.5 Data Parsing
4.5.1 Dataset Preparation and Batching
4.5.2 Parallel Processing with Docker
4.5.3 Runtime Parsing and Output Management
4.5.4 Error Handling and Output Consolidation
4.5.5 Parsing Output Example
4.5.6 Processing Trees
4.5.7 Example of Application
4.6 Data Augmentation
4.6.1 Workflow for Data Augmentation
4.6.2 Augmentation Techniques
4.7 ZRG Linearisation
4.7.1 Different Linearisation Strategies
4.7.2 Application of Linearisation Strategies
4.8 Final Dataset Output
4.9 Chapter Summary

5 Design of Models
5.1 Introduction
5.2 Models Selection
5.2.1 Criteria for Model Selection
5.2.2 Models Considered
5.3 Models Design and Training
5.3.1 Transformer Models
5.3.2 Long Short-Term Memory Models
5.3.3 Conditional Random Field Models
5.4 Chapter Summary

6 Experiment Evaluation
6.1 Model Performance Analysis: Training and Validation
6.1.1 Training Loss
6.1.2 Validation Loss
6.2 Intrinsic Evaluation
6.2.1 Model Selection for Downstream Evaluation
6.2.2 Data Augmentation Impact
6.2.3 Investigating the Robustness of the Transformer Models
6.2.4 Investigating Efficiency of the Transformer Model
6.3 Extrinsic Evaluation
6.3.1 Neural Machine Translation Model
6.3.2 Neural Machine Learning Evaluation
6.4 Chapter Summary

7 Findings, Conclusion, and Future Research
7.1 Introduction
7.2 Summary of the Chapters
7.3 Findings Regarding Objectives and Research Questions
7.3.1 Summary of Achievements in Relation to the Objectives
7.4 Evaluating the Hypotheses
7.4.1 First Hypothesis: Performance of the Hybrid System
7.4.2 Second Hypothesis
7.5 Answering the Research Questions
7.5.1 Research Question 1
7.5.2 Research Question 2
7.6 Challenges and Limitations
7.6.1 Challenges
7.6.2 Limitations
7.7 Conclusion

A Appendix
A.1 Models Hyper-Parameters
A.1.1 Hyper-Parameters for Unaugmented Models
A.1.2 Hyper-Parameters for English-to-IsiZulu Translation Models

List of Figures

2.1 Summary of the research methodology.
3.1 Distribution of isiZulu Speakers Across South African Provinces, Based on Statistics 2022 Data.
3.2 Official South African Southern Bantu languages Guthrie group.
3.3 A simple diagram of the encoder-decoder.
4.1 Parse abstract syntax tree generated by ZRG with default segmentation.
4.2 The first found subtrees from the bigger tree.
4.3 The second found subtrees from the bigger tree.
4.4 The abstract syntax tree for “amantombazane amahle amathathu ahlala ekhaya nomama wabo”.
4.5 The abstract syntax tree for “ubaba kade enezimoto ezimbili ezimbi ezishayelwa abafana bakhe abadlala ibhola”.
4.6 The abstract syntax tree for “abafana bakabhuti banezinja ezinenkani”.
5.1 An adapted vanilla Transformer architecture.
6.1 Transformer: training loss over epochs for different segmenters.
6.2 LSTM: training loss over epochs for different segmenters.
6.3 Transformer: validation loss over epochs for different segmenters.
6.4 LSTM: validation loss over epochs for different segmenters.
6.5 Training loss across epochs: translation models with varying segmentation strategies.
6.6 Validation loss across epochs: translation models with varying segmentation strategies.
6.7 Validation loss across epochs for the Unsegmented model.

List of Tables

3.1 Distribution of languages spoken in South Africa.
3.2 Noun classes in isiZulu showing singular and plural examples.
3.3 Contingency table: confusion matrix.
4.1 Manipulation of terms and resulting sentences.
4.2 Comparison of linearisation strategies for the sentence “abafana bakabhuti banezinja ezinenkani”.
5.1 Summary of data splitting for training, validation, and testing.
5.2 Transformer model architecture parameters.
5.3 Hyper-parameter search space configuration.
5.4 Final model configurations for segmenters across different linearisation strategies.
5.5 LSTM model default configuration.
5.6 Hyper-parameter search space.
5.7 LSTM optimal hyper-parameters for Segmenters One, Two, and Three.
5.8 CRF model base parameters.
5.9 Grid search parameter ranges for GridSearch.
5.10 Optimal CRF parameters across different segmentation schemes.
6.1 Morphological segmentation examples: input words with target and Segmenter One predicted segmentations.
6.2 Model performance comparison across different segmentation approaches (BLEU and chrF scores).
6.3 Comparison of precision, recall, and F1 scores across different architectures and segmentation styles.
6.4 Comparison of key metrics scores between Transformer models trained on unaugmented data and augmented data.
6.5 Segmentation results from different segmenters.
6.6 Gold standard segmentations and failure analysis.
6.7 Evaluation metrics for different segmenters.
6.8 Execution times for segmenters using ZRG and Transformer models.
6.9 Optimal hyper-parameters for translation models across different segmentation approaches.
6.10 IsiZulu–English translation metrics for segmentation models (test set).
6.11 Translation quality metrics for different segmentation models in English-isiZulu translation.
6.12 Translation scores for segmentation models on the updated FLORES dataset (isiZulu–English).
6.13 Translation quality metrics scores for different segmentation styles using the updated FLORES dataset (English-isiZulu).
A.1 Model configurations for the unaugmented segmentation models.
A.2 Model configurations for English-to-IsiZulu translation with and without segmentation.

Table of abbreviations

A table containing the list of abbreviations used throughout the text.
ZRG     isiZulu Resource Grammar
NMT     Neural Machine Translation
NLLB    No Language Left Behind
FLORES  Few-shot Learning for Zero-shot Evaluation in Translation
CRF     Conditional Random Fields
LSTM    Long Short-Term Memory
GRU     Gated Recurrent Units
BLEU    Bilingual Evaluation Understudy
chrF    Character F-score
MEMM    Maximum Entropy Markov Model
OOV     Out-of-Vocabulary
POS     Part-of-Speech
GF      Grammatical Framework
AST     Abstract Syntax Tree
ML      Machine Learning
HPO     Hyper-Parameter Optimisation
GPU     Graphics Processing Unit
PGF     Portable Grammar Format
AI      Artificial Intelligence
NLP     Natural Language Processing
RNN     Recurrent Neural Network
FFN     Feed-Forward Network
Lin A   Fine-grained Segmentation Strategy
Lin B   Moderate Segmentation Strategy
Lin C   Coarse-grained Segmentation Strategy

Chapter 1

Introduction

1.1 Introduction to the Study

Morphological segmentation is a Natural Language Processing (NLP) task that has shown much potential to improve the automatic processing of morphologically complex languages such as isiZulu (Creutz, 2006; Creutz & Lagus, 2007; Mager et al., 2022; Tukeyev et al., 2020). Approaches to segmentation can be divided into two broad categories: rule-based systems and data-driven (machine learning) techniques, which can be further divided into unsupervised, supervised, and semi-supervised methods. While fully supervised methods typically achieve high performance, they rely on substantial amounts of annotated data, a resource that is often unavailable for low-resourced languages like isiZulu. As a result, researchers often have to rely on rule-based, unsupervised, or semi-supervised methods, which require fewer annotations. However, these methods present notable limitations, including lower accuracy, scalability challenges, and a heavy dependence on linguistic expertise.
These constraints hinder their integration into NLP pipelines that require high-precision segmentation, emphasising the need for more effective supervised approaches tailored to low-resource languages.

The present study aims to provide insights into the use of an existing rule-based system to generate synthetic data for investigating supervised morphological surface segmentation of isiZulu at different granularity levels. The goal is to leverage the strengths of both rule-based systems and machine learning approaches to overcome the weaknesses of each. Due to the lack of substantial annotated data for morphological segmentation, this study uses the isiZulu Resource Grammar (ZRG), a rule-based grammar developed using the Grammatical Framework (GF) programming language, to generate morphologically surface-segmented isiZulu text with three different segmentation styles. Additionally, the study explores data augmentation as a strategy to expand the dataset further. ZRG performs a deep morphosyntactic analysis of the text, which, while linguistically robust, is computationally slow. Furthermore, as a rule-based system, it is brittle when encountering words outside its predefined lexicon, limiting its generalisability.

To enhance segmentation beyond the limitations of rule-based methods, the research adopts a machine learning-based approach to morphological surface segmentation, leveraging supervised learning techniques. These methods are preferred due to their adaptability and superior performance, especially in real-world applications where flexibility and robustness are critical.

1.2 Background and Contextualisation

IsiZulu is one of South Africa’s 12 official languages. It is a member of the Nguni language group, which forms part of the larger Niger-Congo language family (Mesham et al., 2021).
Despite being widely spoken, isiZulu, along with other Nguni languages (isiXhosa, Siswati, and isiNdebele), is considered a low-resourced language. Indeed, isiZulu is the most widely spoken language in South Africa, with approximately 15 million native speakers, accounting for 24% of the country’s population (StatsSA, 2022b).

One of the notable characteristics of Nguni languages is their agglutinating morphology and conjunctive orthography, in contrast to the disjunctive orthography commonly associated with other South African languages, most notably the Sotho languages. While the Sotho languages share agglutinating morphology with isiZulu and the other Nguni languages, they use disjunctive orthography, where morphemes are written with spaces in between (Bosch & Pretorius, 2002; Taljard & Bosch, 2006).

IsiZulu is considered a low-resourced language due to limited access to linguistic resources such as machine-readable texts. This scarcity of resources has significantly hindered the development of the technological tools necessary for ensuring the long-term digital preservation and vitality of isiZulu and similar languages (Loubser & Puttkammer, 2020). This challenge is particularly pronounced for resource-scarce agglutinative languages, where a single root or stem can combine with numerous morphemes to generate hundreds of unique word forms. For Nguni languages like isiZulu, the use of conjunctive orthography further complicates this issue. The result is data sparsity, as even large corpora may fail to encompass the full range of root and stem variations. This data sparsity creates significant challenges for language modelling approaches that rely on token identities. Consequently, these challenges have motivated the present research into morphological segmentation as a potential solution.
Morphological segmentation refers to the process of decomposing an orthographic word (its written form) into its constituent morphemes, which are the smallest meaning-bearing units of language (Creutz & Lagus, 2007). This technique has been found to be useful in NLP applications such as machine translation (MT) (Mager et al., 2022; Tukeyev et al., 2020) and automatic speech recognition (ASR) (Creutz, 2006), particularly when working with low-resource languages that feature complex morphological structures. This entails addressing the combinatorial explosion inherent in agglutinative morphology, where modifications in morpheme attachment to a root or stem can completely alter a word’s meaning. Such complexity often results in corpora containing limited variations of stems, leading to language models that struggle with out-of-vocabulary (OOV) issues when morphological segmentation is not applied. By breaking words into smaller, meaningful units, morphological segmentation facilitates better generalisation and improves performance in language modelling tasks for morphologically rich, resource-scarce languages.

The effort towards morphological segmentation has been divided into two goals: canonical segmentation and surface segmentation (Cotterell et al., 2016a). Canonical segmentation entails segmenting a word into its underlying morphemes, with the morphemes represented in their canonical or lemmatised forms. The concatenation of these morphemes does not necessarily result in the orthographic word, whereas with surface segmentation, the word is divided into morpheme-based substrings (morphs) that can be combined to form the orthographic word (Cotterell et al., 2016a).

There are two common techniques used for morphological segmentation: rule-based and machine learning-based (ML) approaches (Anand Kumar et al., 2010; Goldsmith, 2001).
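The distinction between the two goals can be made concrete with a short sketch. The English word “achievability”, often used to illustrate canonical segmentation in the literature around Cotterell et al. (2016a), is taken here as an illustrative example; the defining property being checked is that surface morphs, unlike canonical morphemes, concatenate back to the orthographic word.

```python
word = "achievability"

# Surface segmentation: morphs are substrings of the orthographic word.
surface = ["achiev", "abil", "ity"]

# Canonical segmentation: morphemes in their canonical (lemmatised) forms;
# their concatenation need not reproduce the written word.
canonical = ["achieve", "able", "ity"]

def restores_word(segments, word):
    """Check the defining property of a surface segmentation."""
    return "".join(segments) == word

print(restores_word(surface, word))    # True: the morphs rebuild the word
print(restores_word(canonical, word))  # False: "achievableity" != "achievability"
```

The same property is what allows surface segmentations to be undone trivially in a downstream pipeline, whereas canonical forms require an additional mapping back to surface realisations.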
Rule-based segmentation relies on predefined rules based on expert knowledge as well as the morphological structure of the language (Eiselen & Puttkammer, 2014a). Rule-based methods are considered time-consuming and expensive to develop, and since they use rules to segment texts, they tend to be fragile when encountering new words not within the lexicon (Villegas-Ch et al., 2024). Despite these drawbacks, rule-based methods remain useful, especially in languages with well-documented morphological rules, serving as baselines or complementary tools in hybrid approaches (Grönroos et al., 2016; Zbib et al., 2012).

Machine learning-based segmentation, on the other hand, uses algorithms to learn from data and identify morphemes based on stochastic and probabilistic methods (Goldsmith, 2001). Unlike rule-based systems, these techniques do not rely on explicitly defined rules; instead, they are heavily dependent on the availability of sufficient training data, which may compromise their reliability if inadequate. The primary machine learning methods are supervised, unsupervised, semi-supervised, and reinforcement learning (Murphy, 2012a), with the first three being the most commonly used in morphological segmentation.

The dataset used in these machine learning-based approaches for morphological segmentation can either be unsegmented or segmented, i.e. annotated with morphological boundaries, before being used to train the model. The former approach corresponds with unsupervised segmentation, where the model learns segmentation patterns directly from unannotated data, while the latter corresponds with supervised segmentation, where the model is trained on data with predefined segmentation (Eskander et al., 2022). The third approach, known as semi-supervised, combines these approaches.
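For supervised training, pre-segmented words are commonly encoded as character-level label sequences, which turns surface segmentation into a sequence-labelling problem (the framing typically used with CRF models). The sketch below uses a BMES scheme (Begin/Middle/End of a multi-character morph, Single for a one-character morph); the scheme and the illustrative segmentation ngi-ya-bonga of “ngiyabonga” are assumptions for exposition, not taken from this study’s annotation format.

```python
def to_bmes(morphs):
    """Encode a surface-segmented word as character-level BMES labels."""
    labels = []
    for m in morphs:
        if len(m) == 1:
            labels.append("S")
        else:
            labels.extend(["B"] + ["M"] * (len(m) - 2) + ["E"])
    return labels

def from_bmes(word, labels):
    """Recover morph boundaries from a BMES label sequence."""
    morphs, current = [], ""
    for ch, lab in zip(word, labels):
        current += ch
        if lab in ("E", "S"):       # a morph ends here
            morphs.append(current)
            current = ""
    if current:                      # tolerate a truncated label sequence
        morphs.append(current)
    return morphs

labels = to_bmes(["ngi", "ya", "bonga"])
print(labels)                        # ['B', 'M', 'E', 'B', 'E', 'B', 'M', 'M', 'M', 'E']
print(from_bmes("ngiyabonga", labels))  # ['ngi', 'ya', 'bonga']
```

Encoding and decoding are exact inverses, so a model that predicts one label per character fully determines a surface segmentation.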
All these techniques require a substantial amount of data to learn segmentation patterns accurately, and their inherent adaptability enables them to perform well even on new words. This flexibility renders them suitable for deployment in real-world applications where not all words are present in the training set. However, obtaining sufficient annotated or unannotated data is a significant challenge, particularly for low-resource languages.

Morphological segmentation plays a critical role in NLP, particularly for morphologically complex and low-resource languages like isiZulu. The choice between surface and canonical morphological segmentation is influenced by several factors, including the specific linguistic characteristics of the language in question, the intended application of the segmentation, and the availability of annotated data (Occhipinti, 2024; Rice et al., 2024). When comparing the two, surface segmentation presents distinct advantages, particularly for language modelling. Canonical segmentation decomposes a word into its underlying morphemes in their canonical (or lemmatised) forms (Cotterell et al., 2016a). While valuable for morphological analysis, the resulting morphemes often do not correspond directly with the actual orthographic forms of the language. This discrepancy can be problematic for language modelling, because the model may struggle to map canonical forms back to their surface realisations. In contrast, surface segmentation divides words into morpheme-based substrings that can be reassembled into the original orthographic word. This method captures the surface forms of words, including the phonological variations that occur when morphemes combine.

A fundamental challenge in surface segmentation is determining the optimal segmentation granularity; that is, the level of detail or fineness with which words are segmented into their morphemic components.
When surface segmentation is used, a word may have multiple valid segmentations depending on the granularity level. However, determining the optimal segmentation granularity for training language models remains an open research question (Ataman & Federico, 2018; Meyer & Buys, 2022; Salesky et al., 2020). If segmentation is too aggressive, language models may have to learn too much compositional semantics to capture meaning accurately, while if it is too conservative, the data sparsity problem may not be sufficiently addressed.

One way that has been explored to address this trade-off is to use methods such as Byte-Pair Encoding (BPE), where the hyper-parameter fixing the granularity must be optimised (Salesky et al., 2020). While BPE and other subword tokenisation techniques attempt to address this issue, there is no consensus on the best granularity level for morphologically rich languages (Salesky et al., 2020).

Most morphological (canonical or surface) segmentation work in the literature follows the unsupervised approach, which is especially common for low-resource languages like isiZulu that lack sufficient annotated data (Bergmanis & Goldwater, 2017; Creutz et al., 2007a; Goldsmith, 2001; Hammarström & Borin, 2011; Mzamo et al., 2019a). This technique is attractive because it can be applied to any language with adequate digital textual data, without the need for expert annotations (Can & Manandhar, 2014; Ruokolainen et al., 2016). However, despite its popularity, the unsupervised approach tends to achieve lower accuracy in morphological segmentation due to the lack of linguistic guidance (Wang et al., 2016b). This limitation reduces its effectiveness in downstream applications, where precise segmentation may have a significant effect on performance.
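Returning to the BPE granularity hyper-parameter mentioned above: the sketch below implements the core greedy merge-learning loop from scratch (a simplified toy, not the BPE variant or data used in this study). The number of merge operations directly controls how coarse the resulting subword units are.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Greedily learn BPE merges; num_merges is the granularity hyper-parameter."""
    vocab = Counter(tuple(w) for w in words)  # each word as a symbol sequence
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)      # most frequent adjacent pair
        merges.append(best)
        merged = Counter()
        for word, freq in vocab.items():
            merged[tuple(apply_merge(word, best))] += freq
        vocab = merged
    return merges

def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of pair with the fused symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def segment(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols

# Toy corpus of isiZulu-like word forms (illustrative only).
corpus = ["ngiyabonga", "ngiyathanda", "uyabonga", "bayathanda"] * 3
fine = segment("ngiyabonga", learn_bpe(corpus, 3))     # few merges: fine-grained
coarse = segment("ngiyabonga", learn_bpe(corpus, 15))  # more merges: coarser
print(fine, coarse)
```

With more merges the same word is cut into fewer, longer units; crucially, both segmentations still concatenate back to the original word, but neither is guaranteed to align with true morpheme boundaries, which is the limitation noted above for morphologically rich languages.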
An alternative to unsupervised morphological segmentation is supervised segmentation, which relies on annotated datasets (Moeng et al., 2022; Pranjić et al., 2024; Puttkammer & Du Toit, 2021). Supervised machine learning approaches are often preferred for their higher accuracy and their ability to incorporate linguistic knowledge (Belth, 2024). Once trained, these models can segment words efficiently and generalise well to unseen data, making them more robust and less brittle than rule-based approaches. However, a major drawback of supervised segmentation is the need for large amounts of annotated data, which is costly and labour-intensive to produce, especially for low-resource languages. To mitigate this, researchers have explored semi-supervised approaches, which leverage a small set of annotated data alongside unlabelled data to improve performance while reducing annotation effort (Ruokolainen et al., 2014; Yusupujiang, 2018).

In contrast, rule-based morphological segmentation relies on predefined linguistic rules crafted by experts to reflect a language's morphological structure (Eiselen & Puttkammer, 2014a; Villegas-Ch et al., 2024). While this approach can yield high accuracy in well-defined linguistic contexts, it has several drawbacks. Rule-based systems are labour-intensive, expensive to develop, and often struggle with adaptability (Dasgupta & Ng, 2007; Villegas-Ch et al., 2024; Westhelle et al., 2022). Their performance also deteriorates when encountering new or unpredictable words, limiting their applicability in dynamic linguistic environments. Scalability is another significant challenge: for languages with complex morphologies, the number of required rules can be overwhelming. Furthermore, morphophonological alternations introduce additional complexities, making it difficult to capture all variations through static rules (Joseph & Anto, 2015; Selvam & Natarajan, 2009).
The choice of segmentation approach depends on the available linguistic resources and the intended applications. Rule-based methods are effective when comprehensive linguistic knowledge is available, whereas unsupervised approaches are preferable when annotated data is scarce. Supervised learning, despite its reliance on annotated data, offers superior accuracy and generalisation when adequate training data is available. In data-sparse settings, semi-supervised techniques provide a viable compromise, leveraging small annotated corpora alongside large raw text corpora for enhanced learning.

To date, supervised surface segmentation for isiZulu remains underexplored, despite its potential to improve downstream NLP tasks. A primary limitation in this regard is the lack of annotated surface-segmented data required for training such models. To address this gap, the present study leverages the ZRG (Marais & Pretorius, 2023), a rule-based system implemented in the GF programming language. The ZRG allows for the morphosyntactic analysis of isiZulu by generating syntax trees through parsing, which involves a deep linguistic analysis of the text. While this process can be slow, it exposes lexical categories (e.g., tense, polarity, and agreement markers) that can be systematically manipulated to generate synthetic data, a process referred to as data augmentation.

Additionally, the ZRG supports a linearisation process, which converts syntax trees into human-readable text. During linearisation, the grammar employs a binding token, facilitating the construction of surface forms and allowing different segmentation granularities to be specified. This method allows for flexibility regarding segmentation strategies, enabling the exploration of optimal strategies based on linguistic approaches to isiZulu subword modelling.
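The role of the binding token can be sketched as follows. In GF, morphemes joined by the BIND token are glued together during linearisation; in plain shell output the token is typically rendered as `&+`. The linearised strings below are hypothetical (not actual ZRG output), but they show how a single binding-token-annotated linearisation can yield both the orthographic word and a surface-segmented form, and how granularity is controlled by where the tokens are placed:

```python
BIND = "&+"  # GF's binding token as rendered in plain linearised output

def to_orthographic(linearised: str) -> str:
    """Glue morphemes across binding tokens to recover the written form."""
    return linearised.replace(f" {BIND} ", "")

def to_surface_segmented(linearised: str, sep: str = "-") -> str:
    """Keep the morpheme boundaries, marking each with a separator."""
    return linearised.replace(f" {BIND} ", sep)

# Hypothetical fine-grained linearisation of "ngiyabonga":
fine = "ngi &+ ya &+ bonga"
# A coarser granularity would simply place fewer binding tokens:
coarse = "ngiya &+ bonga"

print(to_orthographic(fine))         # orthographic word
print(to_surface_segmented(fine))    # fine-grained surface segmentation
print(to_surface_segmented(coarse))  # coarser surface segmentation
```

Because both granularities reassemble into the same orthographic word, the grammar can emit training data at several granularities from the same underlying syntax trees.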
It is important to note that the ZRG, being a rule-based system, suffers from the aforementioned limitations of rule-based approaches: it is less robust when encountering words outside its defined lexicon, which it then fails to parse, so that no syntax tree can be generated. It is also very slow, making it a less preferred strategy for downstream tasks where speed and accuracy are important factors. Moreover, the ZRG was not originally implemented as a surface segmentation tool.

1.3 Problem Statement

To enable the interplay between rule-based and machine learning approaches in investigating the morphological surface segmentation of isiZulu, this work leverages the strengths of the ZRG. The ZRG provides broad coverage of isiZulu, supports flexible segmentation styles, and enables the generation of surface-segmented data at different granularities in a linguistically informed manner. This approach eliminates the need for extensive system engineering or expert curation, which can be costly and time-consuming. However, recognising the limitations of rule-based systems, such as their lack of robustness and slower processing speeds, this study proposes training a supervised machine learning model on the ZRG-generated annotated data. Supervised models are well suited to generalising to unseen words, offering greater robustness and efficiency in segmentation, and therefore serve better as a pipeline for investigating morphological segmentation in downstream tasks where fast and accurate segmentation is crucial.

1.4 Research Questions, Objectives and Hypotheses

The discussion in the previous section leads to the two research questions to be addressed in this study.

1.4.1 Research Questions

This research seeks to address the following questions:

1.
How can a hybrid approach to supervised surface segmentation be pursued that leverages the strengths of a rule-based approach to overcome the weaknesses of a machine learning approach in the context of isiZulu?

2. How can such a hybrid approach be used to obtain insights into supervised surface segmentation strategies for isiZulu?

1.4.2 Research Objectives

In order to answer the research questions, the following primary objective will be pursued:

To investigate supervised surface segmentation of isiZulu text using synthetic data generation.

In order to reach this primary objective, the following sub-objectives have been identified.

• Sub-objective 1: To utilise the ZRG to parse a large corpus of isiZulu text to obtain a treebank.
• Sub-objective 2: To utilise the ZRG to augment the treebank to obtain additional synthetic data.
• Sub-objective 3: To investigate the suitability of different machine learning models for the supervised segmentation task.
• Sub-objective 4: To train segmentation models on different linearisations of the treebanks.
• Sub-objective 5: To evaluate segmentation models intrinsically and extrinsically according to suitable metrics.

1.5 Research Methodology

This study employs the hypothetico-deductive methodology to investigate the morphological surface segmentation of isiZulu text. Central to this methodology is the use of a rule-based GF grammar, the ZRG, to generate a large corpus of surface-segmented isiZulu text across different granularities. This ensures linguistically informed and accurate segmentation.

To further mitigate data sparsity, data augmentation techniques are applied to enhance the diversity of the generated corpus. These datasets are used to train supervised segmentation models, which are subsequently evaluated both intrinsically (using metrics such as F1-score, BLEU, and chrF) and extrinsically through their impact on neural machine translation (NMT) performance.
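As an illustration of the intrinsic side of such an evaluation, a boundary-level F1-score for segmentation can be computed by comparing the boundary positions implied by a predicted segmentation against a gold-standard one. This is a minimal sketch; the metric definitions actually used in this study may differ in detail:

```python
def boundaries(segments):
    """Character offsets at which morpheme boundaries fall (word end excluded)."""
    offsets, pos = set(), 0
    for seg in segments[:-1]:
        pos += len(seg)
        offsets.add(pos)
    return offsets

def boundary_f1(gold, predicted):
    """F1 over boundary positions for a single gold/predicted segmentation pair."""
    g, p = boundaries(gold), boundaries(predicted)
    tp = len(g & p)                            # correctly predicted boundaries
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical gold vs predicted segmentation of "ngiyabonga":
gold = ["ngi", "ya", "bonga"]
predicted = ["ngi", "yabonga"]  # only one of the two boundaries recovered
print(round(boundary_f1(gold, predicted), 3))
```

Here the predicted segmentation recovers one of two gold boundaries with no spurious ones, giving a precision of 1.0, a recall of 0.5, and an F1 of about 0.667.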
The approach integrates comprehensive data preparation, segmentation modelling, and systematic evaluation to address the linguistic and technological challenges inherent to low-resource, morphologically complex languages like isiZulu. By employing this hypothesis-driven methodology, the research ensures a structured and systematic exploration of the role of morphological segmentation in enhancing downstream NLP tasks such as NMT. Detailed discussions of each phase, including corpus generation, segmentation modelling, and evaluation strategies, are presented in subsequent chapters. The hypotheses governing this method are outlined below.

1.5.1 Hypotheses

In line with the hypothetico-deductive methodology, the following hypotheses are investigated.

1. In developing a hybrid approach to supervised surface segmentation of isiZulu, using a machine learning approach as a foundation will ensure robustness and efficiency, while a combination of synthetically generated and automatically annotated data will address the requirement of machine learning approaches for large amounts of training data. The following sub-hypotheses have been developed:

(a) The hybrid approach will exceed the rule-based approach in robustness;
(b) The hybrid approach will exceed the rule-based approach in efficiency; and
(c) The hybrid approach will reach a sufficient[1] level of accuracy in comparison to the rule-based approach.

2. The hybrid approach will result in one or more segmenters that improve performance in a downstream task such as machine translation.

1.6 Ethical Considerations

This study adheres to ethical principles to ensure responsible research practices in the development and evaluation of morphological segmentation. Ethical concerns are particularly relevant during data collection, processing, and model evaluation, given the potential social and linguistic implications of working with a low-resource language.
IsiZulu, like many other low-resource languages, lacks large publicly available datasets that are both diverse and of high quality. This poses ethical challenges during data collection, as the scarcity of digital data resources increases the risk of bias, inaccuracies, and misrepresentations that could negatively impact the digital development of the language. To mitigate these risks, careful and ethically responsible data collection is essential to ensure fair representation, linguistic integrity, and minimal bias in the development of computational models.

1.6.1 Data Collection and Use

The data used in this study was sourced from publicly available corpora and other commonly used datasets in NLP research. However, these datasets may have been originally collected through web scraping or other automated means, making them susceptible to biases present on the internet or in other sources. Consequently, inherent biases from the original data sources could be reflected in the dataset used for training the models in this study. In addition, special care has been taken to ensure that the collection and processing of isiZulu text respect copyright regulations and do not violate any intellectual property rights. The study follows best practices for dataset attribution, and proper citations are provided for all corpora used.

[1] The notion of sufficiency is discussed in detail in Chapter 7.

1.6.2 Privacy and Cultural Sensitivity

Since the datasets primarily consist of linguistic data, they contain minimal personally identifiable information (PII), making privacy risks relatively low. However, ethical considerations go beyond privacy concerns and extend to the linguistic and cultural sensitivity associated with working with indigenous languages such as isiZulu. This research does not seek to alter the natural structure of isiZulu, but rather aims to enhance its digital representation in NLP applications.
This is made possible through the ZRG, which has been developed with a focus on linguistic transparency and integrity (Marais & Pretorius, 2023).

1.6.3 Transparency and Reproducibility

To promote open science and ensure that the findings contribute to the broader NLP research community, this study adheres to transparency and reproducibility standards by:

1. Clearly documenting data sources, pre-processing steps, and model training configurations. All related resources will be made publicly available and accessible in the GitHub repository[2].
2. Providing implementation details to enable reproducibility.
3. Sharing insights into the strengths and limitations of the proposed methods.

By addressing these ethical considerations, this study aims to contribute responsibly to NLP research while upholding principles of fairness, linguistic inclusivity, and social responsibility. Additionally, this research proposal was submitted for ethical approval to North-West University, where it was reviewed and approved by the relevant Research Ethics Committee. The assigned ethics approval number is NWU-01328-23-A9.

[2] https://github.com/Sthesha/supervised-surface-segmentation

1.7 Dissertation Structure

This study is made up of the following chapters:

Chapter 1 – Introduction: This chapter provides an overview of the research, including the problem statement and the research objectives, and summarises the research methodology. It sets out the research context and outlines the dissertation's structure.

Chapter 2 – Research Methodology: This chapter outlines the hypothetico-deductive methodology guiding this study. It establishes the research paradigms and provides a structured methodology for investigating the morphological surface segmentation of isiZulu text using supervised machine learning approaches, in order to gain insight into different segmentation strategies.
Chapter 3 – Morphological Segmentation: This chapter provides an overview of morphological segmentation, covering the linguistic background, rule-based approaches with examples, machine learning techniques, and the metrics and benchmarks used for evaluating segmentation.

Chapter 4 – Data Preparation: This chapter outlines the process of preparing the data for training the supervised surface segmentation models. It includes the generation of surface-segmented data using the rule-based GF grammar, data augmentation techniques, and the creation of a representative dataset for isiZulu.

Chapter 5 – Models Design: This chapter outlines the design and development of the supervised surface segmentation models, detailing the selection of learning algorithms, model architecture, parameter optimisation, and the training process. It also provides the rationale for design choices and their alignment with the research objectives.

Chapter 6 – Experimental Evaluation: This chapter presents an evaluation of the developed supervised segmentation models, focusing on both intrinsic metrics (e.g., F1-score, BLEU, chrF) to assess segmentation quality and extrinsic metrics to evaluate their impact on downstream NMT performance. It includes comparative analyses across segmentation granularities and a discussion of the results in the context of addressing data sparsity in isiZulu.

Chapter 7 – Findings, Conclusion and Future Research: This chapter provides a concise summary of the preceding chapters and the results of the formulated hypotheses and, based on the key findings, addresses the outlined research questions. In addition, it reflects on the challenges and limitations that have been identified and suggests future research directions. Finally, it provides the study's overall conclusions.

1.8 Chapter Summary

This chapter provided an introduction, background, and overview of the study.
These were followed by its research objectives, which focus on addressing the challenges posed by data sparsity and morphological complexity in isiZulu, a low-resource and morphologically rich language. Through a systematic exploration of the research aims, the study emphasises the potential of supervised surface segmentation models to enhance the performance of NLP applications.

The objectives provide a clear roadmap for the study, beginning with the generation of a large, linguistically valid, and representative corpus of surface-segmented isiZulu text using a rule-based grammar, the ZRG. The study also proposed the integration of data augmentation techniques to address data sparsity and the identification of the most effective supervised learning algorithms for this investigation, both of which underline the study's focus on methodological rigour. This methodology is grounded in the hypothetico-deductive approach, which formulates several hypotheses aligned with the study objectives and tests them accordingly. Finally, the inclusion of an extrinsic evaluation objective, in which the impact of segmentation granularity is measured through NMT performance, emphasises the practical relevance and broader implications of this research. By addressing these objectives, this study aims to contribute meaningfully to the field of NLP for low-resource languages, setting the stage for future advancements in morphological segmentation and its applications.

Chapter 2

Research Methodology

2.1 Introduction

The present chapter sets out a methodology for developing a supervised surface segmentation model for isiZulu using synthetic data generated with the ZRG, and for evaluating its effectiveness in addressing the data sparsity issue in a downstream task, specifically a machine translation system.
Building upon the foundational concepts introduced in the previous chapter, this chapter elaborates on the systematic approach adopted to achieve the research objectives. It explores the research philosophies and paradigms that inform the study and outlines the specific methodological framework implemented.

The focus of this study is to investigate the surface segmentation of isiZulu text at different levels of granularity using a supervised machine learning approach developed on synthetic data. By leveraging the hypothetico-deductive method as the guiding approach, this study enables hypothesis formulation and empirical testing, ensuring that the chosen methodologies are well suited to tackle the challenges of agglutinative languages such as isiZulu. By embedding the research within established philosophical and paradigmatic frameworks, this chapter not only establishes a foundation for the methodological approach, but also emphasises the significance of morphological segmentation in addressing data sparsity for low-resource languages with complex morphologies, such as isiZulu, in the context of language modelling.

This chapter is structured as follows: Section 2.2 provides a foundational discussion of research as a concept, leading to Section 2.3 and Section 2.4, which discuss different research paradigms and philosophies. These are presented as assumptions and beliefs that guide a researcher's approach to conducting a study. Section 2.5 positions the present study within the introduced assumptions and beliefs, which also influence the chosen methodology, namely the hypothetico-deductive methodology, which is discussed in Section 2.6. Finally, Section 2.7 concludes the chapter with a summary.

2.2 The Concept of Research

The term "research" has been used in various ways in the literature, encompassing a range of interpretations.
According to the Cambridge Dictionary (n.d.), research is defined as "a detailed study of a subject, especially in order to discover (new) information or reach a (new) understanding". Khaldi (2017) describes research as the systematic acquisition of knowledge through meticulous and structured investigation. Similarly, Dane (1990) emphasises research as a process aimed at asking and answering questions about the world. As a process, research enables the adoption, refinement, or rejection of certain knowledge based on evidence and analysis. The aims of research are extensive, including exploring new phenomena, explaining causes and relationships, evaluating interventions, predicting outcomes, understanding complex systems, and developing practical solutions to real-world problems (Collis & Hussey, 2014; Garg, 2016). In addition, research plays a crucial role in theory development and testing, providing a foundation for scientific progress and application (Whetten, 1989).

According to Goundar (2012), three key conditions must be met when undertaking research. Firstly, the research process must be governed by clearly defined methodologies, such as qualitative or quantitative approaches, and informed by the standards and practices inherent to the researcher's discipline. Secondly, the researcher should employ tools, methods, and techniques that have been rigorously tested for reliability and validity. Reliability pertains to the consistency and repeatability of the chosen methods, while validity ensures that these methods are chosen correctly and accurately measure the intended phenomena. Lastly, the researcher must conduct the research objectively and without bias, striving to eliminate bias or vested interests that may compromise the impartiality and credibility of the outcomes.
Throughout the research process, conclusions should be reached based on the evidence gathered, avoiding the incorporation of personal biases or interests unrelated to the study's objectives (Goundar, 2012).

Having established a foundational understanding of research as a systematic process for acquiring and refining knowledge, it is essential to explore the broader philosophical and paradigmatic frameworks that underpin and guide research practices. These frameworks, including research paradigms and research philosophies, serve as the theoretical lenses through which researchers view the world, formulate questions, and select appropriate methods. By exploring these paradigms and philosophies, one can better understand how they influence the design, execution, and interpretation of this study.

2.3 Research Paradigms

It is important to consider the concept of research paradigms, because these guide the acquisition of knowledge and scientific discoveries through their underlying principles and assumptions (Park et al., 2020). The term "paradigm", often referred to as a worldview, encompasses the fundamental collection of beliefs, theories, philosophical assumptions, and ideas that researchers hold. These paradigms shape the design and execution of their research (Mafuwane et al., 2011). A research paradigm provides a lens through which researchers interpret the world and evaluate the methodological aspects of their study, ultimately influencing their choices of data collection and analysis strategies.

Conducting scientific research requires careful consideration of the research paradigm, as this paradigm establishes specific assumptions about how the world operates. Indeed, Myers (2002) emphasises the importance of adopting an appropriate research paradigm to ensure validity, thus allowing researchers to build on their philosophical stance. Furthermore, Clavier et al.
(2012) highlight that paradigms serve as critical reference points that help others to understand the researcher's underlying assumptions, which have a significant bearing on the study's design and outcomes. These paradigms are commonly categorised into four key domains: ontological, epistemological, axiological, and methodological (Creswell & Creswell, 2017).

2.3.1 Ontological Assumptions

Ontological assumptions refer to the nature of reality, whether it is regarded as objective, subjective, or socially constructed, and encompass the aspects to be investigated or encountered during the research process (Alele & Malau-Aduli, 2023). According to Guba and Lincoln (1994:108), ontological assumptions aim to address the question: What is the form and nature of reality, and what can be known about it? Simply put, these assumptions focus on understanding the fundamental nature of the phenomena being researched.

The nature of reality implies different ontological perspectives, which, in turn, influence the research approach. Assumptions about reality are often categorised along a spectrum, ranging from an objective reality that exists independently of human perception to a subjective reality that is shaped by individual experiences and interpretations (Ahmed, 2008). Ontological assumptions play a critical role in structuring a researcher's thinking about the topic under investigation. Such assumptions are central to guiding the formulation of research questions and determining how those questions are addressed (Kivunja & Kuyini, 2017).

2.3.2 Epistemological Assumptions

Epistemology is a branch of philosophy that focuses on the study of knowledge and beliefs, describing how knowledge about reality is acquired, conceptualised, and applied (Hatch, 2018). According to Guba and Lincoln (1994:108), this assumption addresses the question: What is the nature of the relationship between the knower or would-be knower and what can be known?
The response to this question is often influenced by the ontological assumptions underpinning the research. Epistemology seeks to explore questions such as: How is knowledge produced? What standards distinguish good knowledge from bad knowledge? And how should reality be defined or represented (Hatch, 2018)? Epistemology also pertains to how one can communicate this knowledge to others (Burrell & Morgan, 2019).

2.3.3 Axiological Assumptions

Axiology pertains to values and ethics, reflecting the role of a researcher's personal values and biases in the research process, including actions taken after the research is completed (Saunders, 2009). It influences the entire research process and is critical for ensuring the credibility and integrity of the study. As noted by Pretorius (2024), axiology encourages researchers to reflect on how their own values, beliefs, and biases shape the design, execution, and interpretation of their studies.

Furthermore, axiological assumptions are interconnected with ontological and epistemological assumptions. Understanding the nature of reality (ontology) enables researchers to assess the truth value of knowledge, while determining the knowability of this reality (epistemology) informs the science of truth. Once these foundations have been established, defining the correct science of values (axiology) becomes more straightforward (Engle, 2008).

2.3.4 Methodology

This dimension of the research undertaking is concerned with the general strategy or action plan that guides the choice and use of specific methods in the context of a particular research paradigm (Wahyuni, 2012). It refers to a system of methods, procedures, or principles employed to achieve specific objectives. In research, methodology can be defined as the systematic procedures or strategies that researchers use to describe, explain, and predict phenomena, or to carry out a research project (Aguiar, 2024).
According to Guba and Lincoln (1994), a methodology seeks to determine the steps that an inquirer must take to uncover what they believe can be known.

A research methodology helps to clarify several aspects of a research project, such as why it was conducted, how and why the hypotheses were formulated, and the methods or techniques chosen to address the research problem. This includes details about the data used, how it was collected, and related questions. A closely related concept, often used interchangeably, is research methods, which refer to the specific procedures employed in conducting research (Howell, 2012). As Goundar (2012:45) aptly states, research methodology "refers to more than a simple set of methods; rather it refers to the rationale and the philosophical assumptions that underlie a particular study relative to the scientific method", while research methods constitute the execution phase of both scientific and non-scientific research.

As with other paradigmatic assumptions, the specific methodological choices that a researcher adopts are influenced by their responses to ontological, epistemological, and axiological questions. This interconnectedness ensures that all these paradigms work together, enabling researchers to conduct systematic investigations that produce meaningful and reliable knowledge.

Various methodologies can be employed in different types of research, and the term is generally considered to include research design, data collection, and data analysis (Goundar, 2012). Research methodologies are commonly categorised into two main types: quantitative and qualitative methodologies (Onwuegbuzie & Leech, 2005). These are briefly discussed below.

1.
The quantitative methodology emphasises numerical attributes, relying on objective measurements and statistical analysis to explore the relationships between variables and describe the causes of change (Kornuta & Germaine, 2019). Quantitative research involves gathering numerical data, which is systematically analysed using statistical methods to address specific research questions or hypotheses. Researchers who use quantitative methods often propose hypotheses, collect numerical data, and use statistical evidence to support or refute these hypotheses, enabling broader generalisations based on a scientific approach (Rana et al., 2023).

2. In contrast to quantitative methods, qualitative research methodologies utilise non-numerical data, such as observations, textual analysis, and interviews, to answer open-ended questions such as "how" and "why". This makes them ideal for investigating non-linear phenomena such as experiences, perspectives, and behaviours that may be too complex to capture with quantitative methods (Goundar, 2012; Tenny et al., 2017). Rather than relying on statistical analyses or pre-formulated hypotheses, qualitative studies often involve open-ended inquiry, allowing hypotheses or theories to emerge naturally during the research process (Kornuta & Germaine, 2019). The researcher plays an active role in interpreting data, relying heavily on subjective insights and descriptive observations. While qualitative approaches inherently involve subjective interpretation by the researcher, these interpretations are grounded in a rigorous and structured analysis of the collected data, rather than mere personal opinions or ungrounded assumptions.

Understanding and reflecting upon these beliefs is essential, as they underpin the methodological choices and strategies employed in any research project.
By clarifying these perspectives, researchers can ensure that their study aligns with its philosophical foundations, enhancing its rigour and relevance. Again, these philosophical assumptions are part of the research philosophy, which includes various possibilities.

2.4 Research Philosophy

The term research philosophy is an evolving concept that lacks a unified definition among scholars and is often used interchangeably with research paradigms. According to Žukauskas et al. (2018), a research philosophy represents the system of thought that a researcher adopts to produce new and reliable knowledge about their research object. Similarly, Saunders (2009) describes it as the development of knowledge and its nature, emphasising that it embodies critical assumptions about a researcher's perspective of the world, which, in turn, influences their research strategy and methods. This study focuses on a subset of the research philosophies most widely discussed in the social sciences: positivism, interpretivism, pragmatism, and critical realism, as highlighted by Žukauskas et al. (2018) and Ryan (2018). These were selected due to their relevance to the study's objectives and their widespread application in social science research.

2.4.1 Positivism

The positivist philosophy asserts that the social world can be studied and understood objectively (Žukauskas et al., 2018), much like the natural world. This philosophical stance emphasises the use of scientific methods to produce credible data and facts, which are considered to be independent of human interpretation or bias. Positivist studies primarily rely on quantitative data, utilising experiments and established theories to formulate hypotheses that can be rigorously tested, confirmed, or rejected (Saunders, 2009). A key characteristic of positivist research is its focus on objectivity and detachment, aligning with its axiological assumption.
Researchers who adhere to this philosophy strive to minimise personal values or subjective influences that could impact research outcomes (Irshaidat, 2022). The ontological assumption upheld by positivism asserts the existence of a rigid, objective reality which can be understood through appropriate instruments, data collection, and empirical analysis. Epistemologically, positivism assumes that knowledge about this reality is acquired through rigorous observational strategies, ensuring that it remains objective and unaltered (Ringberg & Reihlen, 2008).

By emphasising empirical evidence and measurable phenomena, positivist approaches seek to arrive at conclusions that are replicable and universally valid. This makes positivism particularly suitable for research that involves systematic observation, controlled experimentation, and hypothesis-driven inquiry. In this study, a positivist approach is adopted to validate claims about the performance of supervised machine learning approaches, when these are trained on a combination of synthetically generated and automatically annotated data, to ascertain whether they enhance the robustness and efficiency of morphological segmentation models for isiZulu by addressing the data sparsity challenge inherent in low-resource, morphologically rich languages.

Having established the foundational understanding of research as a systematic process of acquiring and refining knowledge, it is essential to look at the broader philosophical and paradigmatic frameworks that underpin and guide research practices (Alele & Malau-Aduli, 2023). These frameworks, encompassing research paradigms and philosophies, serve as the theoretical lenses through which researchers view the world, formulate questions, and select appropriate methods.

2.4.2 Interpretivism

Another prominent research philosophy is interpretivism or constructivism, which emerged as a critique of positivism (Yong et al., 2021).
Interpretivism posits that the social world can be understood in multiple ways by different individuals, highlighting the crucial role of researchers in observing and interpreting this world (Ali, 2023). It is grounded in constructivist ontology, which asserts that reality is not discovered but constructed through human interactions and social processes (Goldkuhl, 2012). Researchers conducting interpretivist studies are often required to adopt an empathetic viewpoint, allowing them to better understand the perspectives of the research subjects. This reflects a fundamental epistemological assumption of interpretivism, namely that knowledge is subjective and arises from interactions between the researcher and the participants (Saunders, 2009). The associated axiological perspective acknowledges the researcher's participation in the data generation process. However, it also emphasises the importance of consciously listening to the participants and evaluating their input without contaminating it with personal biases or values, ensuring the reliability of the findings (Luyt et al., 2012). This approach is typically used to study people and their social interactions, contrasting with positivist approaches that often focus on natural sciences and non-human objects such as computers (Saunders, 2009).

2.4.3 Pragmatism

The two approaches discussed above, positivism and interpretivism, represent opposite ends of the research philosophy spectrum, each with clear advantages and limitations. However, many research studies do not fall entirely into one of these categories, as they often involve exploring both rigid and subjective realities.
This calls for a more flexible approach that combines elements from multiple philosophies, which is where pragmatism as a research philosophy comes in.

Pragmatism asserts the possibility of working with diverse assumptions, including ontological and epistemological perspectives from both positivist and interpretivist stances (Saunders, 2009). As noted by Alghamdi and Li (2013), pragmatism is not restricted to any one system of philosophy or reality. Instead, it allows researchers the flexibility to select the methods and techniques most suitable for addressing their research questions, whether these are quantitative or qualitative, and regardless of the type of data being analysed. This philosophical stance emphasises the practical outcomes of research and supports the use of mixed-methods approaches, combining the rigour of scientific inquiry with the depth of interpretive understanding. Pragmatism, therefore, serves as a valuable framework for studies that require methodological diversity and adaptability to address complex research problems effectively.

2.4.4 Critical Social Theory

Critical Social Theory (CST) refers to a broad range of theoretical approaches that critique existing social conditions and power structures with the goal of emancipating humans and the planet from the hostile consequences of modernity (Celikates & Flynn, 2023; Manners, 2020). CST is rooted in the Frankfurt School's "Critical Theory", which emerged in the 1930s as an interdisciplinary research approach combining philosophy and social science with the emancipatory objectives of various social and political movements. CST examines how different dimensions of domination and oppression, such as economic, racial, gendered, and political dimensions, shape society, and seeks not just to understand these dynamics but to challenge and change them (Browne, 2000).
The ontological assumption of CST can be described as historical realism or critical realism, though in a different sense from what is presented by Bhaskar (2013). This means that CST does believe in a reality (particularly social reality) that is "real" and has objective consequences; however, this "reality" is not fixed or immutable, but is shaped by historical and social forces, especially by relations of power and oppression (Scotland, 2012). Guba and Lincoln (1994:109) characterise the ontology of this research philosophy as "historical realism – a reality that is shaped by social, political, cultural, economic, ethnic, and gender values, crystallised over time". In simpler terms, what we take to be "reality" (especially in the social world) is the product of history and power dynamics – for example, racial categories or gender roles are real in their consequences, but are historically constructed rather than being natural phenomena.

In terms of epistemological assumptions, CST is described as transactional or subjectivist, meaning that the inquirer and the object of inquiry are not independent, but interactively linked. As a result, the values of the investigator inevitably influence the findings (Guba & Lincoln, 1994). Within this framework, the researcher's values serve as primary resources driving the research philosophy. Rather than attempting to remain a neutral observer, the researcher explicitly uses their values to uncover and interpret social truths. According to Cohen (as cited in Scotland, 2012:35), "what counts as knowledge is determined by the social and positional power of advocates of that knowledge".

Concerning axiological assumptions, CST places values at its core, explicitly addressing normative questions such as "What is intrinsically worthwhile?" (Scotland, 2012:13). This inherently normative orientation emphasises respect for cultural norms and societal values (Kivunja & Kuyini, 2017; Scotland, 2012).
Central to CST are values of emancipation and social justice, which form the foundational components of its research paradigm. Methodologically, CST favours approaches that promote critical insight, active participation, and social change. Since the ultimate goal of CST is not merely to study society but to transform it, the methodology employed is often described as dialogic (Kivunja & Kuyini, 2017). Ideology critique represents a core methodological practice, enabling critical analysis and interrogation of underlying values and assumptions to expose injustice (Scotland, 2012). Furthermore, CST frequently incorporates action research as a means of actively seeking to change social realities (Dieronitou, 2014).

2.4.5 Critical Realism

Critical realism (CR) is a relatively recent and comprehensive philosophy of science developed by the English philosopher Roy Bhaskar as an alternative to the prevailing philosophical paradigms of positivism, interpretivism, and pragmatism (Lawani, 2021). CR comprises two primary components: transcendental realism and critical naturalism, addressing the philosophy of science and of social science, respectively (Bhaskar, 2013; Bhaskar, 2014). Transcendental realism is focused on the ontology of the natural world, asserting that real mechanisms exist beyond empirical observation and hermeneutic interpretation. Critical naturalism extends this realist ontology to the social realm, suggesting that underlying social structures and mechanisms also exist beyond immediate observation and interpretation (Bhaskar, 2013; Zhang, 2023). Ultimately, critical realism asserts that an objective reality exists (as in positivism), but our knowledge of this reality is inherently partial and shaped by theoretical frameworks. Consequently, critical realism emphasises the necessity of exploring deeper structures and causal mechanisms that operate or exist beyond direct empirical observation.
Similarly to pragmatism, CR advocates for mixed-methods research. However, unlike pragmatism, which views ontological and epistemological assumptions as separable from research methods, CR explicitly associates these methods with its philosophical positions. Specifically, CR maintains a layered ontology consisting of three distinct domains: the real, the actual, and the empirical (Holmén, 2020). The real domain refers to the deep structures, properties, and mechanisms of entities that exist independently of observation. The actual domain encompasses the events generated by these underlying mechanisms, irrespective of whether they are observed. Finally, the empirical domain represents the observable experiences or events perceived by individuals. CR thus rejects simplistic reductions of reality to only what can be directly observed. Instead, it asserts that many critical aspects of social reality, such as social structures and class relations, are real but not directly observable, requiring inference based on their observable effects.

Methodologically, CR advocates for a pluralistic approach to research, employing any methods that effectively identify and explain causal mechanisms. Given that CR ontology is complex, with multiple layers of reality, both quantitative and qualitative methods can be combined to capture different aspects of a phenomenon (Sobh & Perry, 2006). This methodological flexibility allows researchers to investigate not only observable patterns, but also the underlying structures and causal relationships that shape social and natural phenomena.

2.5 Positioning the Present Study

This study is firmly grounded in a positivist philosophical framework, which emphasises objectivity, empirical evidence, and hypothesis-driven inquiry.
The ontological assumption underlying this research is that an objective reality exists, wherein morphological patterns (the structure of words and how morphemes combine) in isiZulu can be systematically measured and analysed. This perspective aligns with the positivist view that reality is independent of human perception and can be understood through structured observation and experimentation.

Epistemologically, the study adopts the stance that valid and reliable knowledge about the relationship between segmentation granularity and downstream task performance can be acquired through empirical investigation. By leveraging comprehensive and repeatable experimentation, the research seeks to uncover insights into the supervised morphological segmentation of isiZulu texts, using a rule-based approach as a pipeline to generate synthetic data of different granularities. This approach ensures that findings are grounded in measurable evidence rather than subjective interpretation.

Axiologically, the study reflects the positivist emphasis on objectivity and detachment. Personal values and biases are consciously minimised to maintain the integrity of the research process and outcomes. Ethical considerations are central, with transparency in data selection, methodological soundness, and the reproducibility of results being consistently prioritised in order to uphold the study's credibility.

2.6 Hypothetico-Deductive Methodology

Since this study adopts the positivist paradigm, the most common methodology for such scientific research is the hypothetico-deductive methodology, an approach regarded as the heart of scientific inquiry (Fosl & Baggini, 2020).
This methodology comprises two primary components: the hypothetico part, where an explanatory hypothesis is formulated to address the research problem, and the deductive part, where testable claims are derived from the hypothesis and empirically examined. The result of this examination is either a substantiated or a falsified hypothesis (Park et al., 2020). According to Fosl and Baggini (2020:47), the principle governing this procedure is to "start with a hypothesis and a set of given conditions, deduce what facts follow from them and then conduct experiments to see if those facts hold and hence whether the hypothesis is false". Sekaran and Bougie (2016) and Tariq (2015) present this methodology as a seven-step process, as detailed below.

1. Identify a broad problem area
The first step in the research process is to identify a general area of interest or concern that serves as the foundation for the research project. This step involves scanning existing literature, identifying gaps, and noting practical challenges in the field.

2. Define the problem statement
To conduct a scientific research study, one needs to have a definite aim or purpose (Sekaran & Bougie, 2016). In this step, a clear and concise problem statement should be articulated, which highlights the issue that the study seeks to address and provides the foundation for the research objectives and questions. Preliminary information is gathered to understand the factors contributing to the problem, narrowing the broad area into a specific problem statement.

3. Develop hypotheses
Based on observations and existing theories, hypotheses are developed to guide the study. These serve as predictions to be examined through empirical investigation (Popper, 2002). In a scientific study, the formulated hypotheses must meet two key criteria:
(a) Testability: The hypothesis must be capable of being empirically tested.
(b) Falsifiability: The hypothesis should be disprovable, because hypotheses cannot be confirmed, but can only be corroborated until contradicted by new findings.

4. Determine measures
This step involves identifying and selecting appropriate tools, instruments, or methods to operationalise (quantify or observe) the variables defined in the research hypothesis. Trochim et al. (2016) note that in this process, validity and reliability are critical considerations to ensure that the measures capture the constructs being studied effectively.

5. Data collection
Once the measures have been established, researchers proceed with gathering data using suitable methodologies such as surveys, experiments, interviews, or observations. Data collection should follow ethical guidelines and adhere to a well-designed protocol to ensure consistency and minimise bias (Fowler, 2014).

6. Data analysis
After collecting the data, researchers analyse it using numerical approaches such as statistical analysis, or qualitative techniques appropriate to the nature of the study. This step involves examining patterns, testing hypotheses, and drawing conclusions based on the data's evidence (Sekaran & Bougie, 2016).

7. Interpretation of data
The final step in this research process involves interpreting the results of the analysis in the context of the original hypotheses and the broader problem area. The findings provide insights into whether the hypothesis is supported or not. At this stage, the researcher has to discuss the implications, address potential limitations, and provide recommendations for future research. In the event that the hypothesis is not supported, the researcher should critically evaluate the reasons behind the outcome and refine the theory for retesting (Sekaran & Bougie, 2016).
2.6.1 Application in this Research

The hypothetico-deductive methodology serves as the guiding framework for this research, in which multiple hypotheses are presented that align with the main objective of the study, which is: To investigate supervised surface segmentation of isiZulu text using synthetic data generation. This is achieved through the five sub-objectives presented in Chapter 1, Section 1.4.2. The following section outlines how each step of the hypothetico-deductive approach is applied in this study:

1. Identifying a broad problem area
As discussed in Chapter 1, isiZulu's complex agglutinative morphology allows a single stem to generate numerous word variations, significantly increasing vocabulary size. This vocabulary explosion poses challenges for language modelling, as many word variations may be absent from training data, leading to out-of-vocabulary (OOV) issues where models struggle to recognise or generate unseen tokens effectively.

One potential solution to mitigate data sparsity is morphological segmentation, which reduces vocabulary size by breaking words into their constituent morphemes. Morphological segmentation can be performed as either canonical or surface segmentation: the former breaks words into their underlying morphemes, while the latter divides them into morpheme-based substrings (Cotterell et al., 2016b). When canonical segmentation is used, it is simpler to determine the "correct" segmentation, since the canonical forms of a language's morphemes are well understood from a linguistic point of view. The same cannot be said of surface segmentation, because the high degree of morphophonological alternation makes it less clear what a suitable segmentation should be and where to place the morpheme boundaries when looking to ensure optimal results in a downstream task.
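The distinction can be illustrated with a small sketch. This is an illustrative example only (not the study's ZRG system), using a well-known English word from the canonical-segmentation literature; the defining property is that surface morphs are substrings of the written form, so they must concatenate back to the original word, whereas canonical morphemes need not.

```python
# Illustrative sketch: surface vs canonical segmentation.
# A surface segmentation splits the written word into substrings (morphs);
# a canonical segmentation restores underlying morpheme forms.

def is_surface_segmentation(word: str, morphs: list[str]) -> bool:
    """True if the morphs are substrings that rejoin to the word form."""
    return "".join(morphs) == word

word = "achievability"
surface = ["achiev", "abil", "ity"]      # substrings of the written form
canonical = ["achieve", "able", "ity"]   # restored canonical morphemes

print(is_surface_segmentation(word, surface))    # True
print(is_surface_segmentation(word, canonical))  # False: "achieveableity" != word
```

The same property is what makes surface segmentation directly usable as a preprocessing step for downstream tasks: the segmented text can be detokenised back to the original by simple concatenation.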
Both canonical and surface segmentation can be achieved through either rule-based or machine learning-based approaches. Rule-based systems rely on expert-crafted linguistic rules, while machine learning-based methods infer patterns from data, which can be either labelled (supervised) or unlabelled (unsupervised). Studies have shown that supervised approaches generally outperform unsupervised methods in morphological segmentation (Belth, 2024; Ruokolainen et al., 2016; Wang et al., 2016b).

Despite the advantages of machine learning-based segmentation, obtaining high-quality annotated data remains a significant challenge, especially for low-resource languages like isiZulu. The lack of pre-existing labelled datasets makes training supervised models particularly difficult, necessitating alternative approaches to data generation.

2. Defining the problem statement
The study narrows its focus to the specific problem of data sparsity caused by isiZulu's morphological complexity. This sparsity negatively impacts the efficiency of language models, particularly in NLP systems, where poor handling of morphological variation results in suboptimal performance. Morphological segmentation has the potential to alleviate this issue; however, the scarcity of annotated data in low-resource languages makes it difficult to investigate morphological segmentation and gain insight into its effects. The present study therefore explores a hybrid approach to morphological surface segmentation: using a rule-based system, the ZRG, to generate synthetic morphologically surface-segmented data of different segmentation granularities, using this data to train supervised machine learning morphological segmenters, and investigating their performance intrinsically and extrinsically.

3.
Developing hypotheses
This study is based on two hypotheses:
(a) In developing a hybrid approach to supervised surface segmentation of isiZulu, using a machine learning approach as a foundation will ensure robustness and efficiency, while a combination of synthetically generated and automatically annotated data will address the requirement of machine learning approaches for large amounts of training data.
i. The hybrid approach will exceed the rule-based approach in robustness;
ii. The hybrid approach will exceed the rule-based approach in efficiency;
iii. The hybrid approach will reach a sufficient level of accuracy in comparison to the rule-based approach.
(b) The hybrid approach will result in one or more segmenters that improve performance in a downstream task, such as machine translation.

4. Determining measures
In determining the appropriate measures for evaluation, this study employs a two-fold approach to assess both the efficiency and effectiveness of the proposed morphological segmentation system. Efficiency is evaluated by examining the ease and speed with which the supervised machine learning model performs segmentation, particularly in comparison to the rule-based approach, which is used as the pipeline for generating data. This involves measuring the processing time of both systems and comparing which performs segmentation faster. Effectiveness, on the other hand, is measured in three key ways. Firstly, how robust are the supervised surface segmentation systems in segmenting new tokens that are not in the vocabulary (tokens that the ZRG could not segment)? The second and third measures of effectiveness involve evaluating the developed system intrinsically and extrinsically.
In the intrinsic evaluation, the system output is directly assessed in terms of predefined standards or criteria that relate to the system's functionalities or objectives (Jones & Galliers, 1995; Resnik & Lin, 2010). In the surface segmentation context, the intrinsic evaluation assesses how close or similar the system-generated morphs (hypothesis) are to pre-generated (constructed) morphs (ground truth). In sequential text problems, this assessment is usually conducted through quantitative n-gram-based metrics such as BLEU (Papineni et al., 2002) and chrF (CHaRacter-level F-score; Popović, 2015), and classification metrics such as precision, recall, and F1-score. Similarly, this work employs these metrics to evaluate how similar the generated morphs are to the pre-generated morphs.

When conducting extrinsic evaluation, the developed system is treated as an enabling technology and its impact on another system, a downstream task, is assessed (Resnik & Lin, 2010). In this case, the impact of segmentation models on a downstream NLP application, Neural Machine Translation (NMT), is examined. The effectiveness of different segmentation granularities is assessed by evaluating the performance of the NMT system when trained on segmented versus unsegmented data. This comparison provides insight into the extent to which morphological segmentation improves translation quality, as measured by BLEU and chrF scores. Through this comprehensive evaluation framework, the study ensures that both the segmentation models and their practical application in the NLP downstream task are well assessed, with a view to gaining useful insights.

5. Data collection
Two datasets are utilised in this research as part of the data collection step:
(a) A dataset for training the morphological segmenters across different levels of granularity.
(b) A parallel isiZulu-English dataset that is filtered and preprocessed.
The isiZulu sentences are segmented using the developed segmenters, with the unsegmented sentences serving as a baseline. This allows for a direct comparison of the effectiveness of the segmentation strategies.

In a broader sense, data collection in this study extends beyond acquiring raw datasets. It involves conducting multiple experiments across different datasets to analyse the impact of morphological segmentation on model performance. This process includes training, validation, and testing, where the data is systematically split to ensure reliable evaluation. Insights are gathered through performance metrics, model robustness, loss trends, and visualisation graphs, which provide a deeper understanding of the model's learning behaviour, token or morpheme predictions, and overall system effectiveness. These experimental observations serve as a critical component of data collection, enabling a comprehensive assessment of the developed system.

6. Data analysis
Statistical and computational methods are employed to analyse the collected data. Intrinsic evaluations are conducted to measure the quality of the segmentation models, while extrinsic evaluations assess the impact of segmentation on NMT performance. Comparative analyses are carried out across different granularity levels to identify trends and draw meaningful conclusions.

7. Interpretation of data
The results are examined to determine whether the hypotheses are supported. Key findings, such as the relationship between segmentation granularity and its impact on translation quality, are analysed. The implications for future research and applications in low-resource language NLP are discussed, addressing the proposed approach's strengths and limitations.
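To make the intrinsic metrics mentioned under steps 4 and 6 concrete, precision, recall, and F1 for a surface segmentation can be computed over predicted morph boundary positions. The sketch below is an illustrative stdlib-only implementation under assumed conventions (the exact scoring scheme used in the study may differ), and the isiZulu-style word and segmentations are hypothetical examples:

```python
# Sketch of boundary-level precision/recall/F1 for surface segmentation.
# A boundary is the character offset at which one morph ends and the next begins.

def boundary_positions(morphs: list[str]) -> set[int]:
    """Character offsets at which a morph boundary is placed."""
    positions, offset = set(), 0
    for morph in morphs[:-1]:  # no boundary after the final morph
        offset += len(morph)
        positions.add(offset)
    return positions

def boundary_f1(hyp: list[str], ref: list[str]) -> tuple[float, float, float]:
    """Precision, recall, and F1 of hypothesis boundaries against a reference."""
    h, r = boundary_positions(hyp), boundary_positions(ref)
    if not h or not r:  # a single-morph word has no boundaries to score
        return 0.0, 0.0, 0.0
    tp = len(h & r)  # boundaries placed in the correct positions
    precision = tp / len(h)
    recall = tp / len(r)
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

# Hypothetical example: reference segmentation "ngi-ya-bonga",
# hypothesis merges the first two morphs into one.
p, r, f = boundary_f1(["ngiya", "bonga"], ["ngi", "ya", "bonga"])
print(round(p, 2), round(r, 2), round(f, 2))  # 1.0 0.5 0.67
```

Here every boundary the hypothesis places is correct (precision 1.0), but it finds only half of the reference boundaries (recall 0.5); scoring over boundaries rather than whole morphs gives partial credit to near-miss segmentations.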
By applying the hypothetico-deductive methodology, this study ensures a systematic and structured approach to exploring how morphological segmentation can alleviate data sparsity challenges in isiZulu. Each step is aligned with the study's objectives, ensuring that the research generates valid, reliable, and actionable insights. Subsequent chapters will provide in-depth discussions of these steps and their execution within the study's context. Figure 2.1 summarises the research methodology that allows for the systematic investigation of the research problem and the structured pursuit of the study's objectives.

Figure 2.1: Summary of the research methodology.

2.7 Chapter Summary

This chapter outlined the research methodology employed to investigate the possibility of using a hybrid of the rule-based ZRG and supervised machine learning approaches to conduct morphological surface segmentation of isiZulu text at different granularity levels, and to gain salient insights through intrinsic and extrinsic evaluation. Grounded in the positivist paradigm, the study adopts the hypothetico-deductive methodology, ensuring a structured and rigorous approach to hypothesis-driven scientific inquiry.

The chapter introduced foundational research philosophies and paradigms, emphasising how ontological, epistemological, axiological, and methodological assumptions inform the research process. These philosophical underpinnings set the stage for the study's methodological choices, particularly its reliance on objectivity, empirical evidence, and systematic observation. The positivist paradigm was identified as the guiding framework, aligning with the study's focus on measurable phenomena and hypothesis testing.
The hypothetico-deductive methodology was presented as the central approach for this research, detailing its steps: identifying a broad problem area, defining a specific problem statement, developing hypotheses, determining measures, collecting and analysing data, and interpreting results. Each step was explicitly contextualised within the study objectives, demonstrating how the methodology supports a comprehensive test of the two formulated hypotheses.

This chapter highlighted key components of the methodology, including the selection of intrinsic and extrinsic evaluation metrics, the collection and preparation of datasets, and the use of statistical and computational methods for data analysis. By ensuring alignment between the research questions, methodology, and objectives, the study provides a pipeline to investigate the hypotheses in a systematic and scientifically governed manner and, hence, to achieve the primary objective.

In summary, this chapter has provided a comprehensive framework for conducting the research, bridging philosophical considerations with practical methodological choices. Embedding the study within a structured and hypothesis-driven framework helps to ensure the reliability and validity of the findings. Subsequent chapters will build on this foundation, delving into the implementation, experimentation, and evaluation processes, ultimately addressing the broader implications of the research for low-resource languages like isiZulu.

Chapter 3
Morphological Segmentation

3.1 Introduction

This chapter sets out to provide a literature review of morphological segmentation and a number of salient concepts associated with it. The review begins with the linguistic background of the Nguni languages, with a particular focus on isiZulu, covering aspects of phonology, phonetics, and morphology. The chapter then examines different strategies and techniques that are currently employed in the literature to conduct morphological segmentation.
These range from rule-based approaches to data-hungry approaches such as machine learning and deep learning techniques. The chapter then looks into different metrics that are commonly used to measure the performance of morphological segmentation, and concludes with the trends and open questions presented within this domain.

3.2 Background and Linguistic Characteristics of isiZulu

3.2.1 Background and History of isiZulu

This section provides a brief background and describes the linguistic features of isiZulu (and its related languages) that require the kind of segmentation described in this study. Zulu or isiZulu (as an endonym) is one of the 12 official languages in South Africa, and it is considered to be the most widely spoken indigenous language in the country. IsiZulu speakers comprise approximately a quarter of the population, with around 15.1 million home language speakers out of the population of 62 million people in South Africa (StatsSA, 2022a). The majority of isiZulu native speakers are concentrated in KwaZulu-Natal, where the language is predominantly spoken in 80% of households, followed by Mpumalanga, where it is the primary language in 27.8% of households, as shown in Figure 3.1. Apart from South Africa, isiZulu is also spoken in other Southern African countries, including Eswatini, Mozambique, and Namibia, although with smaller populations of speakers (Asante & Mazama, 2009).

Figure 3.1: Distribution of isiZulu Speakers Across South African Provinces, Based on Statistics 2022 Data

IsiZulu is part of the Nguni language family, which is