The Analysis of the Sepedi-English Code-switched Radio News Corpus
Abstract
Code-switching is a phenomenon that occurs
mostly in multilingual countries where multilingual
speakers often switch between languages in
their conversations. The unavailability of large-scale
code-switched corpora hampers the development
and training of language models for the generation
of code-switched text. In this study, we
explore the initial phase of collecting and creating
a Sepedi-English code-switched corpus for generating
synthetic news. Radio news and the frequency
of code-switching in read news were considered
and analysed. We developed and trained a
Transformer-based language model using the collected
code-switched dataset. We observed that the
frequency of code-switched data in the dataset was
very low at 1.1%. We complemented our dataset with
the news headlines dataset to create a new dataset.
Although the frequency was still low, the model obtained
the optimal loss rate of 2,361 with an accuracy
of 66%.
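
The abstract does not state how the code-switching frequency was measured. As a minimal sketch only, the following assumes a sentence-level definition: a sentence counts as code-switched when it contains at least one word from an English word list and at least one word outside it. The names `ENGLISH_WORDS`, `tokenize`, `is_code_switched`, and `code_switch_frequency` are hypothetical, and the tiny word list is a placeholder, not the resource used in the study.

```python
import re

# Hypothetical placeholder lexicon; a real analysis would need a full
# English word list and handling of loanwords and proper nouns.
ENGLISH_WORDS = {"budget", "minister", "parliament", "the", "of"}


def tokenize(sentence):
    """Lowercase the sentence and split it into word tokens."""
    return re.findall(r"\w+", sentence.lower())


def is_code_switched(sentence, english_words=ENGLISH_WORDS):
    """Return True if the sentence mixes English and non-English tokens.

    Assumption: any token outside the English list is treated as
    belonging to the other language (here, Sepedi).
    """
    tokens = tokenize(sentence)
    has_english = any(t in english_words for t in tokens)
    has_other = any(t not in english_words for t in tokens)
    return has_english and has_other


def code_switch_frequency(sentences, english_words=ENGLISH_WORDS):
    """Return the fraction of sentences that are code-switched."""
    if not sentences:
        return 0.0
    switched = sum(is_code_switched(s, english_words) for s in sentences)
    return switched / len(sentences)
```

Under this assumed definition, a reported frequency of 1.1% would mean that roughly 11 in every 1,000 sentences mix the two languages. A token-level or switch-point-level definition would give a different figure.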