The Analysis of the Sepedi-English Code-switched Radio News Corpus

Ramalepe, Simon; Modipa, Thipe I; Davel, Marelie H

Files

Ramalepe, S. The Analysis of the Sepedi-English.pdf (874.33 KB)

Date

2022

Authors

Ramalepe, Simon

Modipa, Thipe I

Davel, Marelie H

Publisher

UP Journals

Abstract

Code-switching is a phenomenon that occurs mostly in multilingual countries, where multilingual speakers often switch between languages in their conversations. The unavailability of large-scale code-switched corpora hampers the development and training of language models for the generation of code-switched text. In this study, we explore the initial phase of collecting and creating a Sepedi-English code-switched corpus for generating synthetic news. Radio news and the frequency of code-switching in read news were considered and analysed. We developed and trained a Transformer-based language model using the collected code-switched dataset. We observed that the frequency of code-switched data in the dataset was very low, at 1.1%. We complemented our dataset with the news headlines dataset to create a new dataset. Although the frequency was still low, the model obtained the optimal loss rate of 2,361 with an accuracy of 66%.

Keywords

Code-switching, text generation, radio news, Transformers, Sepedi

Citation

Ramalepe, SM et al. 2022. The Analysis of the Sepedi-English Code-switched Radio News Corpus

URI

http://hdl.handle.net/10394/41783

Collections

Faculty of Engineering
