Tswana finite state tokenisation
Date
2015
Author
Pretorius, Laurette
Viljoen, Biffie
Berg, Ansu
Pretorius, Rigardt
Abstract
Tswana, a Bantu language in the Sotho group, is characterised by an
agglutinative morphology and a disjunctive orthography, which mainly affects the
verb category. In particular, verbal prefixes are usually written disjunctively, while
suffixes follow a conjunctive writing style. Therefore, Tswana tokenisation cannot
be based solely on whitespace, as is the case in many alphabetic, segmented languages,
including the conjunctively written Nguni group of South African Bantu
languages. This paper shows how two finite state tokeniser transducers and a
finite state morphological analyser are combined to solve the
Tswana (verb) tokenisation problem. The approach has the important advantage of
bringing the processing of Tswana, beyond the morphological analysis level, in line
with what is appropriate for the Nguni languages. This means that the challenge of
the disjunctive orthography is met at the tokenisation/morphological analysis level
and does not in principle propagate to subsequent levels of analysis such as
POS tagging and shallow parsing. The tokenisation approach is novel and, when
implemented and evaluated, yields an F$_1$-score of 95% with respect to a
hand-tokenised gold standard.
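As an illustration of how an F$_1$-score against a hand-tokenised gold standard can be computed, the following is a minimal sketch and not the evaluation procedure of the paper. It assumes that a token counts as correct only when its boundaries match a gold token exactly, and the function names and the placeholder strings (`aa`, `bb`, ...) are hypothetical stand-ins rather than Tswana data. Tokens may contain internal spaces, mirroring verbs whose prefixes are written disjunctively.

```python
def token_spans(tokens):
    """Map a token sequence to a set of (start, end) character offsets.

    Whitespace inside a token is ignored, so a multi-word token such as
    "aa bb cc" and its pieces are measured on the same character axis.
    """
    spans = set()
    pos = 0
    for tok in tokens:
        length = len(tok.replace(" ", ""))
        spans.add((pos, pos + length))
        pos += length
    return spans


def f1_score(predicted, gold):
    """Harmonic mean of precision and recall over exactly matching spans."""
    pred_spans = token_spans(predicted)
    gold_spans = token_spans(gold)
    correct = len(pred_spans & gold_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


# Hypothetical data: the gold standard treats "aa bb cc" as one token,
# while the predicted tokenisation wrongly splits off "cc".
gold = ["aa bb cc", "dd", "ee ff"]
predicted = ["aa bb", "cc", "dd", "ee ff"]
score = f1_score(predicted, gold)
```

Here two of the four predicted tokens match gold tokens (precision 1/2) and two of the three gold tokens are recovered (recall 2/3), so the score is 4/7.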
Collections
- Faculty of Humanities [2042]