dc.contributor.author | Pretorius, Laurette | |
dc.contributor.author | Viljoen, Biffie | |
dc.contributor.author | Berg, Ansu | |
dc.contributor.author | Pretorius, Rigardt | |
dc.date.accessioned | 2017-02-28T10:56:49Z | |
dc.date.available | 2017-02-28T10:56:49Z | |
dc.date.issued | 2015 | |
dc.identifier.citation | Pretorius, L. et al. 2015. Tswana finite state tokenisation. Language resources and evaluation, 49:831–856. [http://dx.doi.org/10.1007/s10579-014-9292-1] | en_US |
dc.identifier.issn | 1574-020X | |
dc.identifier.issn | 1574-0218 (Online) | |
dc.identifier.uri | http://hdl.handle.net/10394/20600 | |
dc.identifier.uri | http://dx.doi.org/10.1007/s10579-014-9292-1 | |
dc.description.abstract | Tswana, a Bantu language in the Sotho group, is characterised by an
agglutinative morphology and a disjunctive orthography, which mainly affects the
verb category. In particular, verbal prefixes are usually written disjunctively, while
suffixes follow a conjunctive writing style. Therefore, Tswana tokenisation cannot
be based solely on whitespace, as is the case in many alphabetic, segmented languages,
including the conjunctively written Nguni group of South African Bantu
languages. This paper shows how two finite state tokeniser
transducers and a finite state morphological analyser are combined to solve the
Tswana (verb) tokenisation problem. The approach has the important advantage of
bringing the processing of Tswana, beyond the morphological analysis level, in line
with what is appropriate for the Nguni languages. This means that the challenge of
the disjunctive orthography is met at the tokenisation/morphological analysis level
and does not in principle propagate to subsequent levels of analysis, such as POS
tagging and shallow parsing. The tokenisation approach is novel and, when
implemented and evaluated, yields an F$_1$-score of 95% with respect to a
hand-tokenised gold standard. | en_US |
dc.language.iso | en | en_US |
dc.publisher | Springer | en_US |
dc.subject | Finite state computational morphology | en_US |
dc.subject | Tswana | en_US |
dc.subject | Disjunctive orthography | en_US |
dc.subject | Tokenisation | en_US |
dc.subject | Verb morphology | en_US |
dc.title | Tswana finite state tokenisation | en_US |
dc.type | Article | en_US |
dc.contributor.researchID | 10203583 - Berg, Anna Susanna | |
dc.contributor.researchID | 10067256 - Pretorius, Rigardt Samuel | |