dc.contributor.author | Pretorius, Laurette
dc.contributor.author | Viljoen, Biffie
dc.contributor.author | Berg, Ansu
dc.contributor.author | Pretorius, Rigardt
dc.date.accessioned | 2017-02-28T10:56:49Z
dc.date.available | 2017-02-28T10:56:49Z
dc.date.issued | 2015
dc.identifier.citation | Pretorius, L. et al. 2015. Tswana finite state tokenisation. Language resources and evaluation, 49:831–856. [http://dx.doi.org/10.1007/s10579-014-9292-1] | en_US
dc.identifier.issn | 1574-020X
dc.identifier.issn | 1574-0218 (Online)
dc.identifier.uri | http://hdl.handle.net/10394/20600
dc.identifier.uri | http://dx.doi.org/10.1007/s10579-014-9292-1
dc.description.abstract | Tswana, a Bantu language in the Sotho group, is characterised by an agglutinative morphology and a disjunctive orthography, which mainly affects the verb category. In particular, verbal prefixes are usually written disjunctively, while suffixes follow a conjunctive writing style. Therefore, Tswana tokenisation cannot be based solely on whitespace, as is the case in many alphabetic, segmented languages, including the conjunctively written Nguni group of South African Bantu languages. This paper shows how a combination of two finite state tokeniser transducers and a finite state morphological analyser are combined to solve the Tswana (verb) tokenisation problem. The approach has the important advantage of bringing the processing of Tswana, beyond the morphological analysis level, in line with what is appropriate for the Nguni languages. This means that the challenge of the disjunctive orthography is met at the tokenisation/morphological analysis level and does not in principle propagate to subsequent levels of analysis such as POS tagging and shallow parsing, etc. The tokenisation approach is novel and, when implemented and evaluated, yields an F$_1$-score of 95 % with respect to a hand tokenised gold standard. | en_US
dc.language.iso | en | en_US
dc.publisher | Springer | en_US
dc.subject | Finite state computational morphology | en_US
dc.subject | Tswana | en_US
dc.subject | Disjunctive orthography | en_US
dc.subject | Tokenisation | en_US
dc.subject | Verb morphology | en_US
dc.title | Tswana finite state tokenisation | en_US
dc.type | Article | en_US
dc.contributor.researchID | 10203583 - Berg, Anna Susanna
dc.contributor.researchID | 10067256 - Pretorius, Rigardt Samuel

