dc.contributor.author: Botha, Gerrit R.
dc.contributor.author: Barnard, Etienne
dc.date.accessioned: 2018-03-05T13:14:39Z
dc.date.available: 2018-03-05T13:14:39Z
dc.date.issued: 2012
dc.identifier.citation: Gerrit Reinier Botha and Etienne Barnard, “Factors that affect the accuracy of text-based language identification”, Computer Speech and Language, Vol 26, No 5, pp 307-320, 2012. [http://engineering.nwu.ac.za/multilingual-speech-technologies-must/publications] [en_US]
dc.identifier.uri: http://researchspace.csir.co.za/dspace/bitstream/handle/10204/1976/Botha2_2007.pdf?sequence=1&isAllowed=y
dc.identifier.uri: http://hdl.handle.net/10394/26515
dc.description.abstract: We investigate the factors that determine the performance of text-based language identification, with a particular focus on the 11 official languages of South Africa, using n-gram statistics as features for classification. For a fixed value of n, support vector machines generally outperform the other classifiers, but the simpler classifiers are able to handle larger values of n. This is found to be of overriding performance, and a Naïve Bayesian classifier is found to be the best choice of classifier overall. For input strings of 100 characters or more accuracies as high as 99.4% are achieved. For the smallest input strings studied here, which consist of 15 characters, the best accuracy achieved is only 83%, but when the languages in different families are grouped together, this corresponds to a usable 95.1% accuracy. [en_US]
dc.description.sponsorship: Human Language Technologies Research Group, Meraka Institute, Pretoria, South Africa [en_US]
dc.language.iso: en [en_US]
dc.publisher: Computer Speech and Language [en_US]
dc.subject: Text-based language identification [en_US]
dc.subject: N-gram statistics [en_US]
dc.subject: Language Identification [en_US]
dc.title: Factors that affect the accuracy of text-based language identification [en_US]
dc.type: Presentation [en_US]

