dc.contributor.author: Giwa, Oluwapelumi
dc.contributor.author: Davel, Marelie H.
dc.date.accessioned: 2014-10-01T07:54:19Z
dc.date.available: 2014-10-01T07:54:19Z
dc.date.issued: 2013
dc.identifier.citation: Giwa, O. & Davel, M.H. 2013. N-gram based language identification of individual words. In: Conference Proceedings of the 24th Annual Symposium of the Pattern Recognition Association of South Africa (PRASA) Pretoria. p.15-22. [http://www.prasa.org/] [en_US]
dc.identifier.isbn: 978-0-86970-771-5
dc.identifier.uri: http://hdl.handle.net/10394/11525
dc.description.abstract: Various factors influence the accuracy with which the language of individual words can be classified using n-grams. We consider a South African text-based language identification (LID) task and experiment with two different types of n-gram classifiers: a Naïve Bayes classifier and a Support Vector Machine. Specifically, we investigate various factors that influence LID accuracy when identifying generic words (as opposed to running text) in four languages. These include: the importance of n-gram smoothing (Katz smoothing, absolute discounting and Witten-Bell smoothing) when training Naïve Bayes classifiers; the effect of training corpus size on classification accuracy; and the relationship between word length, n-gram length and classification accuracy. For the best variant of each of the two sets of algorithms, we achieve relatively comparable classification accuracies. The accuracy of the Support Vector Machine (88.16%, obtained with a Radial Basis function) is higher than that of the Naïve Bayes classifier (87.62%, obtained using Witten-Bell smoothing), but the latter result is associated with a significantly lower computational cost. Index Terms: text-based language identification, smoothing, character n-grams, Naïve Bayes classifier, support vector machine. [en_US]
dc.description.uri: http://www.prasa.org/index.php/2012-03-07-10-55-15
dc.language.iso: en [en_US]
dc.publisher: PRASA [en_US]
dc.title: N-gram based language identification of individual words [en_US]
dc.type: Other [en_US]
dc.contributor.researchID: 23607955 - Davel, Marelie Hattingh
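
As a rough illustration of the word-level, character n-gram Naïve Bayes approach named in the abstract, the sketch below may help. It is not the authors' implementation: it substitutes simple add-one (Laplace) smoothing for the Katz, absolute-discounting and Witten-Bell schemes the paper compares, assumes a uniform prior over languages, and trains on a tiny invented word list. The names `char_ngrams`, `NgramNaiveBayes` and `toy_data`, and the labels `en` and `af`, are hypothetical and do not come from the record.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(word, n=3):
    """Return the character n-grams of a word, padded with '#' boundary markers."""
    padded = "#" * (n - 1) + word.lower() + "#"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Toy word-level language identifier (assumption: add-one smoothing,
    uniform language prior; not the smoothing schemes studied in the paper)."""

    def __init__(self, n=3):
        self.n = n
        self.counts = defaultdict(Counter)  # language -> n-gram counts
        self.totals = Counter()             # language -> total n-gram tokens
        self.vocab = set()                  # all n-grams seen in training

    def train(self, labelled_words):
        for word, lang in labelled_words:
            for gram in char_ngrams(word, self.n):
                self.counts[lang][gram] += 1
                self.totals[lang] += 1
                self.vocab.add(gram)

    def log_score(self, word, lang):
        # One extra vocabulary slot reserves probability mass for unseen n-grams.
        v = len(self.vocab) + 1
        return sum(
            math.log((self.counts[lang][g] + 1) / (self.totals[lang] + v))
            for g in char_ngrams(word, self.n)
        )

    def classify(self, word):
        return max(self.counts, key=lambda lang: self.log_score(word, lang))


# Invented toy data for illustration only; not taken from the paper's corpus.
toy_data = [
    ("the", "en"), ("house", "en"), ("night", "en"), ("thing", "en"), ("water", "en"),
    ("die", "af"), ("huis", "af"), ("nag", "af"), ("skool", "af"), ("vrou", "af"),
]

model = NgramNaiveBayes(n=3)
model.train(toy_data)
print(model.classify("houses"), model.classify("skole"))
```

With so little training data the classifier leans heavily on the smoothing term, which is one reason the abstract treats the choice of smoothing and the training corpus size as factors worth studying separately.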

