N-gram based language identification of individual words
Abstract
Various factors influence the accuracy with which the language of individual words can be classified using n-grams. We consider a South African text-based language identification (LID) task and experiment with two different types of n-gram classifiers: a Naïve Bayes classifier and a Support Vector Machine. Specifically, we investigate various factors that influence LID accuracy when identifying generic words (as opposed to running text) in four languages. These include: the importance of n-gram smoothing (Katz smoothing, absolute discounting and Witten-Bell smoothing) when training Naïve Bayes classifiers; the effect of training corpus size on classification accuracy; and the relationship between word length, n-gram length and classification accuracy. For the best variant of each of the two classifiers, we achieve comparable classification accuracies. The accuracy of the Support Vector Machine (88.16%, obtained with a radial basis function kernel) is higher than that of the Naïve Bayes classifier (87.62%, obtained using Witten-Bell smoothing), but the latter result is achieved at a significantly lower computational cost.

Index Terms: text-based language identification, smoothing, character n-grams, Naïve Bayes classifier, support vector machine.
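The word-level character n-gram Naïve Bayes approach summarised in the abstract can be sketched as follows. This is an illustrative sketch, not the paper's implementation: it uses simple add-one smoothing in place of the Katz, absolute-discounting and Witten-Bell variants studied here, and the tiny English/Afrikaans training word lists are invented examples.

```python
import math
from collections import Counter

def ngrams(word, n):
    """Character n-grams of a word with '_' boundary markers."""
    padded = "_" + word + "_"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

class NgramNB:
    """Naive Bayes word-level language identifier over character n-grams.

    Uses add-one smoothing as a stand-in for the smoothing schemes
    compared in the paper (Katz, absolute discounting, Witten-Bell).
    """

    def __init__(self, n=3):
        self.n = n
        self.counts = {}   # language -> Counter of n-gram frequencies
        self.vocab = set()

    def train(self, corpus):
        for lang, words in corpus.items():
            c = Counter()
            for w in words:
                c.update(ngrams(w, self.n))
            self.counts[lang] = c
            self.vocab.update(c)

    def classify(self, word):
        best_lang, best_score = None, float("-inf")
        v = len(self.vocab)
        for lang, c in self.counts.items():
            total = sum(c.values())
            # Sum of add-one smoothed log-probabilities of the word's n-grams.
            score = sum(math.log((c[g] + 1) / (total + v))
                        for g in ngrams(word, self.n))
            if score > best_score:
                best_lang, best_score = lang, score
        return best_lang

# Invented toy corpus, purely for illustration.
corpus = {"en": ["the", "there", "this", "that"],
          "af": ["die", "daar", "dit", "dat"]}
clf = NgramNB(n=2)
clf.train(corpus)
print(clf.classify("then"))  # -> en
```

In practice the classifier is trained on per-language word lists derived from large corpora, and (as the abstract notes) both corpus size and the choice of n relative to word length affect accuracy.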