Factors that affect the accuracy of text-based language identification
Abstract
We investigate the factors that determine the performance
of text-based language identification, with a particular focus
on the 11 official languages of South Africa, using
n-gram statistics as features for classification. For a fixed
value of n, support vector machines generally outperform
the other classifiers, but the simpler classifiers are able to
handle larger values of n. This ability is found to be of overriding
importance, and a Naïve Bayesian classifier is found
to be the best choice of classifier overall.
For input strings of 100 characters or more, accuracies
as high as 99.4% are achieved. For the smallest input
strings studied here, which consist of 15 characters,
the best accuracy achieved is only 83%, but when languages
are grouped into their families, this corresponds
to a usable 95.1% accuracy.
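
The approach summarised in the abstract, character n-gram statistics fed to a Naïve Bayesian classifier, can be illustrated with a minimal sketch. This is not the authors' implementation: the class name NgramNaiveBayes, the choice of n = 3, add-one smoothing, uniform language priors, and the two toy training sentences are all assumptions made for illustration only.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NgramNaiveBayes:
    """Hypothetical multinomial Naive Bayes over character n-grams."""

    def __init__(self, n=3):
        self.n = n
        self.counts = {}    # language -> Counter of n-gram counts
        self.totals = {}    # language -> total number of n-grams seen
        self.vocab = set()  # all n-grams seen in any language

    def fit(self, samples):
        """samples: iterable of (language, text) pairs."""
        for lang, text in samples:
            grams = char_ngrams(text.lower(), self.n)
            self.counts.setdefault(lang, Counter()).update(grams)
            self.vocab.update(grams)
        self.totals = {lang: sum(c.values()) for lang, c in self.counts.items()}
        return self

    def log_score(self, lang, text):
        """Log-likelihood of text under lang, with add-one smoothing."""
        # One extra slot reserves probability mass for unseen n-grams.
        denom = self.totals[lang] + len(self.vocab) + 1
        counts = self.counts[lang]
        return sum(
            math.log((counts[g] + 1) / denom)
            for g in char_ngrams(text.lower(), self.n)
        )

    def predict(self, text):
        """Most likely language, assuming uniform priors (an assumption)."""
        return max(self.counts, key=lambda lang: self.log_score(lang, text))


# Toy training data, invented for this sketch (not from the paper's corpus).
training = [
    ("english", "the quick brown fox jumps over the lazy dog and the cat sleeps in the sun"),
    ("afrikaans", "die vinnige bruin jakkals spring oor die lui hond en die kat slaap in die son"),
]
model = NgramNaiveBayes(n=3).fit(training)
print(model.predict("the dog sleeps"))
print(model.predict("die hond slaap"))
```

Because training only accumulates counts, raising n mostly costs memory for the count tables, which is consistent with the abstract's observation that simpler classifiers can handle larger values of n.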
URI
http://researchspace.csir.co.za/dspace/bitstream/handle/10204/1976/Botha2_2007.pdf?sequence=1&isAllowed=y
http://hdl.handle.net/10394/26515