
    Factors that affect the accuracy of text-based language identification

    Date
    2012
    Author
    Botha, Gerrit R.
    Barnard, Etienne
    Abstract
    We investigate the factors that determine the performance of text-based language identification, with a particular focus on the 11 official languages of South Africa, using n-gram statistics as features for classification. For a fixed value of n, support vector machines generally outperform the other classifiers, but the simpler classifiers are able to handle larger values of n. This is found to be of overriding importance, and a Naïve Bayesian classifier is found to be the best choice of classifier overall. For input strings of 100 characters or more, accuracies as high as 99.4% are achieved. For the smallest input strings studied here, which consist of 15 characters, the best accuracy achieved is only 83%, but when the languages in different families are grouped together, this corresponds to a usable 95.1% accuracy.
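    As a rough illustration of the approach the abstract describes, the sketch below classifies short strings using character n-gram counts and a multinomial Naive Bayes rule. It is not the authors' implementation: the class name NgramNaiveBayes, the add-one smoothing, the uniform language prior, the choice n=3, and the two toy training sentences (English and Afrikaans) are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NgramNaiveBayes:
    """Hypothetical multinomial Naive Bayes over character n-grams.

    Assumes a uniform prior over languages and add-one smoothing.
    """

    def __init__(self, n=3):
        self.n = n
        self.counts = defaultdict(Counter)  # language -> n-gram counts
        self.totals = Counter()             # language -> number of n-grams seen
        self.vocab = set()                  # all n-grams seen in training

    def train(self, samples):
        for lang, text in samples:
            grams = char_ngrams(text.lower(), self.n)
            self.counts[lang].update(grams)
            self.totals[lang] += len(grams)
            self.vocab.update(grams)

    def log_score(self, lang, text):
        # The extra +1 reserves probability mass for unseen n-grams.
        v = len(self.vocab) + 1
        return sum(
            math.log((self.counts[lang][g] + 1) / (self.totals[lang] + v))
            for g in char_ngrams(text.lower(), self.n)
        )

    def predict(self, text):
        # With a uniform prior, the best language maximises the log-likelihood.
        return max(self.counts, key=lambda lang: self.log_score(lang, text))


# Toy training data, invented for this example only.
model = NgramNaiveBayes(n=3)
model.train([
    ("en", "the quick brown fox jumps over the lazy dog and the cat sleeps in the sun"),
    ("af", "die vinnige bruin jakkals spring oor die lui hond en die kat slaap in die son"),
])

print(model.predict("the dog sleeps"))
print(model.predict("die hond slaap"))
```

    Raising n makes each feature more language-specific, but the number of distinct n-grams also grows quickly. That growth is one plausible reason a count-based classifier like this one can handle larger values of n than a support vector machine.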
    URI
    http://researchspace.csir.co.za/dspace/bitstream/handle/10204/1976/Botha2_2007.pdf?sequence=1&isAllowed=y
    http://hdl.handle.net/10394/26515
    Collections
    • Faculty of Engineering [1136]

    Copyright © North-West University
    Contact Us | Send Feedback
    Theme by 
    Atmire NV
     

     
