Establishing the reliability of natural language processing evaluation through linear regression modelling
Eiselen, Ernst Roald
Determining the quality of natural language processing applications is one of the most important aspects of technology development. There has, however, been very little work on establishing how well evaluation methods and measures represent the quality of a technology, and how reliable the evaluation results presented in most research are. This study presents a new stepwise evaluation reliability methodology: a step-by-step framework for creating predictive models of evaluation metric reliability that take inherent evaluation variables into account. These models can then be used to predict, from the variables present in the evaluation data, how reliable a particular evaluation will be before it is performed, allowing evaluators to adjust the evaluation data in advance to ensure reliable results. This also permits researchers to compare results when the same evaluation data are not available. The methodology is first applied to a well-defined technology, namely spelling checkers, with a detailed discussion of the evaluation techniques and statistical procedures required to model an evaluation accurately. The spelling checker evaluations are then investigated in more detail to show how individual variables affect the evaluation results. Finally, a predictive regression model for each of the spelling checker evaluations is created and validated to verify the accuracy of its predictive capability. After this in-depth analysis and application of the stepwise evaluation reliability methodology to spelling checkers, the methodology is applied to two further technologies, namely part-of-speech tagging and named entity recognition. These validation procedures are applied across multiple languages, specifically Dutch, English, Spanish and Iberian Portuguese.
Performing these additional evaluations shows that the methodology is applicable to a broader set of technologies across multiple languages.
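The core predictive step described above can be sketched with a small linear regression. This is a minimal illustration only, not the thesis's actual models: the predictor variables (test-set size, error-type coverage, lexicon overlap), the reliability scores, and the data points are all invented, and plain ordinary least squares stands in for the full stepwise evaluation reliability procedure.

```python
import numpy as np

# Hypothetical evaluation variables for four past evaluations.
# Each row: [test-set size (log10), error-type coverage, lexicon overlap]
X = np.array([
    [3.0, 0.40, 0.55],
    [3.5, 0.60, 0.60],
    [4.0, 0.65, 0.72],
    [4.5, 0.85, 0.80],
])
# Observed reliability of each evaluation's metric (invented values),
# e.g. agreement with a gold-standard re-evaluation.
y = np.array([0.62, 0.71, 0.83, 0.91])

# Fit ordinary least squares with an intercept column.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict_reliability(size_log10, coverage, overlap):
    """Predict how reliable an evaluation would be before running it."""
    return float(coef @ np.array([1.0, size_log10, coverage, overlap]))
```

An evaluator could then call `predict_reliability` on the variables of a planned evaluation and, if the predicted reliability is too low, adjust the evaluation data (e.g. enlarge the test set) before running the evaluation.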