Automatic genre classification of English students' argumentative essays using support vector machines
Abstract
Automatic text classification refers to the classification of texts according to topic. A related task is the automatic classification of texts according to their stylistic aspects, as in automatic genre classification, where texts are classified by genre. This is the classification task that concerns this research project. The project examines the genre of the argumentative essay in order to develop a genre classifier that categorises student writers' prototypical and non-prototypical argumentative essays as 'good' or 'bad' examples of the genre (binary classification). It is intended that this classifier will allow a senior marker (for example, a lecturer) to pass essays classified as 'good' (those that require less feedback and less expert correction) to junior markers (for example, teaching assistants). This would give the senior marker time to pay more attention to essays of 'poorer' quality. The corpus used for the research project comprises 346 argumentative essays drawn from a section of the British Academic Written English corpus and written by L1 English students. The data consist of counts of linguistic features extracted from the texts. Once extracted, these features were used to create four data sets: a raw data set composed of raw feature frequencies, a data set composed of the feature set normalised for text length, a data set composed of inverse document frequency counts, and a data set composed of a logarithmic transformation of the feature frequencies. Various classifiers were built from these four data sets using a machine learning approach, in which a classifier is trained on previous examples in order to predict the class of future examples.
The project uses the STATISTICA implementation of support vector machines, the STATISTICA Support Vector Machine module (Statsoft, 2006). Support vector machine learning is used because this technique has been shown to perform well in automatic genre classification studies and other classification tasks.
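As an illustration only, the four data sets described in the abstract could be derived from a matrix of feature counts roughly as follows. This is a minimal sketch in plain Python, not the procedure used in the project: the function name `build_data_sets`, the toy counts, the use of `log(N / df)` as the inverse document frequency weight, and the use of `log(1 + count)` as the logarithmic transformation are all assumptions, since the abstract does not state the exact formulas.

```python
import math


def build_data_sets(counts, lengths):
    """Derive four feature representations from raw counts.

    counts  -- one list of feature counts per text (hypothetical input)
    lengths -- number of tokens in each text, in the same order
    """
    n_texts = len(counts)
    n_features = len(counts[0])

    # 1. Raw feature frequencies, copied unchanged.
    raw = [list(row) for row in counts]

    # 2. Frequencies normalised for text length (count per token).
    normalised = [
        [c / length for c in row] for row, length in zip(counts, lengths)
    ]

    # 3. Counts weighted by inverse document frequency.
    #    Assumption: idf = log(N / df); features absent from every text get 0.
    doc_freq = [
        sum(1 for row in counts if row[j] > 0) for j in range(n_features)
    ]
    idf = [math.log(n_texts / df) if df else 0.0 for df in doc_freq]
    idf_weighted = [[c * w for c, w in zip(row, idf)] for row in counts]

    # 4. Logarithmic transformation of the frequencies.
    #    Assumption: log(1 + count), so that zero counts stay defined.
    logged = [[math.log(1 + c) for c in row] for row in counts]

    return {
        "raw": raw,
        "normalised": normalised,
        "idf": idf_weighted,
        "log": logged,
    }


# Toy example: three texts, three linguistic features (invented numbers).
toy_counts = [[4, 0, 2], [1, 3, 0], [0, 0, 5]]
toy_lengths = [100, 50, 200]
data_sets = build_data_sets(toy_counts, toy_lengths)
```

Each of the four resulting matrices could then be passed, together with the 'good' or 'bad' labels, to a support vector machine implementation for training and evaluation.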