AUTOMATIC GENRE CLASSIFICATION OF ENGLISH STUDENTS' ARGUMENTATIVE ESSAYS USING SUPPORT VECTOR MACHINES

by Sabrina Raaff

Dissertation submitted for the degree Magister Artium in Linguistics and Literary Theory at the North-West University

Supervisor: Prof. A.J. van Rooy
Co-supervisor: Prof. G.B. van Huyssteen

2008
Potchefstroom Campus

ACKNOWLEDGEMENTS

My most sincere thanks go to my supervisor, Bertus van Rooy, for his consistent confidence in me, and understanding. I have really enjoyed working with you. And thanks go to my co-supervisor, Gerhard van Huyssteen, for all his constructive help and suggestions. Thank-you for checking over all my work so thoroughly. Thank-you to the Linguistics Department at Rhodes University, which is very close to my heart, for my love of Linguistics. Especially, Ian Bekker, for inspiring me to come to the PUK and introducing me to Computational Linguistics. I am especially grateful to Andre for his ever-enthusiastic support, for being truly interested in my explanations of machine learning, for staying up into the small hours of the morning and for his unwavering love; ek is verskriklik dankbaar vir alles. I would like to thank my parents for their support, encouragement, and love in everything I have done. Thank-you for doing all the things for me that I did not have time to do and for providing a warm refuge in times of tiredness and stress. To my family, you are too numerous to list, and I dare not risk leaving anybody out; you have all helped me in some way, and I thank you all. To my wonderful friends, thank-you: Ale, Aunty M & Colin, Candz, Carly, Claud, Gary, Gen, George, Hortense & Jacques, Megs C., Megs W., Melanie, The Fruitbowl: Hayls, Leapy, Lol, M, (and honorary members) Aniks, Linds, and Mel. And thanks go to everybody else who inspired me, gave me readings for free and helped me understand all sorts of concepts essential to my research.

Sabrina Raaff
January 2008

ABSTRACT

Automatic text classification refers to the classification of texts according to topic. Similar to text classification is the automatic classification of texts based on some stylistic aspect of texts, such as automatic genre classification, where texts are classified according to their genre. This is the classification task that concerns this research project.* The project seeks to examine the genre of the argumentative essay, in order to develop a genre classifier, using an automatic genre classification approach, which will categorise prototypical and non-prototypical argumentative essays of student writers into 'good' or 'bad' examples of the genre (binary classification). It is intended that this classifier will allow a senior marker (for example, a lecturer) to give student essays classified 'good' (those that require less feedback and volume of expert correction) to junior markers (for example, teaching assistants). This would afford the senior marker time to pay more attention to essays of a 'poorer' quality. The corpus used for the research project comprises 346 argumentative essays drawn from a section of the British Academic Written English corpus and written by L1 English students. The data are composed of counts of linguistic features extracted from the texts.
Once these features were extracted from the texts they were used to create four data sets: a raw data set, composed of raw feature frequencies; a data set composed of the feature set normalised for text length; a data set composed of inverse document frequency counts; and a data set composed of a logarithmic transformation of the feature frequencies. Various classifiers were built making use of these four data sets, using a machine learning approach. In this way, a classifier is trained on previous examples, in order to predict the class of future examples. The project uses support vector machines in STATISTICA's implementation of support vector machines, the STATISTICA Support Vector Machine module (Statsoft, 2006). Support vector machine learning is used because this technique has been shown to perform well in automatic genre classification studies and other classification tasks.

* Please note that research project or simply project is used solely to refer to the research that this dissertation reports on.

In light of the practical outcome of the project, the classifier's performance is evaluated in terms of the recall of 'bad' examples. The best results were obtained on the classifier built on the text-length normalised data set, using feature selection, a linear kernel and C = 32. The recall of 'bad' examples in the test set is 62.5 percent, the recall of 'good' examples in the test set is 74.5 percent, and training accuracy is 62.9 percent. This study thus shows that argumentative essays can indeed be classified using an automatic genre approach, and that the differences between the prototypical and non-prototypical essays can be fairly adequately extracted, using linguistic features that are easy to compute. Furthermore, the study confirms good performance of support vector machines, especially if many features are used.

Keywords: Automatic Genre Classification/Recognition/Analysis, Automatic Text Classification, Information/Text Retrieval, Corpus Linguistics, Corpora, Computational Linguistics, Automatic Annotation, Machine Learning, Natural Language Processing.

OPSOMMING

Outomatiese teksklassifisering verwys na tematiese teksklassifisering. Dit is soortgelyk aan outomatiese teksklassifisering, gebaseer op stilistiese teksaspekte, soos outomatiese genre-klassifisering, waar tekste volgens genre geklassifiseer word. Hierdie projek ondersoek die genre van navorsingsopstelle (ondersoekende tekste), ten einde 'n genre-klassifiseerder te ontwikkel, wat, deur van 'n outomatiese teksklassifiseringsbenadering gebruik te maak, prototipiese en nie-prototipiese navorsingsopstelle van studenteskrywers as 'goeie' en 'swak' voorbeelde van die genre (binêre klassifisering) sal kategoriseer. Die doel is dat sodanige klassifiseerder 'n senior nasiener (byvoorbeeld 'n lektor/lektrise) sal toelaat om studentetekste wat as 'goed' geklassifiseer is (dus min terugvoering en deskundige insette vereis), aan junior nasieners (byvoorbeeld onderwysassistente) toe te vertrou. Die senior nasiener sal sodoende tyd beskikbaar hê om meer intensief aandag aan tekste van 'swakker' kwaliteit te skenk. Die versameling geskrewe tekste wat vir hierdie projek gebruik is, bestaan uit 346 navorsingsopstelle uit 'n afdeling van die British Academic Written English Corpus, geskryf deur L1 Engelse studente. Die data is saamgestel uit 'n versameling linguistiese kenmerke, wat uit die tekste verkry is.
Uit hierdie kenmerke is vervolgens vier stelle data geskep: 'n onverwerkte (rou) stel data, bestaande uit onverwerkte kenmerkfrekwensies; 'n stel data, bestaande uit 'n stel kenmerke, genormaliseer volgens tekslengte; 'n stel data, bestaande uit 'n versameling omgekeerde (teenoorgestelde) dokumentfrekwensies; en 'n stel data, bestaande uit 'n logaritmiese transformasie van die kenmerkfrekwensies. Die vier stelle data is gebruik om verskeie klassifiseerders te ontwikkel deur 'n masjinale (rekenaargebaseerde) leerbenadering gebruik te maak. Op hierdie wyse word die klassifiseerder volgens bestaande voorbeelde geprogrammeer, ten einde die klassifisering van toekomstige voorbeelde te voorspel. Hierdie projek maak gebruik van die ondersteuningsvektormasjien in STATISTICA se implementering van ondersteuningsvektormasjiene, naamlik die STATISTICA Ondersteuningsvektormasjienmodule (Statsoft, 2006). Die ondersteuningsvektormasjien-leerproses is gebruik, aangesien hierdie tegniek reeds goeie resultate in outomatiese genre-klassifisering, asook ander klassifiseringstake gelewer het. In die lig van hierdie praktiese uitkoms(te) van die projek word die klassifiseerder se prestasie ooreenkomstig die herroeping van 'swak' voorbeelde beoordeel. Die beste resultate is deur die klassifiseerder gelewer, wat met behulp van die genormaliseerde tekslengte-datastel ontwikkel is, deur gebruik te maak van kenmerkseleksie, 'n lineêre kern en C=32. Die herroeping van 'swak' voorbeelde in die toetsstel is 62.5 persent, dié van 'goeie' voorbeelde 74.5 persent en programmeringsakkuraatheid 62.9 persent. Hierdie studie bewys dat navorsingsopstelle inderdaad deur 'n outomatiese genrebenadering geklassifiseer kan word en dat die verskille tussen prototipiese en nie-prototipiese tekste redelik voldoende uit die maklik-rekenariseerbare linguistiese kenmerke geïdentifiseer kan word. Die studie bevestig verder goeie prestasie deur ondersteuningsvektormasjiene, indien van 'n verskeidenheid kenmerke gebruik gemaak word.

Sleutelwoorde: Outomatiese Genre-klassifisering/Erkenning/Analise, Outomatiese Teksklassifisering, Inligtings-/Teksherwinning, Versameling(s) van Linguistiese Tekste (Corpus/Corpora), Rekenaarlinguistiek, Outomatiese Annotasie, Masjinale (Rekenaargebaseerde) Leer, Natuurliketaalprosessering.

PREFACE

When I first began this project, I had taken only a few elementary courses in Computational Linguistics and otherwise had a solid linguistic background, but no knowledge or experience of Mathematics or Natural Language Processing. I, therefore, had a huge amount of catching up to do and simply drowned in the literature for a very long time. In order to help me along the way, and to introduce me to concepts in context, rather than the decontextualised dictionary definitions I was reading up on at the time, I took an undergraduate Mathematics course for a semester. I also learnt how to use Linux, some Perl, and came to love regular expressions. Often, when I was stuck on something in a reading, I found that it was because the authors assumed their readers knew as much as they did. I found this a major stumbling block, especially as very few authors provided references to concepts that there was either no space to explain or which they assumed were known. As a result, I have tried to explain all concepts that may be alien to Linguists or else provided references in footnotes to background information and concepts that are important but not explained.
In this way, this dissertation is written for Linguists with no computational or mathematical background, but the approach of this dissertation also addresses it to Computational Linguists. For anyone who is 'lost in the literature', I recommend as a starting point an undergraduate course in Mathematics (to learn about matrices and vectors, complex numbers are also good to learn), Michael Oakes's (1998) Statistics for corpus linguistics, Neil Salkind's (2004) Statistics for people who (think they) hate statistics, and Tony Rietveld and Roeland van Hout's (2005) Statistics in language research: analysis of variance. I would like to thank Amelia Nkosapantsi, Attie de Lange, Elsa van Tonder, Teresa Smit, and Wannie Carstens for all their support during my research period. As well as the National Research Foundation, without whom none of this research would have been funded. PRELIMINARIES vii Many thanks also go to Annick Griebenouw for showing me how to use SVMTool when I was still a Linux and Perl infant, and everybody at the Centre for Text Technology, especially Charlene Gentle, Jacques McDermid Heyns, Martin Puttkammer, and Sulene Pilon who provided lots of help, and patient explanations (and free software!). Thank-you very much, to the wonderful ladies at the library, particularly Gerda van Rooyen for starting me off on the information hunt. I am deeply grateful to the BAWE team for their data and helpful attitude, principally Sian Alsop and Jasper Holmes. Without them, I would not have had ANY data! A huge thank-you to Jesus Gimenez, for educating me about feature sets and encoding. Thank-you also to Sarel Steele for giving me advice and dissertations when all other information sources ran dry. Thanks also go to Liz Greyling for editing this dissertation. And finally, thank-you so much to Google, without 'whom' I simply would never have managed to have the prerequisites to understanding the prerequisites of support vector machines. Sabrina Raaff January 2008 TABLE OF CONTENTS viii TABLE OF CONTENTS ACKNOWLEDGEMENTS i ABSTRACT ii OPSOMMING iv PREFACE vi TABLE OF CONTENTS viii TABLE OF FIGURES xi TABLE OF TABLES xii CHAPTER 1: Introduction 1 1.1 Introduction 1 1.2 Overview of the research area and contextualisation 2 1.3 Problem statement 4 1.4 Research questions 9 1.5 Research aims 9 1.6 Statement of the central hypotheses 10 1.7 Overview of methodology 11 1.8 Chapter outline 13 1.9 Summary 14 CHAPTER 2: Literature review 15 2.1 Introduction 15 2.2. Defining genre 16 2.3. 
Overview of previous automatic genre classification studies 20 2.3.1 Overview of seminal works in genre classification studies 21 2.3.2 Overview of contemporary genre classification studies 28 2.3.3 Overview of genre classification studies that make use of support vector machines 47 2.4 Summary 56 CHAPTER 3: Developing the classifier 59 3.1 Introduction 59 3.2 The corpus 60 3.3 The features 61 3.3.1 Parts-of-speech 61 3.3.2 Punctuation marks 62 3.3.3 Quotations 62 3.3.4 Nominalisations 63 3.3.5 Text statistics 65 3.3.5.1 Word count 65 3.3.5.2 Word length 65 3.3.5.3 Long words 65 3.3.5.4 Type/token ratios (TTR) 66 3.3.5.5 Sentence count 67 3.3.5.6 Sentence length in words 67 3.3.5.7 Paragraph count 67 3.3.5.8 Paragraph length in sentences 68 3.3.5.9 Readability scores 68 3.3.6 Word lists 69 3.3.6.1 Key function words 69 3.3.6.2 The top fifty most frequent words in the BNC 69 TABLE OF CONTENTS ix 3.3.6.3 Prepositions 70 3.3.6.4 Reporting verbs 70 3.3.6.5 Conjunction 71 3.3.6.5.1 Conjunctive adjuncts (conjuncts) 71 3.3.6.5.2 Coordinating conjunctions 72 3.3.6.5.3 Subordinating conjunctions 72 3.3.6.6 Hedges 72 3.3.6.7 Downtoners 73 3.3.6.8 Stance adverbs 73 3.3.6.9 Stance adjectives 74 3.3.6.9.1 Stance adjectives controlling that-clauses 74 3.3.6.9.2 Stance adjectives controlling to-clauses 74 3.3.6.10 Nouns 75 3.3.6.10.1 Stance nouns taking that-clmses 75 3.3.6.10.2 Nouns taking to-clauses 75 3.4 Text preparation before feature extraction 75 3.5 Annotation of the corpus 78 3.5.1 POS tagging 78 3.5.2 XML tagging 80 3.6 Data preparation before classification 81 3.7 Support vector machines 88 3.7.1 The learning problem 89 3.7.2 The optimal hyperplane classifier and the hard margin classifier 91 3.7.3 Kernels 96 3.7.4 The soft margin classifier 98 3.8 Summary 101 CHAPTER 4: Training, evaluation and interpretation 102 4.1 Introduction 102 4.2 Training the classifier, using support vector machines 103 4.2.1 Parameter selection 103 4.2.2 Potential data problems: imbalanced data sets, misclassification costs and the normal assumption 106 4.3 Evaluation indicators and metrics 109 4.4 Results and discussion 112 4.5 Analysis of the classifier's performance 124 4.6 Summary 129 CHAPTER 5: Conclusion and recommendations 133 5.1 Introduction 133 5.2 Summary of chapters 133 5.3 Summary of results and findings 137 5.4 Recommendations for further research 145 5.4.1 Features 145 5.4.2 Evaluation 147 5.4.3 Learning technique 148 5.5 Summary 148 REFERENCE LIST 150 APPENDIX 1: Corpus information 167 Al. 
l Subjects 167 A1.2 Essays used in this project 167 APPENDIX 2: Linguistic features 179 TABLE OF CONTENTS x A2.1 Parts-of-speech 179 A2.2 Punctuation marks 186 A2.3 Quotations 186 A2.4 Nominalisations 187 A2.5 Text statistics 187 A2.6 Key function words 188 A2.7 Most frequent words in the BNC 188 A2.8 Prepositions 189 A2.9 Reporting verbs 191 A2.10 Conjunctions 192 A2.ll Downtoners 193 A2.12 Stance adverbs 194 A2.13 Stance adjectives 195 A2.14 Nouns 197 APPENDIX 3: Data preparation and annotation 199 A3.1 Data cleaning 199 A3.2 SVMTool 199 APPENDIX 4: Support vector machines 200 A4.1 The optimal hyperplane classifier and the hard margin classifier 200 A4.2 The soft margin classifier 204 APPENDIX 5: Feature selection 209 NOTATION AND ABBREVIATIONS 212 TABLE OF FIGURES xi TABLE OF FIGURES Figure 1.1: Process of developing the classifier 12 Figure 2.1: A model of learning from examples 15 Figure 2.2: Framing the genre classification task in terms of prototype 19 Figure 3.1: Illustrating the genre classification task, 'average' examples 83 Figure 3.2: A separating hyperplane 92 Figure 3.3: A maximal margin hyperplane that perfectly separates St, with its support vectors (SVs) 95 Figure 3.4. Feature mapping 0: X —» F 97 Figure 4.1: Matrix showing classification and misclassification terms 107 TABLE OF TABLES xii TABLE OF TABLES Table 4.1: Results on different data sets for SVM training using the best parameter values 113 Table A 1.1.1: List of departments/ courses from which essays used are drawn 167 Table Al.2.1: List of essays from the BAWE and their grading in percentage 178 Table A2.1.1: List of Penn Treebank part-of-speech tags 181 Table A2.1.2: List of UCREL CLAWS7 part-of-speech tags 186 Table A2.2.1: List of punctuation tags 186 Table A2.3.1: List of quotation tags 187 Table A2.4.1: List of nominalisational suffixes and their respective tags 187 Table A2.6.1: List of the key function words of the top 1000 key words 188 Table A2.7.1: List of the top fifty words in the written section of the BNC 188 Table A2.8.1: List of simple prepositions 189 Table A2.8.2: List of two-word complex prepositions 190 Table A2.8.3: List of three-word complex prepositions 191 Table A2.9.1: List of public factual verbs 191 Table A2.9.2: List of private factual verbs 192 Table A2.9.3: List of suasive verbs 192 Table A2.9.4: List of miscellaneous reporting verbs 192 Table A2.9.5: List of perception verbs 192 Table A2.10.1: List of conjunctive adjuncts/conjuncts/linking adverbials 193 Table A2.10.2: List of multi-word conjunctive adjuncts/conjuncts/linking adverbials 193 Table A2.10.3: List of subordinating conjunctions 193 Table A2.10.4: List of coordinating conjunctions 193 Table A2.11.1: List of downtoners 194 Table A2.12.1: List of non-factual stance adverbs 194 Table A2.12.2: List of factual stance adverbs 194 Table A2.12.3: List of two-word factual stance adverbs 194 Table A2.12.4: List of likelihood stance adverbs 194 Table A2.12.5: List of attitudinal stance adverbs 194 Table A2.13.1: List of attitudinal stance adjectives 195 Table A2.13.2: List certainty/factual stance adjectives 195 Table A2.13.3: List of likelihood stance adjectives 196 Table A2.13.4: List of certainty stance adjectives 196 Table A2.13.5: List of ability/willingness stance adjectives 196 Table A2.13.6: List of personal affective stance adjectives 196 Table A2.13.7: List of ease/difficulty stance adjectives 197 Table A2.13.8: List of evaluation stance adjectives 197 Table A2.14.1: List of factual stance nouns 197 Table A2.14.2: List of 
likelihood stance nouns 197 Table A2.14.3: List of non-factual stance nouns 197 Table A2.14.4: List of attitudinal stance nouns 198 Table A2.14.5: List of controlling nouns 198 Table A3.1.1: List of terms with multiple occurrences in the word lists 199 Table A5.1: List of features selected 212 INTRODUCTION 1 CHAPTER 1 Introduction Composition, rhetoric and argumentation have traditionally played a key role in Western education, and argumentation in particular is valued highly for its associations with the concept of logical thinking, proofs and refutations (English, 1999:17) 1.1 Introduction Researchers distinguish between text categorisation and text classification (Jackson & Moulinier, 2002:119), where text categorisation generally refers to document sorting by content and topic1 (Manning & Schiitze, 1999:575), and text classification to any document classification not necessarily based on content, such as classification by author. In the literature, however, this distinction is normally established through explanation rather than terminology. In this research project, the automatic classification of texts according to topic is termed automatic text classification. Similar to text classification is the automatic classification of texts based on some stylistic aspect of texts; examples of such stylistic classification are authorship attribution studies (Mosteller & Wallace, 1964) and genre classification. This project is concerned with this latter classification, based on genre, which is referred to as automatic genre classification. Genre is defined at length in Chapter 2; in summary, it refers to a class of communicative events, in which the participants share some communicative purpose(s). This common purpose determines the discursive structure, style and content of a genre. Exemplar members of a genre thus demonstrate patterns of similarity with regard to structure, style, content and intended audience (Swales, 1990:58). Automatic genre classification has its niche in effective webpage searching, seeking to create a web-search engine augmented with a genre identification module (Kwasnik, Crowston, Nilan, & Roussinov, 2000). Ultimately, users could specify genre type according to their information needs, which would ensure higher precision in retrieval and higher relevancy to the user. Automatic genre classification is thus largely associated with information retrieval. As a result, it is mainly used to distinguish between web-specific genres, such as FAQs (frequently asked questions), where the In this project topic, subject, and domain are used as synonyms. CHAPTER 1 2 corpora are collected from the web by the researchers. It has, however, also been applied with varying success to the classification of traditional2 genres, which are generally drawn from existing corpora, such as the Lancaster-Oslo-Bergen Corpus of British English (Johansson, Leech & Goodluck, 1978). For the most part, these studies are concerned with English corpora, but some work has been undertaken in German, Greek, Korean, Russian and Swedish (see for example Wastholm, Kusma, & Megyesi, 2005; also Stamatatos, Fakotakis, & Kokkinakis, 2000a). These studies vary widely in application, features and training methods, but have in common determining the best features for classifying genres. It is with these studies, which seek to automatically classify traditional genres, that this research project is concerned. Chapter 1 serves as an introduction to the automatic genre classification task of this research project. 
The chapter commences with a brief overview of the research area and the contextualisation of this research project, in Section 1.2. Next, the problem statement is described in Section 1.3. Then the research questions are provided in Section 1.4. The research aims of this project follow in Section 1.5. Thereafter, the statement of the central hypotheses is presented in Section 1.6. Then an overview of the methodology is provided in Section 1.7. Finally, the chapter concludes with the chapter outline showing the structure of this dissertation, in Section 1.8.

1.2 Overview of the research area and contextualisation

A distinction is drawn between the genre classification of web-specific and traditional genres in Section 1.1. This project is concerned with traditional genres in particular. Automatic genre classification studies that are concerned with traditional genres are characterised by three main concerns: corpus, features and learning methodology. The corpus provides the texts that are to be classified as well as the data, which are the basis of classification. The data are in the form of feature frequencies extracted from the texts. This feature set is predetermined based on the hypothesised characteristics of a particular genre and also on features that have been found useful in other studies. The choice of these features must be undertaken with care, in order to reduce the likelihood of many irrelevant features. This is because, in general, many learning methodologies are adversely affected by too many irrelevant features. In Chapter 4, it will be seen that this is something that the learning technique used in this project is fairly robust to.

2 These genres are also referred to as paper or print genres. These terms are not used here as they place too much emphasis on medium.
3 Known as the LOB corpus.

The features are mainly of two types: lemmas (and sometimes, word-forms) and 'linguistic' features. The former set is more traditionally used in automatic text classification studies and is known as the bag-of-words (BOW) approach. The BOW approach is hypothesised to be, and has sometimes been found to be, less useful in automatic genre classification studies, as it is too topic-specific (Finn, 2002:75). Demonstrating this is the main concern of many studies, which aim to show that 'linguistic' features are more useful in distinguishing genres and should be used instead. The word 'linguistic' is used because it is not altogether clear that the BOW approach is 'unlinguistic'. Indeed, it is evident in several studies (Argamon & Dodick, 2004a; also Santini, 2005b) that both feature sets can be used well together, and moreover, that the BOW approach can be used successfully, but with more careful word selection. Once the features have been determined, they are extracted from the texts using various techniques, for example, regular expressions to match search terms and extract frequency counts (a brief illustrative sketch of this kind of extraction is given below). Then the hypothesised differences between genre classes are sometimes explored in terms of these feature counts. This is done in order to form a more accurate idea of the discriminatory ability of particular features and thus of their likely usefulness in training a genre classifier. Such intermediate exploration of features is also useful in the removal of irrelevant features before training. Once the final feature set has been determined, a classifier is developed on a training set from which it learns to make future classifications.
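As an illustration for readers without a programming background, the following minimal Python sketch shows regular-expression-based frequency counting of the kind referred to above. The word lists and feature names below are invented placeholders, not the project's feature set (which is given in Appendix 2), and the project itself extracted its features with POS tagging, XML mark-up and the STATISTICA Text Mining and Document Retrieval module rather than with hand-written code of this sort.

```python
import re

# Hypothetical, heavily abbreviated word lists; the project's actual lists appear in Appendix 2.
FEATURES = {
    "hedges": [r"\bperhaps\b", r"\bpossibly\b", r"\bsomewhat\b"],
    "conjuncts": [r"\bhowever\b", r"\btherefore\b", r"\bfurthermore\b"],
    "first_person": [r"\bI\b", r"\bwe\b"],
}

def feature_counts(text: str) -> dict:
    """Return raw frequency counts of each feature category for one essay."""
    counts = {
        name: sum(len(re.findall(p, text, flags=re.IGNORECASE)) for p in patterns)
        for name, patterns in FEATURES.items()
    }
    counts["word_count"] = len(re.findall(r"\w+", text))  # a simple text statistic
    return counts

essay = "However, we may perhaps argue that the evidence is somewhat limited."
print(feature_counts(essay))
# {'hedges': 2, 'conjuncts': 1, 'first_person': 1, 'word_count': 11}
```

Raw counts of this kind correspond to the raw data set described in the abstract; normalising each count for text length, or applying an inverse document frequency or logarithmic transformation, yields the other three data sets used in this project.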
Phrased statistically, the classifier is trained on multiple independent variables (the features), which predict the dependent variable (the genre class). The classifier is then tested on a set of texts that it has not observed before. It is desirable to achieve high accuracy on the training set because, based on this accuracy, it can be deduced that the features used adequately extract the differences between particular genre classes. Additionally, it is desirable to achieve high accuracy on the test set because this shows how well the classifier performs on unseen data and how well it can be expected to perform on new test sets. This is referred to as the classifier's generalisation ability. If the classifier is to be re-usable it must have a high generalisation ability. An accuracy that is too high on the training set can, however, indicate that, although the training set can be perfectly or near perfectly classified, the classifier is too attuned to the idiosyncrasies of the training set. As a result, it fits the training set very well but does not generalise well. This situation is referred to as overfitting, because the classifier overfits the training set. The opposite, and equally undesirable, situation is called underfitting, as the classifier underfits the training set. Various learning methodologies are used, such as factor analysis (Biber, 1988), discriminant analysis (Karlgren & Cutting, 1994), k-nearest neighbour (Wolters & Kirsten, 1999), multiple regression (Stamatatos, Fakotakis, & Kokkinakis, 2000b), logistic regression (Boese, 2005), decision-tree learning (Finn, 2002), Naive Bayes (Santini, 2004a), and support vector machines (Argamon & Dodick, 2004a). These techniques are all explained and reviewed in context, in Chapter 2.

1.3 Problem statement

The norms and expectations of academic discourse introduce concerns for both students and lecturers. These concerns revolve around elucidating precisely what the attributes of academic writing and the discourse ideals are, for informing assessment and for informing student academic writing; the former is the main concern of this project. Academic writing is regarded as an essential means of communication in tertiary education and determines students' success at tertiary institutions. Hyland (2004a:5) defines successful academic writing as "the ability of writers to offer a credible representation of themselves and their work, by claiming solidarity with readers, evaluating their material and acknowledging alternative views". There are three main approaches to academic writing: the skills-based approach, the acculturation approach and the practice-based approach (Lea & Street, 1997, cited in Lea & Street, 2000). The first approach considers that there is a set of skills applicable to all academic disciplines, and that can be learned and transferred to any academic context (Lea & Street, 2000:34). The second approach Lea and Street (1997, cited in Lea & Street, 2000) term the "academic socialisation approach", in which the task of the lecturer is viewed as one of socialising students into a new 'culture' (Lea & Street, 2000:34). The third approach is referred to as the "academic literacies approach" (Lea & Street, 1997, cited in Lea & Street, 2000). In contrast to the first two approaches, the third approach does more than merely acknowledge disciplinary and departmental differences in academic literacy practices.
This approach views academic institutions as sites of "discourse and power" (Lea & Street, 2000:35) and academic literacies as social practices. It thus views academic literacy as encompassing a variety of communicative practices, which includes different fields and genres. Furthermore, it sees each communicative practice in context where social meanings and identities are evoked. This research project takes this third approach in that it acknowledges the differences between communicative practices, in particular those of various academic genres. Moreover, this project views student writing, as academic writing, in terms of meaning- making and ideological conflicts (Davidson & Tomic, 1999; Turner, 1999; also Ivanic, Clark & Rimmershaw, 2000). As a result (as will be shown in Section 2.2) 'good' and 'bad' examples of the argumentative essay are labelled as such not because they indicate skills or a deficit of skills but rather because they are not in keeping with the discourse ideals of the gatekeepers, as can be seen from the grade awarded them. This research project seeks to examine the genre of the argumentative essay. These essays are written by students within an academic context. According to Van de Poel (2006:17), a particular academic context, in which academic writing takes place is constructed from: (A) a limited repertoire of text genres; (B) an author who is defined as an academic in some way, e.g. a lecturer or student; (C) a main goal that is to render a point of view about an academic topic; (D) an objective and argumentative way of writing; and (E) a set of conventions regarding referencing and layout. Furthermore, an academic text bears the following characteristics (Van de Poel, 2006:18): (A) It is well embedded in an academic context. (B) Its point of departure is a thesis or a research question. (C) It intends to persuade the ideal audience. (D) It delivers the author's personal view with respect to the central tenet of the text. CHAPTER 1 6 (E) It is written by an author who is not necessarily made prominent. (F) It contains standardised formal characteristics. This research project contests the last point as it implies that all academic texts have the same formal characteristics. The project extracts various formal characteristics of one genre of academic writing, in order to determine whether linguistic features can be used to classify texts, even within one genre. Texts representative of this genre, argumentative essays, serve to confirm or reject a thesis statement, or to persuade the reader of the writer's point of view, and as such are defined as instances of argumentative writing (Van den Poel, 2006:75). This ability to argue based on facts and examples, reason and consequence, authority, subjective judgement and deliberation of pro and cons (Van den Poel, 2006:80) is considered valuable in Western education (English, 1999:17). Therefore, it is essential that students learn to argue in writing, in order to succeed in many academic discourse communities. Intuitively, it follows that evaluative feedback plays an important role in acquiring this knowledge. An automated feedback system would provide an opportunity for lecturers to provide more detailed feedback in a shorter period of time. A starting point of this type of evaluation is an automated means of determining the standard of students' essays. 
To this end, a program that can analyse the presence of features indicative of proficient academic writing would provide a means for lecturers to pay more attention and time to students who struggle in their writing. An example of a program with similar aims is Trushkina (2006), which automatically detects lower-level language errors in L2 English learners' argumentative essays, in order to allow lecturers time to focus on higher-level phenomena (see also Louw, 2006). In addition, such a program could inform lecturers as to the particular attributes of the genre that require attention on the part of both the learner and lecturer. This research project is mainly concerned with the former goal of such a program, but also sheds some light on features of the genre at hand. Such a task is one of binary classification, where the essays are grouped into two classes: 'good' or 'bad' examples of the argumentative essay genre. Essays that output 'good' were considered indicative INTRODUCTION 7 of a student who has successfully acquired the norms of academic writing within the genre of argumentative essays. Such a system would separate essays needing less feedback and volume of expert correction (classified 'good') from those needing more attention (classified 'bad'). In this way, the system would allow a senior marker (for example, a lecturer) to give student essays classified as 'good' examples of the genre to junior markers (for example, teaching assistants). This would afford the senior marker time to pay more attention to essays of a 'poorer' quality. This classifier could even be biased to classify texts as 'bad', rather than 'good', in cases of uncertainty to ensure that essays labelled 'bad' examples are not given to junior markers who may not be able to provide the kind or volume of feedback required. Determining the approach to this type of feedback system requires some reframing of the problem at hand. This involves putting forth some hypotheses regarding the nature of the classification task. These hypotheses are detailed in Section 1.6. The approach this research project takes, is one of automatic genre classification. Major studies in this field (to be reviewed in Chapter 2) reveal that this approach has not been applied to so subtle a genre class as argumentative essays. Much of the work in pedagogy within corpus linguistics compares non-native speaker corpora with native speaker corpora, in order to compare patterns of use of lexis and grammatical structures (Flowerdew, 2002:98). Examples are Granger and Rayson (1998), who compare word frequency profiles from the International Corpus of Learner English (ICLE), a corpus of argumentative essay writing by advanced non-native learners, to a control corpus from the Louvain Corpus of Native English Essays (LOCNESS), as well as Hyland and Milton (1997), who investigated native speaker and non-native speaker high school students' argumentative academic writing in terms of expression of doubt and certainty. This project, however, does not seek to compare the differences in argumentation between non-native speakers and native speakers but rather to determine 'good' and 'poor' examples of argumentative essays within a group. 
The project is concerned with CHAPTER 1 8 the argumentative writing of native speakers, because then it is likely that there will be fewer minor errors (spelling and morphological errors, see Trushkina, 2006:155), making it easier to extract linguistic information, such as part-of-speech (POS) tags, which rely on correct language structure to achieve high accuracy (such errors characterise, for example, the Tswana Learner English Corpus compiled at North-West University, South Africa). In addition to the automatic genre classification approach, there are other natural language processing approaches to the problem addressed by this research project (see for example, Teufel & Moens, 1999; also Buckingham Shum, Uren, Li, Domingue & Motta, 2002). Two relevant examples of such an approach are Moreale and Vargas- Vera's (2003) automated argument extraction tool, and Burstein, Marcu, Andreyev, and Chodorow's (2001) thesis statement classifier. Moreale and Vargas-Vera's (2003) automated argument extraction tool (similar to this research project) is concerned with argumentation in students' essays. This tool is not of a classificatory nature, rather it seeks to categorise and highlight argumentative strategies in students' essays. Thus, the output of the tool is intended to assist students in evaluating their own work (formative) and as a supplementary tool for marking (summative). Burstein, Marcu, Andreyev, and Chodorow's (2001) thesis statement classifier seeks to identify the thesis statement in essays. Unlike Moreale and Vargas-Vera's (2003) tool, this classifier is not an end- product, but the creators suggest that the features of a particular essay's thesis statement could be of evaluative use to the writer of the essay (Burstein et al, 2001:98). As argumentation in academic writing is valued, it determines students' success at tertiary institutions. It is thus essential for students to learn to argue in writing, in order to succeed in many academic discourse communities. Feedback plays an important role in acquiring this knowledge. This project aims to develop a classifier that will ease the workload of senior markers, in order to allow them additional marking time, thereby allowing them to provide higher quality feedback. INTRODUCTION 9 1.4 Research questions The following research questions arise from the preceding discussion: 1. What are the most discriminating linguistic features between 'good' and 'bad' examples of the argumentative essay genre? 2. Can these linguistic features be easily computed and extracted? 3. Can an automatic genre classification approach be used to develop a classifier, which will categorise prototypical and non-prototypical argumentative essays of student writers, into 'good' or 'bad' examples of the genre? 4. Will support vector machines (SVMs), as a machine learning technique, provide good generalisability, especially across domains, while requiring the least amount of human effort? 1.5 Research aims In response to the research questions, this project aims to: 1. Establish the most discriminating features between 'good' and 'bad' examples of the argumentative essay. 2. Determine whether these features can be easily computed and extracted. 3. Develop a classifier using an automatic genre classification approach, which will categorise prototypical and non-prototypical argumentative essays of student writers, into two classes: 'good' or 'bad' examples of the genre. 4. 
Determine whether SVMs will provide good generalisability, especially across domains, while requiring the least amount of human effort. CHAPTER 1 10 1.6 Statement of the central hypotheses This research project posits several hypotheses regarding the approach towards the classification task. According to Grabe and Biber (1987, cited in Biber, 1988:204), student essays use the surface form of academic prose, but are relatively non- informational and extremely persuasive. They therefore deduced that student essays "do not have a well-defined discourse norm in English" (Grabe & Biber, 1987, cited in Biber, 1988:204). 1. The first hypothesis is in reaction to this. It is hypothesised that there are computationally extractable differences between argumentative essays in a higher-grade band ('good') and those in a lower-grade band ('bad') that can be used to predict the classes of new essays. 2. Second, these class differences can be adequately represented by linguistic features that are easy to compute and extract. 3. Third, the differences between essays that place them in a higher- or lower-grade band are indicative of the prototypicality of the essays; therefore, 'good' essays can be viewed as prototypical and 'bad' essays as non-prototypical. 4. Fourth, this prototypicality of argumentative essays can be extended to a genre class so that this classification task can be viewed as one of genre classification. In this case, 'good' essays are instances of the argumentative essay genre, while 'bad' essays, although still examples of the genre are poor instances of the genre. 5. Accordingly, it is then hypothesised that previous automatic genre classification studies can be used to inform this project in terms of features and methodology. This is supported by the fact that the feature set in major automatic genre classification studies remains fairly constant; possibly because they have their origins in Biber's (1988) language variation study (see Chapter 2). This is a positive indicator for this project as it implies that features that have worked well in other projects can be used with some confidence in this research project. It can also be deduced that other aspects of these studies can be used to guide this research project, such as evaluation metrics. INTRODUCTION 11 1.7 Overview of methodology Initially, a literature review of the field of automatic genre classification was conducted, in order to determine the standards of practice in terms of applications, corpora and genres, learning techniques, features, and evaluation metrics. Thereafter, nine main steps were followed, in order to develop the classifier. The first step was to select the machine learning technique (the algorithm). The best machine learning technique was determined by the literature review on automatic genre classification as well as a review of machine learning. The second step was to identify and acquire a corpus from which to extract the features. The third step was to choose the features that were to be used. The features were chosen based on the literature review of automatic genre classification and two well-known grammar books: Biber, Johansson, Leech, Conrad, and Finegan (1999); and Quirk, Greenbaum, Leech, and Svartvik (1985). The fourth step was to prepare the texts before the features were extracted. This preparation included the removal of formatting, essay questions, essay titles, bibliography, appendices, headings, footnotes, graphs, illustrations, tables, some of the punctuation and equations. 
It further entailed character set conversion,4 the standardisation of apostrophes and quotation marks, and tokenisation. The fifth step was to mark up sentences, paragraphs, quotations, references, punctuation marks, nominalisations, two- and three-word complex prepositions, two-word adverbs, and multi-word conjuncts using XML tags, with part-of-speech (POS) tags. The sixth step was to extract the features using STATISTICA's text miner, the STATISTICA Text Mining and Document Retrieval module (Statsoft, 2006). The seventh step was to standardise the essays' grades (the dependent variable), to remove multiple occurrences of features from the data set (data cleaning) and to transform the data in three ways. This step also entailed reducing the feature set using feature selection tests. In order to determine which feature selection tests to use, the features were assessed for normality using four descriptive methods. The eighth step was to train the classifier using STATISTICA SVM (Statsoft, 2006).5 The final step was to test the SVM classifier. The process of developing the classifier is illustrated in Figure 1.1 below.

4 The texts were converted from Unicode to ISO/IEC 8859-1.
5 This tool will be described in Section 3.7.

Figure 1.1: Process of developing the classifier (select the machine learning technique; identify and acquire a corpus; choose the features; prepare texts before feature extraction; annotate the corpus; extract the features; prepare data before classification; train the classifier; test the classifier)

This project made use of SVMs for classification not only because this technique has shown good performance in a variety of pattern classification problems (Burges, 1998:121, see also Schölkopf & Smola, 2002:22), but also because it has been shown to have good performance in automatic genre classification studies (this is discussed in Chapter 2). It thus seems reasonable to assume that SVM learning is one technique that can be expected to perform well for the problem posed by this research project. In addition to selecting a technique that is not necessarily generally the best technique for various problems, but at least one of the better techniques for this problem, it is also important to determine how good performance is to be determined and measured (Hand, 1997:3), that is, what is meant by good for this project? In light of the practical outcome
Thereafter, a review of automatic genre classification will be presented, in order to detail the background to the features and methods used in this research project. Where possible, comparisons will be made between these projects, and each statistical technique used in these studies explained. Furthermore, these studies will be critically assessed in terms of the validity of pre-defined genre classes, results, evaluation measures and the features used for genre extraction. This will be done to determine the potential value of features and methodology for application to this research project. The literature review will first review seminal works in the field. Thereafter, contemporary automatic genre classification studies will be reviewed with detailed reference to projects that are relevant to this research project, with regard to application, corpus, features, or method. Finally, studies that use SVMs for the purposes of genre classification will be reviewed. Chapter 3 will provide detailed background on the data and learning methodology used to develop the genre classifier. This chapter will discuss all the features deemed potentially relevant as good predictors of prototypical or non-prototypical examples of argumentative essays. Thereafter, text preparation before feature extraction will be detailed. Next, the annotation of the features will be described. Thereafter, data CHAPTER 1 14 preparation before classification will be detailed. Lastly, SVMs for the linearly separable and non-linearly separable case will be presented.6 Chapter 4 will detail how the data were used in training the SVM classifier, and the method used in the selection of the training parameters. The potential data concerns of imbalanced data sets, differing misclassification costs, and the normal distribution assumption of the data set will be raised and addressed in terms of this research project. Next, various evaluation indicators and metrics will be presented, and the most suitable accuracy measure for this project will be discussed. Thereafter, the results of the various classifiers built on different data and feature sets, using C- and v- SV classification, and two kernels will be reported. This chapter will also address various hypotheses, some of which are raised in Chapter 3. Finally, the best classifier's performance will be analysed and seven potential reasons put forth for the results. In Chapter 5, the dissertation will be concluded with a summary of the preceding chapters. Furthermore, the results and findings of this study will be reviewed, with reference to the hypotheses postulated in Chapters 3 and 4. Thereafter, recommendations for future research will be made. 1.9 Summary This chapter provided an overview of the background to this research project. First, an introduction to automatic genre classification and the research area was provided. Thereafter, the problem statement was described and the central hypothesis stated. Then, the research questions arising from the problem statement and the corresponding research aims of this project were delineated. Next, an overview of the methodology of this research project was outlined. Finally, the chapter outline showing the structure of the following sections was sketched. Chapter 2 will define genre and present a review of major studies in the field of automatic genre classification. 6 SVMs for data that can be separated in space by a line, and data that cannot be separated by a line. 
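Before turning to the literature review, the evaluation metric singled out in Section 1.7 can be made concrete for readers without a machine learning background: recall of the 'bad' class is simply the proportion of truly 'bad' essays that the classifier also labels 'bad'. The short Python sketch below uses invented toy labels purely for illustration; it is not part of the project's STATISTICA workflow, and the numbers bear no relation to the results reported in Chapter 4.

```python
# Toy gold labels and classifier predictions for eight essays (invented for illustration).
gold = ["bad", "bad", "bad", "good", "good", "good", "good", "bad"]
pred = ["bad", "good", "bad", "good", "good", "bad", "good", "bad"]

def recall(gold, pred, target):
    """Proportion of essays truly belonging to `target` that are also predicted as `target`."""
    hits = sum(1 for g, p in zip(gold, pred) if g == target and p == target)
    total = sum(1 for g in gold if g == target)
    return hits / total

print(recall(gold, pred, "bad"))   # 0.75: three of the four 'bad' essays are caught
print(recall(gold, pred, "good"))  # 0.75: one 'good' essay is needlessly sent to the senior marker
```

The asymmetry noted in Section 1.7 explains the choice of metric: a 'good' essay wrongly labelled 'bad' merely costs the senior marker some time, whereas a 'bad' essay wrongly labelled 'good' reaches a junior marker who may not be able to provide the kind or volume of feedback required.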
CHAPTER 2
Literature review

The word [genre] is highly attractive — even to the Parisian timbre of its normal pronunciation — but extremely slippery (Swales, 1990:33)

2.1 Introduction

Machine learning is a technique used to 'teach' a program (referred to as the learner) the features of the classes it must learn to classify. The goal is for the program to be able to extend the 'knowledge' gained by training to unseen data, in order to classify this new data into the classes defined during training. Vapnik (2000:19-20) represents this kind of learning by a "model of learning from examples". This model is illustrated in Figure 2.1 below, where G is the generator of the data, S is the target operator or supervisor's operator, and LM is the learning machine.

Figure 2.1: A model of learning from examples (Vapnik, 2000:20), in which the generator G produces x, the supervisor S returns y, and the learning machine LM returns its own response

During learning, the LM observes the training set, pairs (x, y). Once the LM has been trained it must be able to return a value y* for any given x. It is intended that such a y* value approximates S's response y. For this genre classification task, x represents the essay and y the classification label, 'good' (example) or 'bad' (example). Naturally, in order for a learner to learn, there must be a 'teacher'. 'Teaching' is referred to as supervision. There are differing degrees of supervision, ranging from supervised to unsupervised learning, in which the amount of human intervention involved is minimal. Essentially, supervision refers to the degree and types of annotation of data, as well as the amount of information (in the form of instructions) the computer is given regarding the classification task; such as what data must be classified, how the data must be classified and into which classes the data must be classified. Machine learning can, of course, be used for many other learning problems that do not require explicit classification as an end-product. For the purposes of this research project, however, the introduction provided above depicts the type of machine learning this project is concerned with; classification of students' essays into prototypical ('good') and non-prototypical ('bad') examples of argumentative essays. This chapter defines such automatic genre classification in Section 2.2. This section does not provide a detailed overview of the different uses of genre, but rather outlines the background to the definition of genre assumed in this study, and further elucidates what is meant by genre. It should be noted that not all researchers in the field of automatic genre classification assume the same definition of genre. This is further clarified in Section 2.3, which provides an overview of the state-of-the-art in automatic genre classification, with particular emphasis on genre classification studies that make use of SVMs. This section reviews the features, methods, and results of the automatic genre classification systems of previous work in the field. It also briefly discusses the types of problems that can be solved, using the approach and techniques of automatic genre classification.

2.2 Defining genre

Originally, genre referred to a kind of picture, which depicted a scene from ordinary domestic life, and became extended in usage to refer to classes of articles (Swales, 1990:33). An overview of the term's development in folklore, literary studies, and rhetoric is provided by Swales (1990, see also Hyland, 2004b:25-50). This section, however, is concerned with the use of the term in linguistics.
This usage is similar to the meaning Swales (1990) intends when using the term genre, as does the ethnographer, Saville-Troike (1982) who lists greetings, lectures and jokes as some examples of genre types (Swales, 1990:39). LITERATURE REVIEW 17 The use of genre adopted by this project follows that of Swales and other Hallidayean linguists. In order to explain this use of genre some reference must be made to register. Register is analysed according to field, tenor and mode. Field refers to the content and type of activity involved; tenor refers to the role, relationships, and status of the participants; and mode refers to the channel of communication (Swales, 1990:40). Collectively, field, tenor and mode act as "determinants of the text through their specification of register" (Halliday, 1978:122). In this way, according to Martin (1985), genres are realised through registers, and registers themselves are realised through language (see also Lee, 2001:46). Martin (1985) provides similar examples to those of Saville-Troike of genre types: lectures, seminars, poems, narratives and manuals (Martin, 1985:250). Genre thus determines the way field, tenor and mode can be combined in any linguistic situation, in any particular culture (Swales, 1990:41). This last remark is important as genre types are not the same in all cultures. Therefore, deconstructing the norms of genre types can be helpful for cross-cultural awareness and education of, for example, students learning the rules and structure of argumentative essays. Furthermore, Martin's (1985) view of genre leads to an analysis of discourse structure, which looks at the beginning, middle and ending of a text. These stages of development also separate register from genre in that register can be identified at the sentence-level, whereas genre can only be realised in completed texts. Accordingly, genre determines "the conditions for beginning, continuing and ending a text" (Couture, 1986:82). As examples of genre, Couture (1986:87) offers the research report and business report, and as examples of register, the language of scientific reporting and the language of newspaper reporting. In the case of this study, the register being used (or rather the target register) is the language of academic writing and the genre of the argumentative essay. Genres and registers are often complementary and, according to Couture (1986:86), successful textual communication may require demonstration of the appropriate relationship between the genre and register systems. In this research project, it is assumed that for the students to acquire a 'good' mark for their essays, they will need to CHAPTER 2 18 demonstrate their acquisition of the norms of the language of academic writing, and the structural rules of the genre: the argumentative essay. The usefulness of genre analysis and classification has at times been questioned, and accused of leading to "heavy prescription and slavish imitation" (Swales, 1990:38). After reviewing the attitudes towards and the use of genre in the disciplines of folklore, literary studies, linguistics and rhetoric, Swales (1990) demonstrates some commonalities in the stance of academics in these disciplines. From this, he deduces that contrary to what he terms "ancient misapprehensions" (Swales, 1990:37), genre theory can indeed be useful for educating students without resorting to "narrow prescriptivism" (Swales, 1990:45). 
Moreover, educating students about genres can illuminate reflections upon linguistic and rhetorical choices for students as writers, rather than deny them such opportunities of choice in structuring their writing (Swales, 1990:45).

In attempting to establish a working definition of genre, Swales discusses genre membership. This leads to the question of what it is that determines membership of any particular genre. He proposes two ways of determining the answer to this: the definitional approach and the family-resemblance approach (Swales, 1990:49). The definitional approach requires drawing up a limited set of simple properties that would define all and only the members of a particular genre, setting them apart from anything else (Swales, 1990:49). He provides many counter-arguments to this approach, with examples, which will not be detailed here; their essence is that such a definition is often difficult to accomplish in the case of genre types.

The next approach, family resemblance, is concerned with similarities and relationships between members of a group, as opposed to a set of limited properties. The family-resemblance approach, as proposed by Wittgenstein (1953:31), led to prototype theory. Prototype theory is associated with Rosch (1975); it examines members of classes along a continuum from least typical to most typical. The member that is established as most typical is the prototype of that class. In terms of this project this means that although the essays are all instances of the argumentative essay genre, not all are typical members. Rather, some essays are most typical members and thus characterise the genre the most, while other essays are least typical (peripheral) members. This is illustrated in Figure 2.2 below.

Figure 2.2: Framing the genre classification task in terms of prototype theory (less typical versus typical examples of the argumentative essay genre)

After establishing how genre type membership is determined, Swales turns to a short but considered definition of genre, which is adopted in this research project: genre refers to a class of communicative events, in which the participants share some communicative purpose(s). This amounts to the rationale of the genre, which determines the discursive structure, style and content. Exemplar members of a genre demonstrate patterns of similarity with regard to structure, style, content and intended audience (Swales, 1990:58). Such exemplars are generally considered prototypical by members of the discourse community.

As mentioned earlier, the task of this project is to label texts as prototypical and non-prototypical instances of the genre of argumentative essays. Determining which essays are prototypical and which are not, and in addition which features make prototypical essays prototypical, is not as unbiased as it may seem. This is because prototypicality is determined by the discourse community, in this case the markers of the essays. The writers of the essays are still in training to become members of the discourse community — Swales (1990:53) suggests the term "apprentice members" — and in this discourse community, similar to many others, there are gatekeepers. The educators and markers (often the same people) seek to teach the students the norms of the discourse community, and therefore essentially help preserve these norms and keep the non-compliant out (by giving their essays 'poor' marks). It therefore seems reasonable to assume that 'good' marks are indicative of prototypical essays and vice versa.
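The practical consequence of this assumption is that the binary labels used for training can be derived directly from the marks awarded by the markers. The sketch below illustrates such a mapping; the percentage scale and the threshold of 60 are purely illustrative assumptions, not the marking scheme actually applied to the corpus used in this project.

```python
# A minimal sketch of deriving binary prototypicality labels from marker-assigned
# marks. The percentage scale and the threshold of 60 are illustrative assumptions
# only, not the scheme used for the corpus in this project.

def label_essay(mark, threshold=60):
    """Treat essays at or above the threshold as prototypical ('good')."""
    return "good" if mark >= threshold else "bad"

marks = {"essay_001": 72, "essay_002": 48, "essay_003": 65}  # hypothetical marks
labels = {essay: label_essay(mark) for essay, mark in marks.items()}
print(labels)  # {'essay_001': 'good', 'essay_002': 'bad', 'essay_003': 'good'}
```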
CHAPTER 2 20 According to Swales (1990:52), communicative purpose, form, structure and audience are properties that determine the prototypicality of a member of a particular genre. It can therefore be argued that argumentative essays written within the sphere of academic discourse can be viewed as a genre type based on similarity of communicative purpose and audience. It is then assumed that this similarity must hold for form and structure too. The features relating to the form and structure of the students' essays used in this project are thus very important in classifying the texts. The features of such prototypical essays can, undoubtedly, be determined through detailed micro-linguistic analysis. Such detailed analysis, however, would not suit the purposes of this project, which seeks to make fairly quick classifications (even if some accuracies must be lost). Yet, determining (prescribing) the features of the genre a priori would potentially limit the accuracy of such classification. Therefore, a large selection of linguistic features is used to classify texts as prototypical or non- prototypical. The background to these features is discussed in Section 2.3 below, in a comprehensive review of the features and methods used in automatic genre classification, and the features used to classify the texts in the corpus will be discussed in detail in Section 3.3. 2.3 Overview of previous automatic genre classification studies This section reviews previous relevant work in the field of automatic genre classification, and aims to provide a review of the state-of-the-art in automatic genre classification studies. First, background work relating to genre classification and seminal works in the field are discussed; thereafter research projects that are relevant and have a similar purpose to this research project are reviewed in more detail. Finally, those studies that use SVMs for machine learning, for the purposes of genre classification, are discussed in detail. In the main, each study is discussed separately, because there is much variation between data, features, application and rationale of each project. It should be noted that as this 7 The most recent review of the state-of-the-art in this field was conducted by Santini (2004b). It is a very complete review but has a strong interest in web-specific classification, thus it tends to refer only briefly to studies relating to more traditional genres. LITERATURE REVIEW 21 project is only concerned with linguistic features of genres, genre classification of web- specific genres is for the most part not discussed as many of the features used, relate to layout and HTML encoding. Moreover, such genre classification studies are not directly relevant to this project as they have an entirely different purpose; they mainly classify electronic genres that are very different from traditional genres for purposes of information retrieval in web searches. They also often address matters particularly Q relevant to electronic genres such as genre evolution. 2.3.1 Overview of seminal works in genre classification studies Biber's (1988) seminal work provides the background to genre classification studies and has become a classic in this field (see Johannesson & Wallstrom, 1999, for a study that makes direct use of Biber's features). Moreover, it has influenced the Expert Advisory Group on Language Engineering Standards' guidelines on text typology (EAGLES, 1996:23-25). 
His work (Biber, 1988) in language variation sought to determine the dimensions upon which spoken and written varieties differ linguistically. The data he used were drawn from two corpora: the LOB Corpus of British English and the London-Lund Corpus of Spoken English (Svartvik & Quirk, 1980). In order to compensate for the lack of non- published written texts in the corpora, an additional collection of personal and professional letters were added to the two corpora. In order to establish the underlying dimensions upon which these varieties differ, he analysed twenty-three spoken and written 'genres' (Biber, 1988). Such a wide variety of 'genres' was covered in an attempt to make use of data that cover the complete range of situational variation. It should be noted that he makes use of the groupings already used in the corpora, and does not create his own additional groupings, this would seem to imply that he agrees with the labelling of such groupings as 'genre'. Indeed, he goes on to define his use of genre, using it to refer to "text categorisations made on the basis of external criteria relating to author/speaker purpose" (Biber, 1988:68). Furthermore, he See for example, Santini (2005a); Crowston and Kwasnik (2004); Shepard, Waters and Kennedy (2004); Rehm (2002); and Roussinov, Crowston, Nilan, Kwasnik, Cai, and Liu (2001). CHAPTER 2 22 considers text-type to refer to texts grouped according to similarity in linguistic form (Biber, 1998:70). It has already been established in Section 2.2 that form and communicative purpose are both considered genre-defining in this project. Thus Biber's (1988) referring to these groupings of texts, as 'genres' is not quite what is meant here. For example, the grouping biographies would be considered a genre type according to the meaning this research project assumes, but many of the other groupings: religion, academic prose, and humour, would not. Biber was, however, not actually seeking to classify genre types in his study, but rather, as previously mentioned, to determine the linguistic variations between spoken and written varieties. Therefore, the relevance of his use of genre is less important. Rather, it is his methodology and linguistic features that were deemed essential to informing this project. Biber (1988:71-72) made use of sixty-seven linguistic features in his study, which were identified from a survey on previous studies of spoken and written variation. Similar to the present study, he selected the largest possible range of potentially salient features and made no a priori decisions regarding their importance. He grouped these features into sixteen grammatical categories (Biber, 1988:72): (A) tense and aspect markers; (B) place and time adverbials; (C) pronouns and pro-verbs; (D) questions; (E) nominal forms; (F) passives; (G) stative forms; (H) subordination features; (I) prepositional phrases, adjectives, and adverbs; (J) lexical specificity; (K) lexical classes; (L) modals; (M) specialised verb classes; (N) reduced forms and dispreferred structures; (O) coordination; and (P) negation. He then determined the frequencies of each of these linguistic features in all the genres, in order to study co-occurrence patterns among the features. These co-occurrence patterns indicate functions or dimensions underlying the variation between varieties. In LITERATURE REVIEW 23 order to establish the underlying dimensions, Biber (1988) made use of factor analysis. 
This type of multivariate statistical analysis derives a reduced set of variables from a large set of original variables. In this case, the original variables were the frequencies of the linguistic features, which were reduced to a set of factors. Thus, each factor represents a group of linguistic features that had a high frequency of co-occurrence. In factor analysis, first the correlations between all features are established and displayed in a matrix (correlation matrix). Second, the size of the correlations are compared, for example, a large negative correlation indicates that the presence of the first variable correlates with the absence of the second variable. Similarly, for a large positive correlation, the presence of the first variable correlates with the presence of the second variable. The correlation coefficient, if squared, indicates the statistical significance of the relationship between variables by measuring the percentage of variance between them. This procedure is described in more detail by Biber (1988:80- 97). Using this technique, he (1988:115) determined seven factors: 1. informational versus involved production; 2. narrative versus non-narrative concerns; 3. explicit versus situation-dependent reference; 4. overt expression of persuasion; 5. abstract versus non-abstract information; 6. on-line informational elaboration; and 7. factor 7, indicating academic hedging, but unlabelled due to under-representation. The factors established by Biber (1988) are not used in this research project. Their mention is relevant, however, because they represent much of the variation between genres, albeit with the focus on written and spoken varieties. Moreover, because these factors were determined using the sixty-seven features, mentioned above, it is plausible to assume that the features can be potentially useful discriminators for the current project. Indeed, many of the features that will be discussed in Section 3.3 are derived from Biber (1988). This feature set used by Biber (1988) has been substantially enlarged in recent work; for example, Biber, Conrad, Reppen, Byrd, Helt, Clark, Cortes, Csomay and Urzua (2004). Several of the features used in this later work are also used in this project, as will be discussed in Chapter 3. CHAPTER 2 24 Karlgren and Cutting (1994) took Biber's work (1988; 1989) as a starting point for their research. They make use of similar features to those of Biber's study (1988; 1989), paying more attention to those that can be (readily) automatically computed. They made use of frequency counts of the following features (Karlgren & Cutting, 1994:1072): A) Parts-of- speech 1. nouns; 2. present participles; 3. present tense verbs; 4. prepositions; 5. adverbs; 6. first person pronouns; and 7. second person pronouns. B) Lexical words 1. it; 2. me; 3. that; 4. therefore; and 5. which. C) Ratios and lexical information 1. average number of words per sentence; 2. average number of characters per sentence; 3. type/token ratio; 4. average number of characters per word; 5. total sentence count; 6. total character count; and 7. long words (longer than six characters). The frequencies were computed for each of the texts, which were taken from the Brown University Corpus of Written American English (Francis & Kucera, 1982). Karlgren and Cutting (1994) then made use of discriminant analysis on these texts using the computed frequencies. 
This type of statistical analysis determines a set of discriminating functions, which can discriminate (to varying degrees of accuracy) between the classes LITERATURE REVIEW 25 under examination. Once these functions have been established, new and unseen texts can be categorised according to their feature counts. Karlgren and Cutting (1994) conducted three classification experiments, and similar to Biber (1988), they did so using the categories provided by the corpus. The categories have three levels: the first level is comprised of the classes informative and imaginative, and the second level is made up of the classes press, fiction, non-fiction and miscellaneous. The third level has fifteen groups, for example, reportage, popular lore, and skills and hobbies (Karlgren & Cutting, 1994:1071). As LOB is modelled on the Brown Corpus, the groupings at level three are the same. The three experiments were conducted using the different groupings of each level. The experiments entailed training (the reason for this is explained in Karlgren & Cutting, 1994:1073). Categorisation at the first level resulted in correct classification of 478 cases out of 500 (approximately 96 percent). At the second level grouping, 366 out of 500 cases were correctly classified (approximately 73 percent). The texts that proved the most difficult to classify, were from the subset labelled learned/humanities, which were largely misclassified as miscellaneous instead of non-fiction. These texts were instances of academic prose written within disciplines in the humanities faculty (Karlgren & Cutting, 1994:1072). This finding is relevant to this project, which seeks to classify texts drawn from various disciplines across most of the faculties (see Section 3.2 for actual faculties). Thus, this project is set quite a challenge. As the subset learned/humanities is relevant to this project some reasons are posited for its misclassification. The reasons are difficult to determine. Possibly, the four categories in level two are somewhat meaningless, and should therefore not be used for groupings; or else, the four level-three categories under miscellaneous {religion, skills and hobbies, popular law, belles lettres) are similar in feature frequency counts to those in learned/humanities. It seems unlikely that the texts in these five groups are similar in structure, perhaps then, more (either in number or in ability) discriminating features need to be used.9 9 See Sigley (1997) for more on corpora categories and their validity. CHAPTER 2 26 That this is a likely explanation is given further credence when regarding the classification results at the third level: 258 texts out of 500 were correctly classified (52 percent). And again, the group with the lowest accuracy level was that of learned/humanities. The texts in this group were misclassified chiefly as religion and belles lettres, which are grouped under miscellaneous (Karlgren & Cutting, 1994:1073). The 'genre' classes used in Karlgren and Cutting's (1994) study, similar to that of Biber's (1988), indicate a dissimilar definition of genre to what is intended by the term in this research project. Firstly, the labels given for level one are indicative of rhetorical classes rather than genre types. Secondly, of the categories at level two, only press, perhaps, may be considered a discourse type, while fiction and non-fiction are categories based on content rather than genre, and miscellaneous is void of any meaning as a category at any level. 
This somewhat contrived classification of texts may be an explanation for the poor classification results for the fifteen categories in experiment three (see also Kessler, Nunberg & Schütze, 1997:33).

In contrast with Biber (1988) and Karlgren and Cutting (1994), Kessler et al. (1997) use genre to refer to texts that are grouped according to similar communicative purpose, which in turn is connected to the formal properties of the texts. Accordingly, they term attributes indicative of genre type generic facets (Kessler et al., 1997:33). These facets, they explain, refer to the practical function and communication (often indicative of rhetorical strategies) of texts in a class, and are associated with a set of linguistic properties, which they term generic cues (Kessler et al., 1997:33). They claim three advantages for generic facets. Firstly, these facets provide a framework for understanding genres. Secondly, some applications requiring text classification, especially those in an information retrieval context, may find categorisation according to facet rather than genre advantageous. Thirdly, using facets as part of the genre classification solution, rather than genre type classes, allows an unknown genre category to be labelled in terms of its facets. Classifiers that are trained to recognise genre classes only will misclassify any unknown genre type.

As previously mentioned, Kessler et al. (1997:34) assert that these facets can be identified by generic cues, which are observable surface cues. In their study, they make use of fifty-five generic cues, represented by four groups: structural, lexical, character-level, and derivative cues. Structural cues are passives, nominalisations, syntactic categories of words and topicalised sentences. Examples of lexical cues are terms of address and words used to express dates. Character-level cues are punctuation marks, such as question marks and exclamation marks, capitalised and hyphenated words, and acronyms. Derivative cues are variation measures (such as standard deviation) and ratios derived from counts of lexical and character-level cues, such as characters per word. The ratios were combined (to form nearly three thousand different ratios) and represented as natural logarithms. For example (Kessler et al., 1997:34):

\alpha \log\frac{W+1}{S+1} + \beta \log\frac{C+1}{W+1} + \gamma \log\frac{W+1}{T+1} = (\alpha - \beta + \gamma)\log(W+1) - \alpha\log(S+1) + \beta\log(C+1) - \gamma\log(T+1),

where W = word tokens, S = sentences, C = characters, and T = word types.

Kessler et al. (1997), similar to Karlgren and Cutting (1994), make use of the Brown Corpus. In contrast to Biber (1988) and Karlgren and Cutting (1994), however, they make their own class distinctions, using three generic facets, which they label brow, narrative and genre. Brow relates to the intellectual background of the intended reader, and has the levels of popular, middle, upper-middle and high. Narrative is a binary facet indicating whether the text is written in a narrative style or not. The genre facet has the values of reportage, editorial, scitech, legal, non-fiction and fiction (Kessler et al., 1997:34).

For their study, Kessler et al. (1997) made use of logistic regression and neural networks. The first classification method they used, logistic regression, is a statistical technique in which the transformed values of the dependent variable (in this case, the facet) are predicted by a linear combination of the predictor variables (in this case, the
The transformation of the dependent variable is called the link function; the logit transformation is used for logistic regression (for more on this see Statsoft, 2004). The second classification method that Kessler et al. (1997) used, neural networks, is a type of machine learning technique that was originally motivated by biological learning systems (for computational details, see Mitchell, 1997:81-127). For the logistic regression method, accuracies of 78 percent for the narrative facet, 61 percent for the genre facet and 44 percent for the brow facet were obtained. The neural networks method performed better overall, with accuracies of 82 percent for the narrative facet, 75 percent for the genre facet and 47 percent for the brow facet. The detailed results for each subcategory, for both logistic regression and neural networks can be viewed in Kessler et al. (1997:37). It is relevant to note that lower accuracies were obtained for scitech, non-fiction, editorial and legal than for reportage and fiction. Kessler et al. (1997:36) suggest several possible reasons for lower accuracies. Sparse training examples are put forth as an explanation for poorer performance on the legal and scitech classes. For the non-fiction and editorial classes, lower accuracy was, in the main, a result of non-fiction being misclassified as editorial. As was remarked earlier, non-fiction is a category based on content, not genre. Moreover, non-fiction is at a higher level of analysis than editorial, which is a genre type. It seems then, that non- fiction is an unwise choice as a genre category. Indeed, Kessler et al. (1997:36) propose making editorial a subcategory of non-fiction in future studies. They also suggest further decomposition into facets labelled opinion and institutional author. The above three studies (Biber, 1988; Karlgren & Cutting, 1994 and Kessler et al., 1997) form the background to current studies in the field of automatic genre classification. The next section discusses contemporary genre classification studies that are relevant to this research project. 2.3.2 Overview of contemporary genre classification studies This section reviews major studies in automatic genre classification that are relevant to this research project, with regard to application, corpus, features, and method. The studies reviewed here are presented in chronological order, in order to trace the LITERATURE REVIEW 29 improvements on features and findings by different studies at different times on the same features. It should be noted, however, that equal comparisons between studies can often not be made due to a lack of standards in automatic genre classification, regarding a benchmark corpus for training and testing, evaluation metrics, baseline performance or indeed, training methodology. Therefore, each study is discussed separately, and similarities between studies at any level are highlighted. In their study, Wolters and Kirsten (1999) make use of content words, function words, POS frequencies10 tagged with the German Stuttgart-Tubingen Tagset (STTS) of 54 tags and punctuation. They used a German corpus, LIMAS, which was modelled on the Brown Corpus and therefore has similar 'genre' classes (Wolters & Kirsten, 1999:143). They provide in-depth details on feature sets and feature distributions, which they analysed before undertaking any classification tests, in order to determine if indeed the documents in their corpus differed according to their groupings (Wolters & Kirsten, 1999:144-145). 
What is most noteworthy is that they used decision trees as an exploratory device (and not for classification purposes). In Chapter 3, it will be shown that pre-classification exploration of features can help remove irrelevant features before training. In genre classification, decision trees are one of the most used classification techniques (see for example Finn, 2002; and Dewdney, VanEss-Dykema & MacMillan, 2001). Wolters and Kirsten (1999) use three fc-nearest neighbour algorithms (KNN): RIBL,11 learning vector quantisation (LVQ)12and IBLl(-IG).13 The KNN algorithm is a type of instance-based learning in which training examples are stored and used to classify new instances. This technique, unlike many machine learning techniques, can construct a different approximation to the target function for each new query, rather than attempting to approximate the target function for the hypothesis space only once (Mitchell, 1997:230-231). In order to classify new instances, the KNN algorithm assigns a particular classification to a new instance, based on the classification of its nearest neighbours, defined by standard Euclidean distance (Mitchell, 1997:231-232). 10 The corpus was tagged with the MALAGA system. See Beutel (1998) for more on the system. 11 See Emde and Wettschereek (1996) for more on RIBL. 12 For more on learning vector quantisation see Kohonen, Kangas, Laaksonen, and Torkkola (1996). 13 See Daelemans, Van den Bosch and Weijters (1997) for more on IBL1. CHAPTER 2 30 Wolters and Kirsten (1999) test their algorithms using different feature sets, composed of various combinations of content words, function words, POS and punctuation. Their combining of the various features, in order to determine each feature set's usefulness for genre classification, encouraged the approach to feature combination used in this research project. Their results (Wolters & Kirsten, 1999) are somewhat difficult to interpret for two reasons. Firstly, they conduct tests on classes that appear to be more subject than genre- related. They claim that they realise this distinction and are examining both genres (press texts and fiction) and domains; however, from the two 'genre'-only categories they provide this is not clear. Secondly, they report on precision and recall only, on particular tasks, which they do not specify in sufficient detail. Nevertheless, they appear to have excellent results (using 10-fold cross-validation).14 The most relevant results for this research project are on their task A, which contains forty-five academic texts drawn from the fields of humanities (H), and science and technology (S). For these academic texts, they achieve an average recall of 99.67 percent and precision of 100 percent on their test set, using content words (500 lemmas) and POS tags, training with the LVQ algorithm (Wolters & Kirsten, 1999:147). These results must be interpreted with caution because it appears that genre and topic classification are thrown together here. Genre and topic are not necessarily orthogonal to one another. In terms of classification, some overlap is therefore to be expected. It is not the overlap that is problematic here, but rather the lack of clear definition and identification of the target of classification. A clear definition of genre and the genre classes to be classified is required for results to be interpreted clearly and the findings of the study to be extended to other studies. The poor definition of genre and genre classes is a critique emphasised throughout this project. 
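To make the k-nearest neighbour procedure described above concrete, the sketch below implements classification by Euclidean distance with a majority vote over the k closest training examples. It is a minimal illustration only: the two-dimensional feature vectors and genre labels are invented, and it does not reproduce the RIBL, LVQ or IBL1(-IG) algorithms used by Wolters and Kirsten (1999).

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(query, training, k=3):
    """training: list of (feature_vector, label) pairs; majority vote over the
    k nearest neighbours by Euclidean distance."""
    neighbours = sorted(training, key=lambda pair: euclidean(query, pair[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical two-feature documents (e.g. normalised noun and pronoun frequencies)
train = [((0.31, 0.02), "academic"), ((0.28, 0.03), "academic"),
         ((0.18, 0.09), "fiction"),  ((0.16, 0.11), "fiction")]
print(knn_classify((0.27, 0.04), train, k=3))  # -> 'academic'
```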
In addition to Wolters and Kirsten's (1999) results, they report on classification on a larger data set, with differing results (they report only on precision in this experiment, using LVQ). Task H (109 documents) achieves its lowest precision at 19.6 percent and its highest precision at 100 percent; task S (72 documents) achieves its lowest precision For an explanation of cross-validation see Chapter 4. LITERATURE REVIEW 31 at 17.3 percent and its highest precision at 74.4 percent. This experiment thus confirms that POS tags are good discriminators for academic texts. In view of tasks H and S, Wolters and Kirsten (1999) provide strong evidence that different word types (function and content words, versus function words only) are not equally discriminatory for texts from the humanities and those representing the field of science and technology. This implies that cross-domain classification may not be as successful as task A may indicate. As they do not expound on the genre categories in task A, it is rather difficult to judge whether their classifier performed good genre classification across domains. But certainly, if they classify across domains, this can be favourably compared to Finn's (2002, see below) cross-domain classification accuracies, which are lower. Thus, it would appear that cross-domain genre classification has been successful to varying degrees. This is a factor that this research project considers and will report on in Chapter 4. In contrast to the traditional BOW approach to text classification tasks, Stamatatos et al. (2000a), make use of the frequencies of the most used words not in the training corpus but rather in a corpus representative of the written language of English: the British National Corpus (BNC; Aston & Burnard, 1998). In addition to these frequencies, they make use of the most frequent punctuation marks: full stop, comma, colon, semicolon, quotation marks, round brackets, question mark, and hyphen (Stamatatos et al, 2000a:812). They drew their data from a section of Wall Street Journal (WSJ): editorials (40 documents), letters to the editor (40 documents), reportage (40 documents), and spot news (40 documents).15 Following Karlgren and Cutting (1994), Stamatatos et al (2000a) make use of discriminant analysis for their genre classifier. They report on accuracies based on a 50 percent training set and 50 percent testing set, achieving 97.5 percent, using the top thirty words from their list derived from the BNC (Stamatatos et al, 2000a:811). Their findings indicate that using the top fifty words leads to overfitting. Their results inform the features and results of this project, which are reported in Chapters 3 and 4 respectively. Stamatatos et al. (2000a:811) provide further evidence that the traditional Documents containing What's News or Who's News (Stamatatos et al., 2000a:810). CHAPTER 2 32 BOW approach does not discriminate as successfully in genre classification as in topic text classification. Combining words from their BNC list and the top eight punctuation marks results in even better accuracy; depending on the number of words used, they can achieve 100 percent accuracy (Stamatatos et ah, 2000a:812). Moreover, this result stays far more stable when training with much smaller sets of data. This would seem to imply that punctuation is a good discriminator of genre and can thus be reliably used with corpora of limited size. 
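The kind of feature extraction used by Stamatatos et al. (2000a) can be pictured with a short sketch: each text is represented by the relative frequencies of a fixed list of very common words plus a small set of punctuation marks. The word list below is only an illustrative stand-in for their BNC-derived ranking, and the sketch stops at the feature vector; it does not implement their discriminant analysis.

```python
import re
from collections import Counter

# Stand-ins for the most frequent words of English and the eight punctuation
# marks used by Stamatatos et al. (2000a); the word list is a small sample only,
# not the actual BNC-derived ranking.
COMMON_WORDS = ["the", "of", "and", "a", "in", "to", "it", "is", "was", "that"]
PUNCTUATION = [".", ",", ":", ";", '"', "(", "?", "-"]

def frequency_vector(text):
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    n = max(len(tokens), 1)
    word_freqs = [counts[w] / n for w in COMMON_WORDS]      # relative word frequencies
    punct_freqs = [text.count(p) / n for p in PUNCTUATION]  # punctuation marks per token
    return word_freqs + punct_freqs

print(frequency_vector("The essay argues that the policy, in the end, was flawed."))
```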
Stamatatos, Fakotakis, and Kokkinakis (2000b) make use of twenty-two features derived from a text-processing tool for Modern Greek, called the sentence and chunk boundaries detector (SCBD). The features are grouped according to token level from the output of the sentence boundary detector, phrase level based on the output of the chunk boundary detector and analysis level from the text analysis of SCBD. These features are given below (Stamatatos et ah, 2000b:477^78): (A) Token level 1. detected sentences/words; 2. punctuation marks/words; and 3. detected sentences/potential sentence boundaries. (B) Phrase level 1. detected NPs (noun phrases)/total detected chunks; 2. detected VPs (verb phrases)/total detected chunks; 3. detected APs (adverbial phrases)/total detected chunks; 4. detected PPs (prepositional phrases)/total detected chunks; 5. detected CONs (conjunctions)/total detected chunks; 6. words included in NPs/detected NPs; 7. words included in VPs/detected VPs; 8. words included in APs/detected APs; 9. words included in PPs/detected PPs; and 10. words included in CONs/detected CONs. (C) Analysis level 1. detected keywords/words; LITERATURE REVIEW 33 2. special words/words; 3. assigned morphological descriptions/words; 4. chunks' morphological descriptions/total detected chunks; 5. words remaining unanalysed after pass 1/words; 6. words remaining unanalysed after pass 2/words; 7. words remaining unanalysed after pass 3/words; 8. words remaining unanalysed after pass 4/words; and 9. words remaining unanalysed after pass 5/words. Using the twenty-two features listed above, Stamatatos et al. (2000b) trained their classifier on a corpus of Modern Greek (250 documents, equally divided into ten genres), which they compiled themselves from Internet sources. The corpus is comprised of the genres of press editorial, press reportage, academic prose, official documents, literature, recipes, curricula vitae, interviews, planned speeches, and scripted broadcast news (Stamatatos et al., 2000b:481). It is clear from this division that although they provide some criticism of the corpora typically used in genre classification experiments, such as the Brown Corpus16, for its poorly defined genres, some of their genre groupings are also fuzzy. They are blurred both in terms of what is meant by genre in this project and in terms of their definition of genre (Stamatatos et ah, 2000b:480). The most relevant example of this is academic prose. This project considers academic prose a style of writing associated with different genres written in an academic context; recipes, in comparison, is what is meant by genre here. Making use of the twenty-two-variable feature vectors for each document Stamatatos et al. (2000b:479^180) train their classifier, using the statistical techniques of discriminant analysis and multiple regression. The latter technique is used to predict the value of a dependent variable from two or more independent variables, with each independent variable being assigned relative weight. This weight is determined by the relative contribution of each independent variable in determining the value of the dependent variable. Using the mean values of the variables, standard deviation and correlation coefficients of the variables can be derived, in order to determine the slope of a straight line that will pass through the majority of the data points used. This line is known as the See Section 2.3.1 for a discussion of these genres, in the review of Karlgren and Cutting (1994). 
regression line and it approximates the data, allowing new data points to be classified (Oakes, 1998:33-36).

Stamatatos et al. (2000b) report accuracies based on a 50 percent training set and a 50 percent testing set, which results in ten documents per genre for training and ten documents per genre for testing. Overall accuracy on both discriminant analysis and multiple regression is reported as an identification error rate of 0.18, which is the number of incorrectly classified texts divided by the total number of texts (Stamatatos et al., 2000b:482-483). The most significant result for this research project is presented in a confusion matrix of genre classification over all ten genres, showing that the error rate for academic prose is zero. This success, however, must be interpreted cautiously, as the classifier is tested on only ten documents, representing a rather small sample, and therefore it is likely that the sample may not be representative of the population. This may result in random error when attempting to generalise this model to other data (Biber, 1993a:219-220). Indeed, the learning curve Stamatatos et al. (2000b:491) present for seven to fifteen documents indicates that the classification accuracy does improve marginally (from an identification error rate of 0.18 to 0.15) with more training data.

Biber (1993b:243) states that linguistic representativeness depends on sample size, among other factors, such as a thorough definition of what the target is. He presents an equation that can be used to determine the required sample size for single features (Biber, 1993b:253). This equation shows that the required sample size is a function of the standard deviation relative to the mean of a feature (Biber, 1993b:254). In his example, the feature with the highest mean score is nouns, which is the most frequent word class (Biber, Johansson, Leech, Conrad & Finegan, 1999:65). The sample size required for this class is 59.8 texts. For any feature that is less frequent than this, a larger sample size will be required. This would seem to indicate that no single required sample size can be fixed for all features, but that ten texts are likely to be too few to be linguistically representative. The problem of training sample size is discussed in more detail in Chapter 4.

In addition to this successful classification of academic prose, the discriminatory ability of individual features of Stamatatos et al.'s (2000b) overall genre classifier is significant. They determine the overall contribution of each feature to the classification of their genres using t-tests, and find that the most significant indicators of genre are punctuation marks/words, words included in PPs/detected PPs and special words/words (Stamatatos et al., 2000b:491). Again, this must be interpreted with caution, as t-tests assume normal distribution, which is often not the case with linguistic data (Karlgren, 1999:153; also Santini, 2004a:3). Rather, non-parametric tests should be used, such as the Mann-Whitney U rank sum test (Oakes, 1998:11). Normal distribution, parametric and non-parametric tests will be discussed further in Chapter 3.

The former of the two facets suggested by Kessler et al. (1997:36), in Section 2.3.1, opinion, is one that Finn (2002) applied to his own study.17 He does not explicitly use the term generic facet, but rather genre class. Such a term can be misleading, as it appears to refer to genre types.
It is not immediately apparent but it seems that he uses the term to refer to aspects of a document by which it can be classified in terms of genre, rather than referring to genre types (see also Santini, 2006a for commentary on this). Finn (2002:3) cites the readability and level of technical detail of a document as examples. Finn's study (2002:11-15) draws texts from the web in the form of news articles (796 documents from three domains), and movie and restaurant reviews (1354 documents I Q from two domains). Details of the websites from which these documents originated, and the pre-classification of such documents can be found in Finn (2002:69-71). His study focuses on the genre classes of opinion/fact and positive/negative. The first genre class, opinion/fact, is a subjectivity classification and is studied by investigating whether news articles report facts or the author's opinion. Finn (2002) mentions that this is a distinction often made within the discourse of press reportage, and furthermore that documents offering the author's opinion are often editorials. The latter remark is relevant to the suggestion provided by Kessler et al. (1997:36), as discussed earlier, with regard to their classifier's poor performance on editorials. The second genre class, See also Finn, Kushmerick and Smith (2002). The corpus used by Finn (2002) is available for free download at http://www.smi.ucd.ie/hyppia.html. CHAPTER 2 36 positive/negative, seeks to establish whether movie and restaurant reviews are positive or negative. In addition to classifying texts according to these genre classes, Finn (2002) examines the accuracy of such classification across subject domains. All the studies discussed above make use of corpora drawn from a variety of subject domains, but none test or at least report any statistics on the classification accuracy across domains. Finn (2002) explores classification in the domains of football, politics, finance, movies, and restaurants by training the classifier on one domain and testing in another, thereby determining the cross-domain accuracy. This focus on cross-domain accuracies is relevant to this research project where all the data are drawn from and classified across domains. Finn's (2002) classifier was trained using three feature sets: BOW, POS and a set of features relating to lexical information, which Finn (2002:15) terms "hand-crafted text statistics". The hand-crafted (HC) set is composed of the following (Finn, 2002:107- 108): A) Lexical information 1. average number of sentences beginning with /; 2. average sentence length; 3. average word length; 4. long words (longer than five characters); and 5. number of words in document. B) Lexical words 1. stopwords:1 certainly, her, highly, him, his, it, large, little, me, mine, mostly, much, my, our, probably, that, they, us, very, we, where, which, you, yours; 2. keywords: absolutely, altogether, archive, article, column, completely, doesn't, editorial, enormously, entirely, extremely, fact, feature, fully, greatly, highly, intensely, isn't, news, opinion, perfectly, report, strongly, thoroughly, totally, utterly, and very. C) Punctuation symbols: / " $ % & '() * + , - . . • ; = . and ?. Stopwords are usually excluded in text classification that takes a BOW approach because of their high frequency. LITERATURE REVIEW 37 The above set of features was constructed specifically for the opinion/fact genre class and was gathered by asking people for their intuitions on good indicators of this class. 
Such an approach seems rather 'unlinguistic', which indeed it is, but it must be considered in light of the purpose of the classifier. This kind of 'genre' classifier is intended to be helpful to users of document retrieval systems. Therefore, it seems sensible to collect data from such users regarding their needs, judgements of relevance of documents and the factors upon which the judgements are based. Many studies in the field of genre classification of a similar purpose to Finn's (2002) project also acquire information for their task from users.20 Moreover, these studies conduct user surveys, in order to generate genre taxonomies. Finn (2002) makes use of each feature set separately, in order to compare their genre prediction accuracies across and within domains. He trains the classifier on these features using decision-tree learning. This technique is a type of inductive machine learning in which the learned function is represented by a decision tree. Decision trees classify instances in a tree structure, sorting instances down the tree from root to leaf nodes, with leaf nodes providing final classifications. Each node tests an attribute of an instance under classification, with each branch corresponding to a value of the possible values for a particular attribute. Such attribute testing iterates, so that classification continues down the tree (Mitchell, 1997: 52-53). For the first genre class, opinion/fact, the following accuracies (defined as the average percentage of correct class predictions on 10-fold cross-validation) in single domain experiments are reported: the POS feature set results in an average accuracy of 84.7 percent, the BOW feature set achieves an average accuracy of 87.2 percent, and the HC set results in an average accuracy of 88.3 percent (Finn, 2002:74). The accuracies in the domain transfer experiments are reported as follows: the BOW feature set results in an average accuracy of 67 percent, the HC feature set achieves an average accuracy of 71.8 percent and the POS feature set results in an average accuracy of 81.5 percent (Finn, 2002:74). See for example, Santini (2006a), Meyer zu Eissen and Stein (2004); also Roussinov et al. (2000). See Rosso (2005) for more on the role of users in generating genre taxonomies. Using C4.5 (Quinlan, 1993), a learning algorithm, which implements the decision-tree learning technique. CHAPTER 2 38 Finn's (2002) examination of the accuracies and analysis of the performance of these particular features are of relevance to this research project. He suggests that the POS feature set is a more generalisable indicator of genre class then the BOW feature set, which is domain specific (Finn, 2002:75). Furthermore, he deduces that the POS feature set and the shallow text statistics of the HC feature set perform the best overall for genre classification of opinion/fact. For the second genre class, positive/negative, the following accuracies in single domain experiments are reported: the POS feature set results in an average accuracy of 61.3 percent, and the BOW feature set achieves an average accuracy of 82.7 percent (Finn, 2002:78). This result is not surprising, as positive/negative appears to be a content-based rather than genre-based class. The accuracies in the domain transfer experiments were reported as follows: the POS feature set achieved an average accuracy of 47.1 percent and the BOW feature set resulted in an average accuracy of 47.8 percent (Finn, 2002:79). 
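As an illustration of the decision-tree learning described above, the sketch below fits a small tree to invented genre-class data. It uses scikit-learn's DecisionTreeClassifier purely as a convenient stand-in; Finn (2002) used C4.5, and the feature values and labels here are hypothetical.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical feature vectors: [first-person pronouns per 100 words,
# average sentence length, question marks per 100 words]
X = [[0.1, 24.0, 0.0], [0.2, 27.5, 0.1], [2.4, 15.0, 0.8],
     [3.1, 13.2, 1.1], [0.3, 22.8, 0.2], [2.7, 16.4, 0.9]]
y = ["fact", "fact", "opinion", "opinion", "fact", "opinion"]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["first_person", "sent_len", "questions"]))
print(tree.predict([[0.4, 21.0, 0.1]]))  # -> ['fact']
```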
Intuitively, the positive/negative result is to be expected: for a content-based distinction, content words can be expected to have greater discriminatory strength than POS features. In addition, this provides further evidence against using BOW only as a feature for building classifiers that generalise well; it is domain-based, and thus it cannot be expected to generalise well across domains. Finn (2002:76) concludes that the choice of discriminatory feature sets depends on the genre classification task at hand and that a combination of feature sets is likely to produce the most favourable results. This conclusion is significant for this research project and provides the motivation for the mixed feature set that was used (see Chapter 3 for the features used).

In their study, Lee and Myaeng (2002) take an unusual approach regarding the features they consider to be genre-revealing. Instead of linguistic measures, they use term frequency (tf) and inverse document frequency (idf).23 Tf was first suggested by Luhn (1957), in order to establish relative term importance in a document. The general premise of tf is that terms that occur frequently in a document are more representative of that document in terms of its meaning than terms that occur infrequently in that document (Jurafsky & Martin, 2000:651). Similarly, idf (first defined by Sparck Jones, 1972) is based on the hypothesis that terms that occur in specific documents, rather than being equally distributed over all the documents in a collection, are more likely to be discriminatory (Jurafsky & Martin, 2000:651 & 653).

They determine the tf and idf values for each term in each document in their corpus, which is comprised of English and Korean documents (7828 and 7615 documents respectively) collected from the web. These documents represent the following seven genres: reportage, editorial, research articles, critical reviews, personal homepage, Q&A (questions and answers), and product specification. Using the tf and idf values in various ratios, they derive an equation they term the "similarity-based approach" (Lee & Myaeng, 2002:146-148), which calculates the genre class of a new document. The results of training (for the English document set only), using the similarity-based approach on a training set of 50 percent and a testing set of 50 percent, are reported using micro-average recall/precision. The highest result of their approach is 0.87 (Lee & Myaeng, 2002:148). The result most relevant to this project is the accuracy obtained on the research articles genre using 130 terms: micro-average recall/precision is reported as 0.99, and accuracy (deduced from the confusion matrix) is 97.7 percent (Lee & Myaeng, 2002:148). This accuracy is reported on a test set of 600 documents, which removes any concern regarding sample size. Certainly, it would also be necessary to gain more knowledge about the homogeneity of the research articles in terms of topics, faculties in which the papers were written, and structure, in order to further interpret this high accuracy.

23 In their follow-up study, Lee and Myaeng (2004) make use of some linguistic features such as nouns, pronouns, exclamations, verb endings, person names and special symbols (punctuation marks, currency and mathematical symbols). However, the focus of their study is using genre classification to assist in topic classification; as a result, their reported results and methodology are not sufficiently detailed to review meaningfully here.
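The tf and idf weights described above can be computed in a few lines; the sketch below does so for a toy collection of three documents. The documents and terms are invented, and the sketch illustrates only the underlying weighting, not Lee and Myaeng's (2002) similarity-based equation.

```python
import math

# Toy document collection; each 'document' is just a list of tokens.
docs = {
    "doc1": "the method of analysis the corpus requires".split(),
    "doc2": "the essay presents an argument about the corpus".split(),
    "doc3": "a recipe lists ingredients and steps".split(),
}

def tf(term, tokens):
    """Relative frequency of a term within one document."""
    return tokens.count(term) / len(tokens)

def idf(term, documents):
    """Log of total documents over documents containing the term."""
    containing = sum(1 for toks in documents.values() if term in toks)
    return math.log(len(documents) / containing) if containing else 0.0

for term in ("the", "corpus", "recipe"):
    weights = {name: round(tf(term, toks) * idf(term, docs), 3)
               for name, toks in docs.items()}
    print(term, weights)
```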
In their study, Kelih, Antic, Grzybek, and Stadlober (2005) analyse the word length of 190 Russian texts, composed of 95 letters and 95 poems. They seek to determine to what extent word length contributes towards distinguishing texts according to author and genre (Kelih et al, 2005:498). They claim that to measure word length in characters, as genre classification studies have tended to do, is not a suitable unit of measurement for many languages (Kelih et al, 2005:499). Rather, they suggest measuring word length in the number of syllables per word; thus redefining a word as an orthographical-phonological unit. CHAPTER 2 40 They obtain six variables, all representing word length frequency distributions (Kelih et al, 2005:501). Using discriminant analysis and experimenting on various combinations of these six variables they can achieve up to 89.5 percent accuracy (Kelih et al, 2005:504. As with Karlgren & Cutting (1994), this is on a training set only). The accuracy is obtained, using only two variables: the relative proportion of four-syllable words and the quotient of dispersion. The former of the two variables is strongly discriminatory, and used on its own, can achieve 76.3 percent accuracy (Kelih et al, 2005:504). This compelling evidence for word length being a powerful discriminator of genres, if defined as syllables per word, provides motivation for inclusion as a feature in future projects (this will be discussed further in Chapter 5). The evidence for word length as a good discriminator also provided some motivation for the inclusion of readability scores as features, as word length (in characters and syllables) forms part of the readability scores detailed in Chapter 3. Santini (2004a) examines ten genres of fifteen documents each from the BNC: conversation, interview, public debate, planned speech, academic prose, advert, biography, instructional, popular lore, and reportage. Her approach to genre classification is to investigate the discriminatory ability of syntactic analysis. In genre classification, syntax is currently only partially employed in POS. This approach, however, is a rather shallow one. In following Argamon et al. (1998), she suggests using POS trigrams as indicators of deeper syntactic features. Santini (2004a) explores this in her study, using four sets of features; 835 POS24 trigrams excluding punctuation, 1033 POS trigrams including punctuation, 65 POS trigrams derived from the first set and 74 POS trigrams derived from the second set. Using these four feature sets Santini (2004a) trains her genre classifier using a Naive Bayes classifier. Naive Bayes is a type of Bayesian learning based on Bayes' theorem (Jurafsky & Martin, 2000:148-149). It provides a way to calculate the probability of a hypothesis (in this case of a document belonging to a particular genre), based on its prior probability and the actual probabilities observed in data (Mitchell, 1997:156). The corpus was tagged using the Constituent Likelihood Automatic Word-tagging System (CLAWS5 tagger; Garside & Smith, 1997). 25 The classifier used is from Weka open source machine learning software, available at http://www.cs.waikato.ac.nz/ml/weka/index.html. LITERATURE REVIEW 41 On stratified random sampling 10-fold cross-validation she reports the following average accuracies: 78.9 percent on feature set one, 78.6 percent on feature set two, 88.9 percent on feature set three, and 84.1 percent on feature set four (Santini, 2004a:4). 
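A Naive Bayes classifier of the kind Santini (2004a) uses can be sketched as follows: each document is reduced to its sequence of POS tags, trigrams of tags are counted, and the class with the highest posterior probability under Bayes' theorem is chosen. The tag sequences and genre labels below are invented, and scikit-learn's MultinomialNB is used simply as a convenient stand-in for the Weka implementation she used.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Each 'document' is represented here by its sequence of POS tags (invented for
# illustration); the vectoriser counts tag trigrams, in the spirit of Santini (2004a).
docs = ["DT NN VBZ IN DT JJ NN", "DT JJ NN VBD DT NN IN NN",
        "PRP VBP PRP IN DT NN", "PRP VBD RB JJ IN PRP"]
genres = ["academic", "academic", "conversation", "conversation"]

model = make_pipeline(
    CountVectorizer(ngram_range=(3, 3), token_pattern=r"\S+"),  # POS trigram counts
    MultinomialNB(),
)
model.fit(docs, genres)
print(model.predict(["DT JJ NN VBZ IN DT NN"]))  # -> ['academic']
```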
The accuracies Santini (2004a:4) reports are given for the six written genres only, as these are the most relevant for this research project. After testing accuracy on POS trigrams, she conducts the same experiment using POS bigrams and unigrams, in order to establish whether performance using POS trigrams does indeed result in significantly improved classification. Analogous to the POS trigram experiment, the test on bigrams uses 451 POS bigrams without punctuation, 568 POS bigrams with punctuation, 36 POS bigrams derived from set one and 41 POS bigrams derived from set two. The classification accuracy on the written genres using POS bigrams is 76.1 percent on set one and 70.5 percent on set two. When some selection of bigrams is made, accuracy improves to 86.8 percent on both sets three and four (Santini, 2004a:5). In the POS unigram test, only two feature sets are used, due to lower feature frequencies. The first set is comprised of twenty unigrams excluding punctuation, and the second set is composed of twenty-four unigrams including punctuation. Accuracy on the former set is reported as 72.2 percent and on the latter set as 74.4 percent for the six written genres (Santini, 2004a:5). This overall good accuracy of POS influenced the use of POS in this project, which will be discussed in Chapter 3.

Wastholm et al. (2005) conduct a genre classification study using the Stockholm-Umea Corpus (SUC, 1997) of 500 Swedish texts, which, as with the German LIMAS corpus used in Wolters and Kirsten (1999), follows a division akin to the Brown Corpus.27 This corpus is tagged with the Preparatory Action for Linguistic Resources Organization for Language Engineering (PAROLE) tagset.28 Similar to Santini (2004a), they use a Naive Bayes classifier29 trained using lemmas, POS, POS plus subcategories (for example, proper noun) and complete PAROLE word classifications (for example, gender) (Wastholm et al., 2005:2). They run their experiments on nine upper-level genres (press reportage, press editorials, press reviews, skills, trades and hobbies, popular lore, biographies and essays, miscellaneous, learned and scientific writing, imaginative prose), using unigrams, bigrams and trigrams of each feature. Sentence markers were included for the bigrams and trigrams. Ten-fold cross-validation is used to refine their classifier, whilst fifteen percent of the texts are kept aside as testing data. On this test set the best reported results are overall error rates of 40 percent for the POS trigram set and 38.7 percent for the POS plus subcategories bigram set. The best performance for the lemmas set is a 53.3 percent error rate, which provides even more evidence for using linguistic data, rather than a BOW approach, for genre classification (Wastholm et al., 2005:3). Similar to Finn (2002), they suggest combining all feature sets for better classification performance (Wastholm et al., 2005:3).

In addition to such overall testing results, Wastholm et al. (2005) provide precision and recall measures for the nine genres (the sketch below illustrates how such per-genre measures are computed).

26 See Chapter 4 for a brief explanation of stratified random sampling.
27 For the complete categories, see Forsbom (2005:17).
28 Available at http://spraakbanken.gu.seAb/parole/.
29 The Perl script for this classifier can be downloaded at CPAN: http://search.cpan.org/_kwilliams/Algorithm-NaiveBayes-0.03/.
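The following is a brief sketch of how per-genre precision and recall are computed from a classifier's predictions; the gold and predicted labels are invented and serve only to illustrate the calculation.

```python
# Hypothetical gold-standard and predicted genre labels for six documents.
gold      = ["learned", "learned", "essay", "essay", "learned", "essay"]
predicted = ["learned", "essay",   "essay", "learned", "learned", "essay"]

def precision_recall(cls, gold, predicted):
    """Precision and recall for one class from parallel label lists."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == cls and p == cls)
    fp = sum(1 for g, p in zip(gold, predicted) if g != cls and p == cls)
    fn = sum(1 for g, p in zip(gold, predicted) if g == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

for cls in ("learned", "essay"):
    print(cls, precision_recall(cls, gold, predicted))
```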
The most relevant of these for this project are 33.3 percent precision and 25 percent recall on biographies/essays, and 63.6 percent precision and 77.8 percent recall on learned and scientific writing (Wastholm et al., 2005:3). The results on learned and scientific writing are in keeping with other studies but the results on biographies/essays are markedly poor. There are some potential reasons for this. The majority of the studies in automatic genre classification seek to make classifications of English texts; it can thus be assumed that the types of features discussed in these studies may not be equally successful if applied to other languages. It can also, however, be seen from some of the studies reviewed in this section that these features have been equally successfully applied to Greek and German. Language can thus not be put forth as a likely explanation for poor performance. Other potential explanations for performance are genre groupings, the sample size, and training and testing methodology. The last two can almost immediately be dismissed as being the most likely explanations. Training and testing methodology appear unlikely to explain LITERATURE REVIEW 43 poor performance as training, testing and validation are conducted scientifically, using the Naive Bayes algorithm. This algorithm has been used with good results by Santini (2004a), and it is therefore not probable that the training algorithm used can account for the poor performance. Size, too, is unlikely to explain poor performance because the corpus used by Wastholm et al. (2005) is composed of 500 texts, which is an adequate sample size. The most likely explanation is genre groupings, because the corpus follows the divisions of the Brown corpus. That these divisions are not always indicative of 'genre' has been discussed previously. This is a concern that Wastholm et al. (2005:3) address, which indicates that they are at least aware of this problem. Karlgren and Cutting (1994) were also rather unsuccessful in their classification of biographies, with an error rate of 65 percent. This provides further evidence in favour of clear definition of genre classes for automatic genre classification, as without such clarity the target of genre classification and desired outcomes are blurred. Boese (2005)3 focuses on web-genre classification, which has previously been mentioned as not relevant to this research project. However, as she examines many electronic versions of traditional genres, and uses an extensive list of linguistic features, her study is considered appropriate for review here. She examines ten genres (343 documents), collected from the web, in terms of three sets of features, which represent style, form and content respectively. Similar to Meyer zu Eissen and Stein's (2004) presentation-related features, form represents document layout and web-specific features, such as HTML tags, and is therefore not discussed here (Boese, 2005:29). The style feature set is (Boese, 2005:26): (A) Readability statistics31 1. Kincaid; 2. ARI; 3. Coleman-Liau; 4. Flesch Index; 5. Fog Index; 6. Lix; and 7. SMOG-Grading. 30 See also Boese and Howe (2005). 31 ARI and the Flesch readability statistics will be discussed further in Chapter 3. CHAPTER 2 44 (B) Sentence information (counts) 1. characters; 2. words; 3. sentences; 4. paragraphs; 5. syllables; 6. questions (counts and percent); 7. passive sentences (counts and percent); 8. sentences with at most 13 words (counts and percent); 9. 
sentences with at least 28 words (counts and percent); 10. average length in words; 11. longest sentence length; 12. shortest sentence length; 13. average length of words (in characters); and 14. average length of paragraphs (in sentences). (C) Word usage (counts and percent) 1. conjunctions; 2. pronouns; 3. prepositions; 4. nominalisations; and 5. verb types (counts only). (D) Sentence beginnings (counts) 1. pronouns; 2. interrogative pronouns; 3. articles; 4. subordinating conjunctions; 5. conjunctions; and 6. prepositions. Many of the content features relate to web-specific genres and formats. Content features relevant to this study are (Boese, 2005:31): (A) the top fifty common words in the BNC (derived from Stamatatos et al., 2000a); (B) the most frequent words across corpus; LITERATURE REVIEW 45 (C) the most frequent words from each genre; (D) a stopword list of 430 words; and (E) twenty-six punctuation types. Using different combinations of these features, Boese (2005) trained a genre classifier using the LogitBoost algorithm, which makes use of additive logistic regression (see Kessler et al., 1997). On stratified 10-fold cross-validation, the best overall accuracy reported is 92.1 percent, making use of a set including all features except term frequency (Boese, 2005:50). On this same set, the classifier performs very well, achieving 100 percent accuracy for the genre most significant to this project, technical paper (33 documents), which refers, among others, to scholarly articles (Boese, 2005:52). Such high accuracies are certainly impressive. However, it must be noted that form features, such as HTML tags, play a large discriminatory role for web classification (Boese, 2005:50). For this research project, which is concerned with a traditional genre, it is rather more useful to consider the accuracies achieved on relevant style and content features. The overall accuracy obtained on the twenty-six punctuation types is 44.6 percent, on the style features is 55.4 percent, and on the top fifty common words in the BNC is 62.4 percent (Boese, 2005:50). These results indicate that punctuation can be a useful discriminator if used in combination with other features (see also Stamatatos et al., 2000a), and further, confirm the findings of Stamatatos et al. (2000a) regarding the effectiveness of the most common words of a language. This section has reviewed the most relevant genre classification studies to this project, in order to establish common practice regarding corpus, linguistic features and training methodology. The review has covered studies on Greek, German, Korean, Russian and Swedish, but it is clear that automatic genre classification studies focus on English in the main. This is favourable for this project, which is also concerned with English, in that comparisons can be more equally made between such studies. Therefore, their features, methodology, and findings can be extended to this research project. CHAPTER 2 46 Previously, in this section, it was noted that there is a lack of standards in automatic genre classification regarding, among others, a benchmark corpus for training and testing. This much is clear from the studies reviewed; however, some trends regarding corpora are evident. An example of such a corpus-related trend is that similar kinds of 'genre' divisions occur in various projects. From this, it would seem that one of the main problems of automatic genre classification is that target genre classes are not clearly defined. 
This is partly indicative of genre itself not being clearly defined from the offset. Clarity in terms of both genre and target genre classes is required to define the genre classification task, to ensure that the results of the task can be sensibly interpreted. This a priori definition also ensures that the classification task is indeed one of genre classification and not of, for example, content-based classification or register classification. The main trend is to make use of features that are easily extracted or easily computed from the corpus. This results in many features, which are mainly POS, function words (often derived from stopword-lists), keywords, punctuation, readability statistics, and lexical information, such as: average sentence length (in words), average word length (in characters), type/token ratio, total sentence count, total character count, and long words. Behind these features is a preoccupation with proving POS and other linguistic features more discriminatory for genre classification than a BOW approach. There is some evidence in favour of deeper-level features (see Stamatatos, Fakotakis & Kokkinakis, 2000b; also Santini, 2004a) but as these are manually expensive, and because surface features perform well, they are still not widely used. Similar to the need for the a priori definition of genre and target genre classes, features need to be examined before classification in terms of their potential discriminative ability, in order to identify and remove irrelevant features, which can be unhelpful in training. This too does not appear to be widely practiced (see Wolters & Kirsten, 1999; and Stamatatos et al., 2000b), possibly because it is time-consuming. The methodologies used in training classifiers on the basis of selected features differ widely: factor analysis (Biber, 1988), discriminant analysis (Karlgren & Cutting, 1994; Stamatatos et al., 2000a; Stamatatos et al., 2000b; Kelih et al., 2005), ^-nearest neighbour (Wolters & Kirsten, 1999), multiple regression (Stamatatos et al., 2000b), LITERATURE REVIEW 47 logistic regression (Boese, 2005), decision-tree learning (Finn, 2002), and Naive Bayes (Santini, 2004a; Wastholm et al., 2005). From this list it can be seen that the most popular techniques used in these studies are discriminant analysis and Naive Bayes. SVMs are excluded from this list, as studies that make use of SVMs are reviewed in the next section. These studies are reviewed separately because they make use of the machine learning approach adopted in this research project and are therefore to be considered on their own. 2.3.3 Overview of genre classification studies that make use of support vector machines This section discusses five genre classification studies, relevant to this research project in terms of both linguistic features and, more important, methodology. In a similar manner to Section 2.3.1 and Section 2.3.2, studies are reviewed individually and in chronological order. Dewdney et al. (2001) conduct a genre classification study of seven web-specific genres, using the Carnegie Mellon University (CMU) genre corpus (9705 documents). These genres are not relevant to this research project, and are thus not discussed here, but it is important to note that Dewdney et al. (2001) define their intended meaning of genre. They define genre as "a set of conventions in the way in which information is presented" (Dewdney et al., 2001:1). Their notion of genre is too vague to agree entirely with what is meant with genre for this project. 
Despite their genre types not being relevant to this project, the features and machine learning techniques used are relevant. Dewdney et al. (2001) use two feature sets: the first is a set of eighty-nine linguistic and layout features, and the second is a set of 323 words. Unfortunately, they do not detail all their features, but provide a broad outline (Dewdney et al., 2001:4). From this outline, it can be seen that they make use of the following linguistic features, among others (Dewdney et al., 2001:4): (A) POS; (B) closed-word sets, such as days of the week, months of the year, and signs of the zodiac; (C) punctuation; (D) the mean and variance of sentence length and word length; (E) combined sentence length, word length and syllable estimates to provide measures of sentence complexity; (F) the Flesch readability metric; and (G) mean word length divided by mean sentence length. As previously mentioned, this project is concerned with linguistic features; thus the layout features (for example, whitespace, line spacing, and tabulation) used by Dewdney et al. (2001) are not relevant to this project. They train three classifiers, using three different kinds of machine learning algorithms: a Naive Bayes classifier, a decision-tree classifier (Quinlan's (1993) C4.5 decision-tree learner was used), and an SVM (Radial Basis Function kernel) classifier (the implementation used is SVMlight; Joachims, 1999a). SVMs as a machine learning technique will be described and discussed in Chapter 3. Each classifier is trained using each feature set in isolation and then a combination of both feature sets. The classifier trained with SVMs performs the best on the two individual sets and on the combination set (Dewdney et al., 2001:1). Since none of the genre classes is directly relevant to this research project, the results are reported in terms of overall accuracy. The following results on 10-fold cross-validation are reported, as they are the most significant for this research project. On the set of linguistic and layout features, an accuracy of 83.4 percent, a recall of 83.6 percent, a precision of 90.1 percent and an F1 measure of 86.7 percent are achieved. On the word frequency features, an accuracy of 81.7 percent, a recall of 81.8 percent, a precision of 88.4 percent and an F1 measure of 85 percent are achieved. On the combination set, the SVM classifier performs with an accuracy of 83.6 percent, a recall of 84 percent, a precision of 94.9 percent and an F1 measure of 89.1 percent (Dewdney et al., 2001:7). From these results it can be seen that the classifier performs best if the words and the linguistic features are used in combination. As mentioned in Section 2.3.2, Finn (2002:76) concludes that a combination of feature sets is likely to produce the most favourable results. This has been demonstrated by Dewdney et al. (2001) and provided further motivation for combining the features used in this project, which are representative of linguistic features and the BOW approach. Argamon and Dodick (2004a; see also Argamon and Dodick, 2004b) undertake a genre classification study, more akin to this research project, from a more clearly defined linguistic approach than any other genre classification study. They approach genre classification from an SFL (systemic functional linguistics) perspective, examining systemic preferences across what they term "scientific genres" (Argamon & Dodick, 2004a:2).
In particular, they wish to examine writer's stance towards assertions in the text and cohesive strategies in two journal genres: two paleontological journals, Palaios and Quaternary Research (222 documents); and two physical chemistry journals, Journal of Physical Chemistry A and Journal of Physical Chemistry B (238 documents). At first glance, these genres may perhaps seem contrived and relating to content rather than genre (see the examples of genres presented in Section 2.2; also Couture, 1986:87). Argamon and Dodick (2004a), however, demonstrate the contrary. These two journal genres are indeed distinguishable, based on 101 features. The features are frequency counts of keywords and phrases of a functional nature, which were assembled independently of the corpus and therefore, less likely to overfit the training data. Argamon and Dodick (2004a: 3-4) examine these features in terms of systems and subsystems: conjunction (elaboration, extension, and enhancement), modality (type, value, orientation, and manifestation), and comment (admissive, assertive, presumptive, desiderative, tentative, validative, evaluative, and predictive). Argamon and Dodick (2004a) use a SVM (linear kernel)35 to train the classifier.36 They report accuracies (on 20-fold cross-validation) of between 83 and 91 percent in distinguishing between the two journal genres. These good accuracies have two implications for this research project. Firstly, the study demonstrates the good performance of SVMs in genre classification and thus motivates the use of this machine 34 See also Argamon and Dodick (2004b). See Chapter 3 for more on kernels. The classifier was trained using SMO learning algorithm (Platt, 1999) implemented in the Weka package (Witten & Frank, 1999). CHAPTER 2 50 learning technique. Secondly, the good discriminatory ability of the features used by Argamon and Dodick (2004a) provide additional encouragement for the inclusion of many of these words and phrases as features (see Chapter 3 for more on features used in this project and their motivation). Argamon and Dodick (2004a) provide proof that rhetorical differences can be extracted, using easily computed linguistic features. They also show that the features used for automatic genre classification need not be dichotomous, that is, BOW versus purely linguistic features, but that words and phrases indicating underlying structure can be used to great success. Meyer zu Eissen and Stein (2004) conduct a genre classification study, using eight genres, seven of which are web-specific (800 documents). They do not provide a clear definition of what they mean by genre but give examples of genres, such as letter and editorial, that agree with the usage of the term in this research project. Yet, it appears that they do not use genre in the same way as is used for this research project. This is because their web-'genres' seem to rather be webpage types, with the exception of help and articles. Even, articles, although superficially appearing to be a genre, is not a single genre because it includes research articles, reviews, technical reports, and book chapters. Nevertheless, this 'genre' is the one that is the most applicable to this research project. Articles consists of 100 documents that were used for training and testing. Meyer zu Eissen and Stein (2004) distinguish their 'genres' based on four feature types (mainly normalised counts) that they combine into two sets according to computational ease. They are (Meyer zu Eissen & Stein, 2004:264-265): (A) Set one 1. 
Presentation-related features, which are HTML-specific and represent the appearance of a document (these presentation features are not detailed here, as they are not relevant to this project; see Meyer zu Eissen and Stein, 2004:264). 2. Closed-word sets a. average word frequency class; b. currency symbols; c. help symbols; d. shop symbols; e. date symbols; f. first names; g. surnames; and h. words that do not appear in Webster's dictionary. 3. Text statistics a. punctuation: question marks, colons, semicolons, dots, commas and exclamation marks; b. letters; and c. digits. (B) Set two 1. POS (the corpus was tagged using the part-of-speech tagger of the University of Stuttgart, available at http://www.ims.uni-stuttgart.de) a. nouns; b. verbs; c. relative pronouns; d. prepositions; e. adverbs; f. articles; g. pronouns; h. modals; i. adjectives; and j. alphanumeric words (this last feature is not clearly related to POS but is grouped as such in the original). They use the two feature sets separately in each experiment, in order to establish whether the discriminatory gain of Set Two outweighs its computational (and time) expense (Meyer zu Eissen & Stein, 2004:264). They initially explored the classification performance of these features using discriminant analysis (Meyer zu Eissen & Stein, 2004:266). Their genre classifier was trained using neural networks and SVMs. On average, for one-against-all classification (a technique used for multiclass SVMs; refer to Schölkopf and Smola, 2002:211-212, for more on this), they report an accuracy of 70.7 percent for Set One and an accuracy of 72.7 percent for Set Two. These scores are for the classifier trained using SVMs only, as they do not report accuracies for the neural network learner, noting only that SVMs have the more accurate performance (Meyer zu Eissen & Stein, 2004:266). This conclusion and the superior performance of Set Two motivated the use of SVMs and POS features in this research project. Aires, Aluísio, and Santos (2005) conduct a genre classification study, using the Brazilian Portuguese Lácio-Ref corpus (4278 documents; for more on this corpus see Aluísio, Pinheiro, Finger, Nunes and Tagnin, 2003). This corpus is divided into five 'genres' and thirty 'text-types'. Their use of text-type corresponds to the meaning of genre in this project, while their use of genre does not; for example, they consider poem and dissertation as instances of text-types, and scientific and instructional as instances of genres. They use forty-six features, derived from Biber (1988) and Karlgren (2000), detailed below (Aires et al., 2005:2-3). (A) Word-based statistics 1. type/token ratio; 2. capital type/token ratio; 3. average word length in characters; and 4. long words (more than six characters) count. (B) Text-based statistics 1. character count; 2. average sentence length in characters; 3. average sentence length in words; 4. sentence count; and 5. text length in words. (C) Other statistics 1. subjective markers; 2. Portuguese-specific words, such as que and se; 3. discourse markers; 4. wh-questions; 5. amplifiers; 6. downtoners; 7. emphatics; 8. verbs: suasive, private and public (these verbs will be explained in Chapter 3); 9. articles: definite and indefinite; 10. pronouns: first person, second person, third person, demonstrative, indefinite, and pronominal expressions; 11. prepositions; 12. adverbials: place and time; 13. adverbs; 14. interjections; 15. contractions; 16. conjuncts; and
17. conjunctions: causative, final, proportional, temporal, concessive, conditional, conformative, comparative, and consecutive. Using these features, Aires et al. (2005) train three classifiers using three different learning techniques: SVMs (the SMO algorithm, which, like the classifier of Argamon and Dodick, 2004a, is implemented in Weka), decision trees (J48, a Weka implementation of C4.5) and logistic regression (the LMT algorithm; Witten & Frank, 1999). The most relevant results for this research project are those reported for the SVM classifier on the 'text-types', because this corresponds more closely to genre in this project. On 10-fold cross-validation of these 'text-types', Aires et al. (2005:3) report a 55 percent precision, a 91 percent recall and a 69 percent F-measure, using the SVM classifier. In Chapter 4, it will be seen that the most relevant evaluation measure for this research project is recall. Comparison to the decision-tree classifier, which achieves a 67 percent recall, and the logistic regression classifier, which achieves a 74 percent recall, shows that the SVM classifier outperforms these two classifiers. Aires et al. (2005) clearly demonstrate the usefulness of linguistic features for genre classification, not only for English, as many studies have done, but also for Portuguese. In addition, the good performance of SVMs as a technique for training a classifier is confirmed. In a follow-up study to Santini (2004a), reviewed in Section 2.3.2, Santini (2005b) further investigates the possibility of deeper syntactic structures as good discriminatory features for genre classification. In this study, she clearly defines what she means by genre and text-type; moreover, she uses both these terms in agreement with what is meant in this research project (Santini, 2005b:3). She uses two types of features, which she terms "linguistic facets" (Santini, 2005b:3). They are functional and syntactic cues, which, as with those of the majority of automatic genre classification studies, can all be automatically extracted. The functional cues are extracted using the Connexor parser (Tapanainen & Jarvinen, 1997). These functional cues are (Santini, 2005b:8-18): (A) predicators; (B) complex NPs; (C) nominals; (D) pronouns: first person, second person, third person, and third person singular inanimate; (E) present tense group; (F) past tense group; (G) imperatives; (H) active; (I) passive; (J) negative particles; (K) existential there; (L) expressiveness; (M) markers: time, location, instrument, probability, necessity, and manner; (N) verbs: activity, communication, mental, causative, occurrence, existence, and aspectual; and (O) connectives: enumerative, equative, reinforcing, summative, appositive, resultative, inferential, reformulatory, replacive, antithetic, concessive, discoursal, and temporal. Her work is very comprehensive, demonstrating each cue type with an example. The twenty-five types of syntactic cues represent syntactic patterns created by finding common patterns in the parser output, and then writing an algorithm to detect these patterns, using regular expressions (Santini, 2005b:19-22; see Chapter 3 for more on the use of regular expressions in this research project, and Friedl, 1997, for more on regular expressions generally). They are (Santini, 2005b:18-38): (A) Adverbial clauses: concession clause, conditional clause, contrast clause, exception clause, purpose clause, reason clause, result clause, similarity manner comparison clause, space clause, and time clause.
(B) Complement/nominal clauses 1. verb+that clause; 2. adjective+that clause; 3. that omission; 4. wh-clause; 5. verb+to clause; 6. adjective+to clause; 7. verb+ing clause; 8. comparative clause; and 9. relative clause. (C) Simple sentences: phenomenon registering, action recording, phenomenon identifying, phenomenon linking, quality attributing, and action demanding. Using eighty-four linguistic facets of the types listed above, as well as 211 POS trigrams, Santini (2005b) trains two SVM classifiers (again, the training is done using the Weka package; Witten & Frank, 2000) to distinguish between four web-specific genres of 200 webpages each (no more detail is provided regarding the corpus). The average accuracy she reports over ten test sets, using ten different seeds, is 84.28 percent on the linguistic facet set and 86.50 percent on the POS trigram set (Santini, 2005b:38-39). Santini (2005b), similar to Argamon and Dodick (2004a), is linguistically very thorough and shows that surface words can be indicative of deeper structure. This study (Santini, 2005b) provided two strong influences for this project. Firstly, the high accuracy on set one provides good evidence for linguistic features as genre indicators, especially considering that the set used was a small feature set (Santini, 2005b:39). Secondly, the performance of Santini's (2005b) classifiers gives further proof of the good performance of SVMs as a machine learning technique. This section has discussed five automatic genre classification studies that use SVMs to develop genre classifiers. These studies show that, when compared with classifiers trained on the same data but using different learning techniques, SVMs outperform the other learning techniques. Furthermore, two findings are relevant for this project. First, Dewdney et al. (2001) demonstrate that words, that is, the BOW approach, and linguistic features can produce better results when used in combination than when used on their own. Second, Argamon and Dodick (2004a), and Santini (2005b) show that the features used for automatic genre classification need not be dichotomous, that is, BOW versus purely linguistic features, but that words and phrases can be indicative of deeper linguistic structure and as such can be used with much success.
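The evaluation setup that recurs across these five studies, an SVM classifier assessed with k-fold cross-validation and reported in terms of accuracy, precision and recall, is also the broad setup adopted in this project, although the project itself uses the STATISTICA Support Vector Machine module rather than hand-written code. Purely as an illustrative sketch of that setup, with a randomly generated feature matrix and invented labels standing in for real feature counts and essay grades:

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)

# Placeholder data: 100 'documents' with 20 feature counts each, and invented binary labels.
X = rng.poisson(lam=3.0, size=(100, 20)).astype(float)
y = rng.integers(0, 2, size=100)          # 0 = 'bad' example, 1 = 'good' example (invented)

# Linear-kernel SVM; the error penalty C is discussed further in Chapter 3.
classifier = SVC(kernel="linear", C=32)

# 10-fold cross-validation, then per-class precision and recall, as in the studies above.
predictions = cross_val_predict(classifier, X, y, cv=10)
print(classification_report(y, predictions, target_names=["bad", "good"]))

With random data the report will, of course, hover around chance; the point is only to make the cross-validated precision and recall reporting used throughout this review concrete.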
2.4 Summary This chapter commenced with an illustrated explanation of the notions of machine learning and supervised learning, and introduced the basic notation used in this project for learning by examples. These concepts were placed in the framework of automatic genre classification, and the task of genre classification for this project was explained. Thereafter, the use of genre in this project was defined, and the genre examined in this project was identified in terms of prototypicality. Next, a review of automatic genre classification was presented, in order to detail the background to the features and methods used in this project. Each study was presented separately, for two reasons: firstly, because of variation between data, features, application, and rationale; and secondly, because of the lack of standards in automatic genre classification, regarding a benchmark corpus, evaluation metrics, baseline performance and methodology. However, where possible, comparisons were made between the projects reviewed. In addition, each statistical technique used in the reviewed studies was explained, and the value of the studies emphasised, with particular focus on the value of the linguistic features used for this project. Moreover, these studies were critiqued in terms of the validity of pre-defined genre classes, results, evaluation measures, and features. In this review, previous relevant work in automatic genre classification and seminal works in the field were first reviewed. The most important study reviewed was Biber's (1988) seminal text variation work, which forms the foundation from which many automatic genre classification studies have sprung. Thereafter, more current automatic genre classification studies were reviewed, with detailed reference to projects that are relevant and have a similar purpose to this research project, regarding application, corpus, features, and method. Finally, those studies that use SVMs for machine learning, for the purposes of genre classification, were detailed. In spite of a lack of standards in automatic genre classification regarding, among others, a benchmark corpus for training and testing, some trends regarding corpora are evident. For example, the same kinds of 'genre' divisions occur in various projects, mainly because the same corpora are used (such as the Brown Corpus). These divisions are often not based on genre distinctions but rather on topic or discipline distinctions. It can be seen that poorly defined genre classes stem from an unclear definition of genre at the outset. At times, this results in low classification accuracies, and it also makes it difficult to deduce the reasons for low accuracy (for example, Karlgren & Cutting, 1994). It is thus important to define genre clearly before classification, and furthermore, to have clear target genre classes. This will mean that the object of classification is elucidated, and that results can be interpreted in the light of the genre class. Such a priori clarification of the genre task ensures that the classification task is indeed one of genre classification and not of, for example, content-based classification. The studies reviewed in this chapter mainly use features that are easily computed and extracted from the corpus. These features are mainly POS, function words (often derived from stopword lists), keywords, punctuation, readability statistics, and text statistics, such as average sentence length (in words), average word length (in characters), type/token ratio, total sentence count, total character count, and long words. Underlying this choice of features is a concern with proving POS and other linguistic features to be better genre discriminators than a BOW approach. There is also some evidence in favour of deeper-level features (see Stamatatos, Fakotakis & Kokkinakis, 2000b), but these are still not widely used, because they are manually expensive. Wolters and Kirsten (1999) and Stamatatos et al. (2000b) show that features need to be examined before classification to determine their discriminative ability. Such examination needs to be done in order to remove irrelevant features, which can be unhelpful in training. Pre-classification exploration does not appear to be widely practiced, possibly because it is time-consuming.
A variety of learning methodologies are used in training classifiers, the most popular techniques of which are discriminant analysis (Karlgren & Cutting, 1994; Stamatatos et al, 2000a; Stamatatos et al, 2000b; Kelih et al, 2005), SVMs (Dewdney et al, 2001; Argamon & Dodick, 2004a; Meyer zu Eissen & Stein, 2004; Aires et al, 2005; Santini, 2005b) and Naive Bayes (Santini, 2004a; Wastholm et al, 2005). The following concerns, relevant to this project were raised in the chapter: cross-domain accuracy, training sample size, evaluation of the discriminant abilities of features, normal distribution, parametric tests, and non-parametric tests. These will all be addressed in terms of this project in Chapters 3 and 4. Furthermore, based on the results obtained on SVMs and on various linguistic features in the studies that were reviewed, particular linguistic features were included in this project. The features that were particularly emphasised in this chapter were POS, the top fifty words of the BNC, punctuation, and various function words. These features will be discussed further in Chapter 3, which will detail the steps followed in developing the SVM classifier, regarding the training and testing corpus, feature selection, feature extraction and the learning methodology used to train the genre classifier. DEVELOPING THE CLASSIFIER 59 CHAPTER 3 Developing the classifier Features do not randomly co-occur in texts. If certain features consistently co-occur, then it is reasonable to look for an underlying functional influence that encourages their use (Biber, 1988:13) 3.1 Introduction The aim of this project is develop a classifier that distinguishes between 'good' and 'bad' examples of argumentative essays. It must learn to do so, using the assessment of each essay, the dependent variable, according to human evaluators (the original markers of the essays), and linguistic features, the independent variables, (henceforth, only features) of the text. In general, certain steps need to be followed, in order to develop such a classifier. First, a learning method must be selected by which to train the classifier (the LM ), to distinguish between 'good' and 'bad' examples of the argumentative essay genre. Second, a corpus, which will provide the essays and thus the data for the classifier, is required. Third, the salient and relevant features that will allow the classifier to categorise 'good' and 'bad' examples of argumentative essays must be selected. Fourth, the corpus must be prepared in various ways, for example, the essays must be converted to text format. Fifth, the features that are to be extracted must be annotated. Sixth, the features must be extracted from the essays, in order to gather frequency counts on all the features in each text. Seventh, the data must be prepared before classification in different ways, for example, the data set must be transformed. Lastly, the classifier needs to be trained and tested, using the features and learning method selected. Following steps one to eight, this chapter details all the information relating to the corpus, feature selection and extraction, and learning methodology. The chapter commences with Section 3.2, which discusses the corpus and provides the background to the selection of the essays used. In Section 3.3, the features used for training the classifier are described. In Chapter 2, many of these features were already described in the context of previous research. 
Hence, only some necessary details as to the reasoning for the selection of each feature, and examples of studies which make use of each particular feature, are given. Thereafter, Section 3.4 delineates the processes of text preparation before feature extraction. Next, Section 3.5 discusses the annotation of various features of the essays, in terms of POS tagging and XML mark-up. After providing a description of annotation, the preparation of the data before classification is described in Section 3.6, in terms of the standardisation of the dependent variable, the removal of multiple occurrences of terms, data transformation, and feature selection. Lastly, the machine learning technique used to train the classifier is presented and explained in Section 3.7. 3.2 The corpus The corpus is made up of 346 untimed essays (813,987 words) from a section of the British Academic Written English (BAWE) corpus (Nesi, Sharpling & Ganobcsik-Williams, 2004), which are freely available to researchers, but cannot be reproduced here. The corpus was developed at the Universities of Oxford Brookes, Reading and Warwick under the directorship of Hilary Nesi (Warwick); corpus development was assisted by funding from the ESRC (RES-000-23-0800), and more details can be found at http://www.warwick.ac.uk/go/bawe/. All the essays are written in Standard British English by native speakers. Furthermore, these essays are all instances of argumentative essays written by students within an academic context. They are drawn from across the faculties and represent essays from many of the disciplines, except Engineering, and from most subjects, except Chemistry, as argumentative writing was not present among these essays/assignments (for a complete list of departments/courses, please see Appendix 1). For purposes of further research, a list identifying which essays were used in the research project, and their respective grading, is available in Appendix 1. In this section, the necessary background to the corpus used for this research project has been delineated. The next section discusses the features used in training the genre classifier. 3.3 The features The features used to train the learner to distinguish between 'good' and 'bad' examples of the argumentative essay genre were derived from previous automatic genre classification studies (particularly Biber, 1988), reviewed in Chapter 2. The feature selection was also informed by two well-known English grammar books: Biber et al. (1999), and Quirk et al. (1985). Furthermore, many of the features were selected from the literature relating to academic discourse in general, chiefly Biber et al. (2004). The features were extracted, using STATISTICA Text Mining and Document Retrieval (Statsoft, 2006). The text miner indexes all words found in the input documents, and computes a table of documents and words: a frequency matrix enumerating the number of times each extracted word occurs in each document. If all the words in a document are not to be indexed, it has an option for specifying the words and phrases to be extracted. It also provides an option for extracting only those words that occur in a selected percentage of files, and provides a stemming function that is particularly helpful in extracting the different word forms of verbs.
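To make the idea of this frequency matrix concrete, a minimal Python sketch is given below. It is an illustration of the representation only, not of the STATISTICA module itself: the word list, the file names and the prefix-matching stand-in for stemming are all invented for the example.

import re
from collections import Counter

word_list = ["however", "therefore", "suggest", "argue"]   # invented extraction list

documents = {
    "essay_001.txt": "The author argues that the data suggest otherwise. However, this is contested.",
    "essay_002.txt": "Therefore the claim holds, and the results suggest a clear trend.",
}

def frequencies(text, words):
    # Raw frequency of each listed word in one document; prefix matching is a crude
    # stand-in for the stemming option described above.
    tokens = Counter(re.findall(r"[a-z']+", text.lower()))
    return {w: sum(count for token, count in tokens.items() if token.startswith(w)) for w in words}

# One row per document and one column per listed word: a small document-by-word frequency matrix.
matrix = {name: frequencies(text, word_list) for name, text in documents.items()}
for name, row in matrix.items():
    print(name, row)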
Once the raw word frequencies have been computed, STATISTICA Text Mining and Document Retrieval module (Statsoft, 2006) offers three transformations: logarithmic frequencies, binary frequencies and inverse document frequencies. A list of all the extracted features, as well as some examples of each feature type, follows below. 3.3.1 Parts-of-speech Words that are alike in grammatical function are categorised into groups such as verbs, nouns, adjectives, adverbs, and prepositions. These word classes are traditionally referred to as parts-of-speech (Halliday & Matthiessen, 2004:50-51). Parts-of-speech have been used in many automatic genre classification projects with good results; see for example, Finn (2002); Biber (1988); Karlgren and Cutting (1994); also Kessler et al. (1997). More information on POS tagging and the tagset used is given in Section 3.5. 53 All complete feature lists can be found in Appendix 2. 54 Stemming refers to the reduction of words to their roots, in order to count the different word forms of the words to be extracted, as instances of those words (Statsoft, 2004). CHAPTER 3 62 3.3.2 Punctuation marks Terminal punctuation (full stops, question marks, and exclamation marks) and clausal punctuation (semi-colons and colons) were counted. The exclamation mark is a surface cue indicating emotive sentences and imperatives. Similarly, full stops indicate declarative sentences, and question marks indicate interrogative sentences. Interrogatives may prove to be a relevant feature for academic writing because, according to Hyland (2002a:530), direct questions are used by writers of academic prose to engage readers and draw them into their argumentation and line of reasoning. Little attention has been paid to direct questions in academic writing, because they are perceived to be rarely used in this discourse (Hyland, 2002a:530). Hyland (2002a:530) however, finds that despite their relatively low frequency in comparison to their frequency in conversation, they do occur in academic writing. In addition, semi-colons and colons may prove to be of relevance as they have been found to be indicative of complex sentences in previous research (Rauber & Miiller- Kbgler, 2001:5). As with parts-of-speech, punctuation has also been used in many automatic genre classification studies. Examples of these studies include Boese and Howe, (2005), Forsbom (2005), Lee and Myaeng (2004) and Stamatatos et al. (2000b). In punctuation mark-up, commas were excluded because some manipulation of commas was required during sentence mark-up (see the mark-up scripts on the accompanying CD). Punctuation symbols such as $ and £ were not considered as they did not occur frequently in the essays being examined. More details regarding the mark-up of punctuation marks is given in Section 3.5. 3.3.3 Quotations Many automatic genre classification studies that use punctuation as features make use of counts on single and double quotation marks as punctuation marks, such as, Lee and Myaeng (2004) and Bisant (2005). Others, for example, White, Cardie, and Ng (2002) gather counts on the presence of quotations. It appears that no automatic genre classification studies accord quotations special status. DEVELOPING THE CLASSIFIER 63 In this study, however, more information on quotations was gathered: quotations were categorised and counted as two different types: integrated and non-integrated (following Nelson, 2002). 
Integrated quotations were regarded as those which were incorporated into the quoting author's sentence (refer to Section 3.3.5.5 for a definition of sentence). Those, generally longer, quotations that were not incorporated into the quoting author's sentence, and which, therefore, made up a sentence or sometimes a paragraph on their own were considered non-integrated quotations. This study considered quotations and the verbs associated with them (see reporting verbs in Section 3.3.6.4) as potentially helpful features for distinguishing between prototypical and non-prototypical argumentative essays because, according to Hyland (2002b: 116), reporting is an important convention of academic writing. Reporting shows the following principal quoting structures: direct quotation, paraphrase, summary, and generalisation (Hyland, 2002b: 116). As this project is concerned with features that are easy to extract, direct quotations and references are extracted. It is assumed that references are indicative of the remaining three quoting structures given by Hyland (2002b: 116): paraphrase, summary, and generalisation. 3.3.4 Nominalisations According to Halliday & Matthiessen (2004:656), nominalisation is the "single most powerful resource for creating grammatical metaphor". Grammatical metaphor is "associated with the discourses of education and science" (Halliday & Matthiessen, 2004:636). It refers to one type of process (mental, behavioural, verbal, relational, and existential) being presented as another type of process. Grammatical metaphor can also refer to two or more types of processes being represented as a single process, which is not of the same type of either of the original processes (Halliday, 1987:76). Grammatical metaphor serves to background information, and is important for establishing context at the discourse level (Halliday, 1987:78). Its widespread use in writing thus deemed it a potentially relevant feature for this project. Only nominalisations representative of grammatical metaphor were considered here because they can be extracted automatically. More specifically, nominalisations are a kind of ideational metaphor where, for example, processes and qualities are represented as if CHAPTER 3 64 they were entities (Halliday & Matthiessen, 2004:637; see also Biber et al, 1999:322 for more examples), for example rationalisation and demonstration. In addition to the role of nominalisations as grammatical metaphor, Chafe and Danielewicz (1987:98) claim that nominalisations are a linguistic device used to enlarge the size of intonation units. Along with nominalisations, prepositions and attributive adjectives, two linguistic devices that also effect intonational unit size, are "unusually frequent" in academic writing. These features are similar to Chafe's (1982) notion of integration (integration/fragmentation dimension), which is characterised by features that "pack information into a text" (Biber, 1988:21), such as nominalisations, prepositional phrases and attributive adjectives. In addition to this dimension, Biber (1988:21) proposes the detachment/involvement dimension (based on Chafe, 1982), in which detachment, intuitively expected to characterise writing in an academic context,55 is also typically marked by nominalisations. The above arguments for the important role of nominalisations, both at the sentence and discourse level, deemed nominalisations a potentially genre-revealing feature. 
Counts on nominalisations were extracted by matching nouns with suffixes -tion, -ment, -ness, and -ity (Biber et al, 1999:323). These particular nominalisations were selected for two reasons: firstly, because of their high occurrence in academic discourse in the Longman Spoken and Written English (LSWE) corpus (Biber et al., 1999:322); and secondly, because they are suffixes that are, in the main, used to create abstract nouns. Extracting abstractness is relevant for this project as it is one defining characteristic of written discourse (Biber, 1988:47).56 Written academic discourse, in particular, has been found to be highly abstract (Biber, 1986). Biber (1986:393, 395) finds that nominalisations, prepositions and passives57 tended to co-occur and interprets these as representative of highly abstract texts (see Biber, 1986:395 for an intuitively attractive discussion of the causal relationship between these three constructions). Chafe (1982) describes academic writing as being characterised as detached and integrated. 56 See Blankenship (1974) and Chafe and Danielewicz (1986). 57 Passives are not used as features here, because they could not be easily automatically extracted (see for example Biber, 1988). DEVELOPING THE CLASSIFIER 65 3.3.5 Text statistics 3.3.5.1 Word count Word counts were extracted, making use of Oxford WordSmith Tools 4.0 (Scott, 2004). What constitutes a word was mainly determined by SVMTagger , as the number of POS tags and the token count of each document had to be kept the same. Therefore, numbers were considered and counted as words; amounts given with their currency in numerals, for example £2, were considered one word; and hyphenated words were considered one word. An exception to keeping POS tags and tokens equal in one document were apostrophes. SVMTagger tags apostrophes indicating possession separately, but not apostrophes indicating omission, for example, the student's essay — student NN 's POS. Therefore, apostrophes were allowed in words and at the end of words. This feature was not used directly but rather as a calculation of sentence length and, as is shown in Section 3.6, for normalising feature counts to text length. 3.3.5.2 Word length Word length is one of the most commonly used features in automatic genre classification (see for example Aires et al., 2005; Bisant, 2005; Braslavski & Tselischev, 2005; Dewdney et al., 2001; Finn, 2002; Karlgren & Cutting, 2004; Kessler et al., 1997; and Santini, 2006b). The average word length in characters of each essay was extracted, making use of Oxford WordSmith Tools 4.0 (Scott, 2004). 3.3.5.3 Long words Biber (1988) asserts that long words have "more specific, specialised meanings" than shorter words. Long words, which are defined in the literature as words longer than five, six or seven characters, have been found useful in automatic genre classification studies, for instance Aires et al. (2005), Finn (2002), Karlgren and Cutting (2004). This study considered those words longer than six characters. The raw frequencies of long words in This POS tagger is discussed in Section 3.5.1. CHAPTER 3 66 each essay were extracted and counted, using Oxford WordSmith Tools 4.0 (Scott, 2004). 3.3.5.4 Type/token ratios (TTR) TTRs provide a measure of the number of different types of words in a text and thus potentially measure the "richness of vocabulary" (Chafe & Danielewicz, 1987:91) of texts. 
Since academic writing displays high lexical diversity, this measure can prove to be a particularly salient measure of prototypical and non-prototypical argumentative essays. The TTR is widely used as a feature in the studies reviewed; Kessler et al. (1997) and Stamatatos et al. (2000b) provide some discussion of the TTR as a relevant feature. The TTR is only helpful if texts of equal sizes are being compared, as the relation between the number of different lexical items and the total number of words in a particular text is not linear. This means that a large number of the different lexical items that occur in the first 100 words of a text will be repeated in the next 100-word chunk; thus the proportion of different types is reduced as the length of the text increases (Biber, 1988:239). Therefore, if the TTR of a very short text is compared to that of a very long text, it will appear that the TTR of the short text is much higher. A solution to this is to standardise the TTR to make for an informative and comparable measure. As this research project makes use of full texts and not abstracts, the texts are of varying lengths. Thus, the TTR of each text was standardised, using Oxford WordSmith Tools 4.0 (Scott, 2004), which calculates the ratio of types to tokens every n words and then computes a running average over all of these calculations. It should be noted that this is just one method among others of calculating the standardised TTR (STTR) (for another method see Tuldava, 1995:131-150). TTRs were extracted, using Oxford WordSmith Tools 4.0 (Scott, 2004), and standardised for 800 words. The shortest texts were just over 800 words long, therefore n = 800 was selected. If the default of n = 1000 were used, these shorter texts would get STTR = 0, which would, of course, not be helpful. In this case, where n = 800, the TTR is calculated anew with every 800 words, and the average TTR over all the calculations is given as the STTR. 3.3.5.5 Sentence count Defining sentences is a troublesome problem in Linguistics (see Roberts, 1960), but it did not suit the purposes of this project to try to incorporate many definitions of sentences. It rather seemed that what was most important was that the chosen definition be applied consistently throughout the texts. In addition, considering the computational nature of this project, a definition that would allow easy automatic retrieval seemed the most sensible. Sentences were thus defined as beginning with a capital letter and ending with a terminal punctuation mark: full stop (.), exclamation mark (!) and question mark (?). Giménez and Màrquez (2006:16), the creators of SVMTagger, also take terminal punctuation to be "unambiguous sentence separators". This further confirmed the definition of sentence for this project (see also Halliday & Matthiessen, 2004:6; and Lebart, Salem & Berry, 1998:36). Sentences were first marked up and then extracted from the texts (see Section 3.5 below for details regarding sentence mark-up). 3.3.5.6 Sentence length in words As with many of the other lexical ratios, which Kessler et al. (1997:34) term "derivational cues", sentence length is commonly used as a feature in automatic genre classification (Kessler et al., 1997, and Karlgren & Cutting, 1994). The average sentence length per document was calculated by simply dividing the number of tokens by the number of sentences in a text.
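A minimal Python sketch of the two calculations just described, the running-average STTR with n = 800 and the average sentence length based on terminal punctuation, is given below. It is an illustration only (the project used Oxford WordSmith Tools for these counts); the tokenisation is deliberately simplified, the treatment of an incomplete final chunk is an assumption, and the file name is hypothetical.

import re

def standardised_ttr(tokens, n=800):
    # Mean type/token ratio over successive n-word chunks; an incomplete final
    # chunk is ignored here, and a text shorter than n words scores 0, as noted above.
    chunks = [tokens[i:i + n] for i in range(0, len(tokens) - n + 1, n)]
    if not chunks:
        return 0.0
    return sum(len(set(chunk)) / len(chunk) for chunk in chunks) / len(chunks)

def sentence_statistics(text):
    # Sentence count based on terminal punctuation (. ! ?) and average sentence length in words.
    tokens = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return len(sentences), len(tokens) / max(len(sentences), 1)

text = open("essay_001.txt", encoding="latin-1").read()    # hypothetical essay file
tokens = [t.lower() for t in re.findall(r"[A-Za-z']+", text)]
print("STTR (n = 800):", standardised_ttr(tokens))
print("sentence count and average length:", sentence_statistics(text))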
3.3.5.7 Paragraph count As the next level in the graphological hierarchy of language (Halliday & Matthiessen, 2004:6-7), paragraphs seem an intuitive feature for inclusion, yet paragraph counts were not a widely used feature in automatic genre classification. It is interesting to note that besides paragraphs being graphological units in texts, they can be viewed as a kind of punctuation marker (Nystrand, 1987:209). This viewpoint extends back to the origin of the paragraph; it was a symbol noted in the margin "to indicate conceptual, narrative, and other shifts in the flow of discourse" (Nystrand, 1987:209). More information on how paragraphs were distinguished and marked up can be found in Section 3.4 and Section 3.5. 3.3.5.8 Paragraph length in sentences In a similar manner to the calculation of sentence length, average paragraph length was calculated by dividing the number of sentences by the number of paragraphs in each text. 3.3.5.9 Readability scores Readability tests are intended to provide an indication of the reading level of a text. Some readability scores, such as the Flesch-Kincaid Grade Level Score, give an indication of the reading level in terms of a grade level score, which is useful in secondary educational contexts. Other readability scores indicate the reading level according to a difficulty/ease scale. This type of readability score is suitable for the purpose of this project. Grade level scores are not helpful for essays at tertiary level, as the essays are all at a high reading level and would thus be equally scored. Two widely used measures of readability for English were selected: the Automated Readability Index (ARI) and the Flesch Reading Ease score. The ARI (Smith & Kincaid, 1970) formula is given by: 4.71 (tc/tw) + 0.5 (tw/tse) - 21.43, (3.1) where tc is the total number of characters, tw is the total number of words, and tse is the total number of sentences. The Flesch Reading Ease (Flesch, 1948) formula is given by: 206.835 - 1.015 (tw/tse) - 84.6 (tsy/tw), (3.2) where tw is the total number of words, tse is the total number of sentences, and tsy is the total number of syllables. Both these scores were calculated for each text (see Dewdney et al., 2001, and Boese, 2005, for examples of automatic genre classification studies making use of readability statistics).
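As an illustration of formulas (3.1) and (3.2), the two scores might be computed as in the Python sketch below. The syllable estimator is a rough heuristic of my own, not the counting rule of any particular tool, and the example sentence counts are placeholders.

import re

def estimate_syllables(word):
    # Very rough syllable estimate: count runs of vowels, with a minimum of one per word.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def automated_readability_index(total_chars, total_words, total_sentences):
    # Formula (3.1).
    return 4.71 * (total_chars / total_words) + 0.5 * (total_words / total_sentences) - 21.43

def flesch_reading_ease(total_words, total_sentences, total_syllables):
    # Formula (3.2).
    return 206.835 - 1.015 * (total_words / total_sentences) - 84.6 * (total_syllables / total_words)

sample = "This essay argues that the evidence is persuasive. It is not conclusive."
words = re.findall(r"[A-Za-z']+", sample)
tc, tw, tse = sum(len(w) for w in words), len(words), 2
tsy = sum(estimate_syllables(w) for w in words)
print(round(automated_readability_index(tc, tw, tse), 2), round(flesch_reading_ease(tw, tse, tsy), 2))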
3.3.6 Word lists 3.3.6.1 Key function words Keywords are typically used in text classification in general and in automatic genre classification. Words are considered as keywords not simply because they occur frequently in a text, but rather because they occur with unusual frequency in a text (or corpus) as compared to another text (or corpus). This means that, in order to establish keyness, a suitable comparison corpus must be identified and a word list extracted from it for comparison. The corpus used in this project was measured against the written section of the BNC (the entire spoken section of the BNC was excluded, and written scripts that were to be read, for example news scripts, were removed), in order to establish keyness. The BNC was chosen as the comparison corpus because it represents the writing of native speakers of British English, which is the language of the project's corpus. Furthermore, the BNC represents texts from a variety of genres and across a broad spectrum of topics. It was thus deemed representative of the language being examined, and suitable for comparison. The key function words were taken from the top thousand keywords in the corpus and they were then extracted, using Oxford WordSmith Tools 4.0 (Scott, 2004). Function words were defined as all words other than nouns and verbs. This project took the top key function words because it seemed more likely that function words would be better indicators of style and genre than, for example, nouns, which are more expressive of content. 3.3.6.2 The top fifty most frequent words in the BNC These words were selected based on Stamatatos et al. (2000a), who, in their study, found the most common words of the BNC to be good discriminators of genre. The list was recalculated from the unlemmatised word frequency list, selected from the written section of the BNC for this research project, using Oxford WordSmith Tools 4.0 (Scott, 2004). As Stamatatos et al. (2000a) considered 's and n't, which this project did not include, some minor differences in the final word list resulted. 3.3.6.3 Prepositions As mentioned earlier, prepositions, as with nominalisations, are linguistic devices used to increase the size of intonation units. It was also mentioned previously that prepositions occur with unusual frequency in academic prose (Chafe & Danielewicz, 1987:98). Biber (1988:237) further reinforces the importance of prepositions in academic discourse, claiming that they are linguistic devices "packing high amounts of information" into academic writing. It can thus be expected that a large number of prepositions would characterise prototypical argumentative essays. For their potentially significant role in distinguishing between prototypical and non-prototypical essays, prepositions were extracted as features of the essays. Three lists of prepositions were used: simple prepositions (Biber, 1988:236-237), two-word complex prepositions, and three-word complex prepositions (Quirk et al., 1985:669-671). The particulars of extracting two-word and three-word prepositions are given in Section 3.5. 3.3.6.4 Reporting verbs As this project considers quotations to be potentially helpful discriminators, reporting verbs that are associated with these quotations were also considered. According to Hyland (2002b:116), reporting verbs indicate a significant rhetorical choice as they allow writers "to convey both the kind of activity reported and whether the claims are to be taken as accepted or not". For example, demonstrate and show can be used to indicate the writer's agreement, whilst hedges, such as suggest and imply, indicate a more hesitant rhetorical choice. Following Biber (1988) and Aires et al. (2005), reporting verbs, which Biber (1988) terms specialised verb classes, are used in this project. They are derived from Quirk et al. (1985:1180), who define two main categories of superordinate verbs, the first of which they term factual, as it accompanies an indicative verb and introduces propositional information. They further divide factual verbs into "public" and "private" (Quirk et al., 1985:1180). This division is based on intellectual states that are observable (public) and not observable (private); examples of public verbs are affirm and mention, while examples of private verbs are assume and establish. Quirk et al. (1985:1180) term the second category suasive. Verbs of this category "imply intentions to bring about" a future change of some kind (Quirk et al., 1985:1180); examples are intend and propose. A list of factual and suasive verbs, derived from Quirk et al.
(1985:1181-1182) associated with quotations (allowance was made for literary extracts and quotations as well) and paraphrasing of citations, was compiled. The category of perception verbs from Quirk et al. (1985:1033) is also included here as they are indicative of academic hedging (Biber, 1988:242), examples are seem and appear, see Section 3.3.6.6 below. 3.3.6.5 Conjunction In English, cohesion is created by conjunction, reference, ellipsis, and lexical organisation (Halliday & Matthiessen, 2004:533). This means that conjunctions have an important function in providing cohesion to texts. In fact, cohesive relationships are one of the three areas that English (1999:22-24) regards as essential in her analytic framework of assessing student academic writing. This research project considers three kinds of conjunction: conjunctive adjuncts, coordinating conjunctions, and subordinating conjunctions (delineated below). 3.3.6.5.1 Conjunctive adjuncts (conjuncts) Conjunctive adjuncts, also known as linking adverbials (Biber et al, 1999:558-560) relate clauses to each other. Establishing relationships between clauses is a component of the system of cohesion. Moreover, Biber (1986) finds that they occur with high frequency in academic discourse. This renders them a potentially relevant indicator of prototypical argumentative essays. Henning (2006) provides further evidence for this feature as essential to the coherence and cohesion of student argumentative essays. CHAPTER 3 72 The word list used is derived from Quirk et al. (1985:631-640) and Biber (1988:239). Similar lists can be found in Halliday and Matthiessen (2004:82). Examples are in comparison, notwithstanding, altogether, and similarly. 3.3.6.5.2 Coordinating conjunctions Coordinating conjunctions, also known as coordinators, join independent clauses, and are similar in function to linking adverbials in that they too link clauses. They differ, however, in that they can only occur at the clause boundary. For the sake of easy automatic extraction, no distinction was made between phrasal and clausal coordination. Furthermore, only simple coordinators were considered because they are the most frequent coordinators, according to Biber et al. (1999:79-83): and, but, and or. The coordinators and and or are particularly more frequent in academic prose than in the other registers they examine (Biber et al., 1999:81). 3.3.6.5.3 Subordinating conjunctions Subordinating conjunctions, also called subordinators, introduce dependent clauses. The list of subordinators was drawn from Biber et al. (1999:85) and Quirk et al. (1985:1078-1079). Examples are although, since, when, and whilst. 3.3.6.6 Hedges Hedging, a term originally introduced by Lakoff (1973), is claimed by Hyland (1998:6) to be "an essential element of academic argument" (see also Flowerdew, 2002:98). Hedges are linguistic devices used by authors to express tentativeness in their propositions and argument. Such devices can be detected and extracted by surface phenomena (Hyland, 1998:3), which suits the purposes of this research project that aims to use only those features that are easily extracted. Thus, although hedging is found in many linguistic forms, at both the lexical (for example, epistemic adverbs) and phrase levels, only features indicating hedging in the form of single words were used in this project. 
As mentioned in Section 3.3.6.4, hedges overlap with other word lists, such as reporting verbs (perception verbs and others), likelihood stance adverbs, and downtoners. Therefore, hedges were extracted as part of the groups above, detailed in Sections 3.3.6.4, 3.3.6.7, and 3.3.6.8.
3.3.6.7 Downtoners
These adverbs are also known as diminishers as they are used to diminish the effect of a modified item. Their selection was based on Biber (1988:240) and Quirk et al. (1985:597-598). Examples are partially, barely, virtually, and hardly.
3.3.6.8 Stance adverbs
Stance adverbs overlap with Halliday and Matthiessen's (2004:125-131) comment adjuncts, which express the author's attitude towards the content of the text or message. The selection of these adverbs is derived from Biber et al. (1999:557-558, 853-874) and Biber et al. (2004:32).
1. Non-factual adverbs: Non-factual adverbs are used to express the author's feeling regarding the quality of certain information (Biber et al., 2004:32). Examples are confidentially, typically, strictly, and frankly.
2. Factual adverbs: These adverbs are used to comment on the reality of a proposition (Biber et al., 1999:557). Examples are always, undoubtedly, indeed, and certainly.
3. Likelihood adverbs: Such adverbs are used to show propositional limitations and demonstrate evidence for a proposition whilst not being specific as to the source (Biber et al., 1999:557). Examples are predictably, roughly, apparently, and perhaps.
4. Attitudinal adverbs: These types of adverbs indicate the attitude of a writer towards a proposition (Biber et al., 1999:558). Examples are fortunately, surprisingly, astonishingly, and curiously.
3.3.6.9 Stance adjectives
3.3.6.9.1 Stance adjectives controlling that-clauses
According to Biber et al. (1999:671-674), adjectives controlling that-clauses all control stance. The following features were selected based on Biber et al. (2004:34-35).
1. Attitudinal adjectives: Such adjectives demonstrate a writer's evaluation of a proposition or event. Examples are amused, necessary, unusual, and incidental.
2. Certainty/factual adjectives: These adjectives indicate certainty on the part of the writer regarding particular propositions (Biber et al., 1999:671). Examples are false, true, correct, and inevitable.
3. Likelihood adjectives: Examples are probable, possible, doubtful, and likely.
3.3.6.9.2 Stance adjectives controlling to-clauses
The following list represents the major semantic domains in which adjectives that control to-clauses occur (Biber et al., 1999:716-721). Again, these adjectives were selected according to Biber et al. (2004:35).
1. Certainty adjectives: Examples are unlikely, guaranteed, apt, and liable.
2. Ability/willingness adjectives: Examples are careful, fit, unable, and able.
3. Personal affective adjectives: Examples are perturbed, glad, surprised, and astonished.
4. Ease/difficulty adjectives: Examples are pleasant, easier, difficult, and tough.
5. Evaluative adjectives: Examples are desirable, better, wonderful, and useless.
3.3.6.10 Nouns
3.3.6.10.1 Stance nouns taking that-clauses
According to Biber et al. (1999:648), that-clause complements of nouns are "one of the primary devices used to mark stance in academic prose". The list of stance nouns below was derived from Biber et al. (2004:33) and Biber et al. (1999:648-651).
1. Factual nouns: Examples are knowledge, realisation, assertion, and fact.
2.
Likelihood nouns Examples are impression, idea, suspicion, and contention. 3. Non-factual nouns Examples are news, report, proposition, and comment. 4. Attitudinal nouns Examples are hope, reason, view, and thought. 3.3.6.10.2 Nouns taking ro-clauses According to Biber et al. (1999:652), these nouns occur frequently in academic discourse. Thus, in following Biber et al. (2004), the controlling nouns were used as features. Examples are proposal, intention, potential, and authority. An overview of all the features used as attributes of texts in the corpus has been provided above. Many of these features, especially those that are part of word lists, were simply extracted, using STATISTICA Text Mining and Document Retrieval module (Statsoft, 2006) with no annotation required. However, other features required annotation; what these features are, how they have been annotated and the reasons for their annotation are discussed in Section 3.5. The next section provides all the details relating to preparing the texts for mark-up and subsequent feature extraction. 3.4 Text preparation before feature extraction This section provides all the information relating to the preparation of all the texts before feature extraction. Since this section details some preliminaries prior to mark-up there is some reference, and occasional (but logically necessary) overlap with Section CHAPTER 3 76 3.5, which details mark-up of the text for features that cannot be automatically extracted. The first part of text preparation presents all the necessary information, relating to the cleaning and preparation of the data before feature extraction. It should be noted that unless stipulated, this preparation was done, using regular expressions. The scripts were written in Perl, and compiled and run on Linux. The full scripts can be found on the accompanying CD. The texts were prepared by first converting them to text format. This means that all formatting, such as holding, italics and varying text size, was removed. The texts were originally encoded in Unicode but this character encoding was problematic for processing by the SVMTagger,60 as the English models used for training are in Latin-1, known more formally as ISO/IEC 8859-1 (Gimenez, 2007). All the texts were thus converted to Latin-1, using Utrac-0.3.0, a command-line tool that converts character sets (Calando, 2004).61 The converted texts were also used for the CLAWS4 tagger that takes plain text ASCII files as input. Next, the essay questions, essay titles, bibliography, appendices, headings, and footnotes were removed manually. As the essays are written by students from the broad spectrum of faculties, a mixture of graphs, illustrations, tables, and equations occur. These were not considered as part of the data and were removed from the essays, by first marking them up, using regular expressions and then removing them (along with their tags) automatically. In order to facilitate paragraph mark-up, all indents indicating paragraphs were removed, and paragraphs were separated by a blank line. Punctuation marks that were used as features are detailed in Section 3.5. It is necessary to note here, however, that brackets (square and angle) were removed automatically because content in these brackets were mainly equations. Other common punctuation marks and symbols (%, =, <, >63 and &) were converted to words to allow them to be tagged as words rather than as symbols (the tag SYM, used for symbols is not reliably tagged, see Appendix 2). 
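The cleaning steps just described were implemented in the project as Perl regular expressions run on Linux. The fragment below is an illustrative Python approximation of the bracket removal and symbol-to-word conversion only; the particular word substitutions are assumptions made for the example and do not reproduce the project's actual scripts.

```python
import re

# Assumed symbol-to-word substitutions, so that taggers treat these
# characters as words rather than as symbols (cf. the discussion above).
SYMBOL_WORDS = {
    "%": " percent ",
    "=": " equals ",
    "<": " less than ",
    ">": " greater than ",
    "&": " and ",
}

def clean_text(text: str) -> str:
    # Remove square- and angle-bracketed material (mainly equations).
    text = re.sub(r"\[[^\]]*\]", " ", text)
    text = re.sub(r"<[^>]*>", " ", text)
    # Convert remaining symbols to words.
    for symbol, word in SYMBOL_WORDS.items():
        text = text.replace(symbol, word)
    # Collapse whitespace introduced by the substitutions.
    return re.sub(r"[ \t]+", " ", text).strip()

print(clean_text("Growth was 5% [see eq. 2] & above the mean"))
```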
The ellipsis character, if inserted as a special character in Section 3.5 discusses this tool, which was used for part-of-speech tagging. 61 The tool is freely available at http://utrac.sourceforge.net/download/utrac-03.0.tar.gz. 62 Section 3.5 also discusses this tool, which was used for part-of-speech tagging. 63 This does not refer to angle brackets used as brackets but rather to the mathematical symbols bigger than and less than. DEVELOPING THE CLASSIFIER 77 Microsoft Word, yields a strange symbol in text format. This practice was very common among the writers of the essays, thus this symbol was converted to the usual three ellipsis dots (...). Non-standard usage of the apostrophe, according to Standard British English writing conventions (Truss, 2003:46) in, for example 1960's if indicating a plural, not possession, was removed. This was done for the taggers, which otherwise would tag all instances of such occurrences as possessive. In cases where long quotations (an instance of non-integrated quotations) were not marked with quotation marks, quotation marks were inserted manually. Such quotations were then also incorporated into the paragraph preceding them, if the quotations were introduced by a colon or any other clausal punctuation (including a comma). All 'smart' quotation marks (",", ' and ') were replaced with straight quotation marks (' and "). In addition, all quotation marks were standardised as double straight quotation marks ("). Cases of play titles and book titles were standardised to single straight quotation marks Q. Apostrophes for both possession and omission were retained in 'smart' apostrophe form during XML mark-up, which, as previously mentioned, is detailed in Section 3.5. This was done to allow for easy distinction between single straight quotation marks and apostrophes. For the set of texts used for submission to the taggers, all non-integrated quotations were removed because they were regarded as extra-corpus material, as they are long sections of text not written by the author of the essay and thus considered not characteristic of the essay itself (this is in following Nelson, 2002:10). This is also true of integrated quotations. However, these quotations form part of the sentence and could not be removed without detrimental effect to the POS tagging results. Unfortunately, removal of the non-integrated quotations could not be done without first marking-up the texts with non-integrated quotation tags, in order to flag and remove the quotation itself. The mark-up script did all the marking-up simultaneously; therefore, all other tags had to be removed again for the texts used for POS tagging. In addition, all 'smart' apostrophes were converted to straight apostrophes, as 'smart' apostrophes are not CHAPTER 3 78 included in the Latin-1 character set. The final step was then tokenising the texts, so that the input texts had one word (a token), punctuation mark, or number per line. After text preparation and before feature extraction, annotation of various features was first required, in order to extract them. Throughout this section, some mention has been made of annotation. The next section provides all the details relating to the mark-up of features. 3.5 Annotation of the corpus This section describes the process of annotating the features of the essays that could not be easily extracted without annotation, using Oxford WordSmith Tools 4.0 (Scott, 2004) or STATISTICA Text Mining and Document Retrieval module (Statsoft, 2006). 
These features are tagged with two types of tags: POS tags and XML tags. The section first discusses POS tagging and then XML tagging. 3.5.1 POS tagging The texts were grammatically annotated for parts-of-speech, using a software program called a POS tagger, which assigns to each word in a text its POS in context (Garside & Smith, 1997:103). POS taggers can take three approaches: a stochastic approach, a rule- based approach, or a hybrid approach (a combination of both the stochastic and the rule- based approaches). Stochastic taggers select the preferred tag of a word based on probabilities, which in turn are based on the tags in the training corpus.64 Such probabilities are calculated according to the frequency of the particular tag as well as the probability of the tag based on its immediate neighbours (physical context) in terms of both their words and associated tags (Garside & Smith, 1997: 102). Rule-based taggers make use of linguistic knowledge to tag words according to syntactic correctness (Jackson & Moulinier, 2002:13),65 for example, such taggers make use of the phrase- structure rules of a particular language. See, for example, Brants (2000) for a stochastic tagger. See, for example, Brill (1992) for a rule-based tagger. DEVELOPING THE CLASSIFIER 79 There were two requirements for selecting a POS tagger for this project: 1. it had to have a high degree of accuracy; and 2. it had to require no training, that is, have already been trained and evaluated on English POS tagging. Stochastic taggers require less handcrafting, and therefore, less manual annotation. Thus, with the requirements for a POS tagger, for this project, in mind, the SVMTagger from SVMTool was selected.66 According to Smith (1997:138), the benchmark error rate for POS taggers is 3 to 5 percent. SVMTool's reported accuracy is higher than any other tagger, according to the literature, with a reported accuracy of 97.16 percent for English on the WSJ corpus (Gimenez & Marquez, 2004). It makes use of the Penn Treebank tagset (Marcus, Santorini & Marcinkiewicz, 1993), which is made up of 36 tags (the complete list of POS tags is available in Appendix 2). This is the tagset, which English POS taggers, in the main, are trained on. Initially, the relatively small size of this tagset appeared to be favourable for this research project where the feature set was already very large. For the purposes of this project, the English models based on the WSJ corpus were used. These models are the ones on which SVMTagger is trained, which suited the purposes of this project that required a trained tagger. The corpus was tagged, using the Perl version of SVMTool vl.3. The specific usage and options used can be found in Appendix 3. Later in this study, it was hypothesised that a larger tagset might highlight linguistic subtleties that would not be noticed with the Penn Treebank tagset. Furthermore, it was postulated that a tagset, which extracted finer linguistic detail may extract potentially useful discriminatory features, and thus a classifier built on this tagset would perform more accurately than one built on the Penn Treebank tagset (this hypothesis will be discussed further in Chapter 4). An option would have been to train SVMTool, using an expanded tagset. However, as has been mentioned, a trained tagger was desired for this project. 
For this reason, the texts were tagged, using the CLAWS4 (Constituent Likelihood Automatic Word-tagging System) tagger, which is also a stochastic tagger Freely downloadable from http://www.lsi.upc.edu/~nlp/SVMTool/. The models are also downloadable from http://www.lsi.upc.edu/~nlp/SVMTool/. CHAPTER 3 80 like SVMTooir The tagger was developed at the universities of Lancaster, Oslo, and Bergen by Roger Garside (Garside, 1987:30) and is written in C. This version of the tagger uses the CLAWS7 tagset, which is a modification of the original CLAWS 1 tagset. The CLAWS7 tagset consists of 140 tags. Similar to SVMTool, CLAWS also performed well in previous projects, and consistently achieved a 96 - 97 percent accuracy. The version of the tagger that was used for this research project was used to tag circa 100 million words of the BNC, for which an error rate of 1.5 percent, with circa 3.3 percent ambiguities unresolved, was achieved (UCREL, 2006). Classifiers were thus trained, using the Penn Treebank tags and the CLAWS7 tags separately, in order to test the hypothesis presented here. 3.5.2 XML tagging In addition to POS tags, XML mark-up was used for sentences, paragraphs, quotations, references, punctuation marks, nominalisations (identified by suffix), two- and three- word complex prepositions, two-word adverbs, and multi-word conjuncts. The XML tags were inserted by matching characters, using regular expressions. The scripts were written in Perl, and compiled and run on Linux. Full scripts can be found on the accompanying CD, and the XML tags that were subsequently extracted as data for the classifier, can be viewed in Appendix 2. The resulting tags, such as the POS tags and unannotated features, were then extracted, using STATISTICA Text Mining and Document Retrieval module (Statsoft, 2006). The scripts used to do the sentence mark-up were mostly error free. However, some manual proof checking was required to correct errors such as abbreviations with full stops, that were sometimes incorrectly marked-up with end-of-sentence tags, and capitalised words, that were sometimes incorrectly marked-up with beginning-of- sentence tags. In addition to sentence mark-up, paragraphs, quotations, and references were marked up and their tags extracted. Quotations were marked-up, either as integrated or non- integrated quotations. This distinction and separate tagging of integrated and non- 68 The CLAWS4 tagger is not free but it can be freely sampled, for more, see http://www.comp.lancs.ac.uk/ucreiyclaws/. DEVELOPING THE CLASSIFIER 81 integrated quotations follows the mark-up used for the International Corpus of English (Nelson, 2002). Sentences within non-integrated quotations were also marked-up separately. These were marked up to extract counts on the features, and to flag them for removal before POS tagging. Sentence, quotation, and reference mark-up were necessary because sentence, reference, or quotation distinctions needing to be retrieved could not be automatically retrieved through feature extraction. This was not so for punctuation marks, nominalisations, two- and three-word complex prepositions, two-word adverbs, or multi-word conjuncts. They can be automatically extracted from the text by simple pattern matching. However, STATISTICA Text Mining and Document Retrieval module (Statsoft, 2006) cannot extract punctuation marks, nor is it capable of fuzzy text matching, which was necessary for extracting nominalisations. It was also somewhat unreliable in extracting phrases. 
Therefore, punctuation marks, nominalisations, three-word complex prepositions, two- word complex prepositions, two-word likelihood stance adverbs, two-word factual stance adverbs, and multi-word conjuncts were also marked-up with XML tags, which could be matched and then extracted using STATISTICA Text Mining and Document Retrieval module (Statsoft, 2006). This section has described the annotation of essays used in training the classifier. The next section details the preparation of the data before submission to the classifier. 3.6 Data preparation before classification The preceding section has discussed the mark-up of the texts, using POS tags and XML tags. This section discusses the preparation of the data before it was submitted to the SVM classifier. Once all the features discussed in Section 3.3 were extracted from the texts, some data processing was necessary. Firstly, a dependent variable needed to be calculated. The dependent variable is the variable that depends on the predictor variables to determine its value. In the case of this project, the dependent variable is the variable, which indicates the classification of the essays: 'good' or 'bad' examples of the argumentative essay genre. CHAPTER 3 82 In order to calculate the values of the 'good'/'bad' variable for each essay, the original grades of all the essays are required. As all the essays from the BAWE were written for real assignments in various departments, this means that the original grade assigned to each was already available. No additional grading by hand was therefore necessary. However, three values for grading are found in the BAWE: typical percentage grading (for example, 60 percent), class grading (2:1 and 1:0) and evaluative grading (either satisfactory or excellent). Percentage grading was selected as the more usual value, and so it was decided to standardise the other two types of grading to percentage grading. Of course, this choice is arbitrary. The only requirement is that only one value was selected to represent the dependent variable, viz. grade. The standardisation of grading type was done as follows. Only two types of class grading are given in the BAWE: those representing first degrees (1:0) and upper second degrees (2:1). First degrees, which fall within 70 percent and above, were converted to 85 percent (Holmes, 2007b), while upper second degrees, which fall within 60 to 69 percent, were converted to 65 percent (Holmes, 2007b). Similarly, evaluative grading in the BAWE is only of two types: excellent and satisfactory. The first evaluative grade, excellent, represents a first degree and the second evaluative grade, satisfactory, an upper second degree (Holmes, 2007a). Thus, following the conventions for first and second degrees stated above, the first evaluative grade was taken to represent 85 percent, and the second evaluative grade 65 percent. The standardised grading available for every essay could now be used to compute the 'good'/'bad' variable. This was done by calculating the median of the grades, which was 68 percent. Therefore, any essay with a mark of 68 percent or above was considered 'good', and likewise any essay with a mark of 67 percent or under was considered 'bad'. In addition to this first classification scheme by grade, for the purposes of a separate experiment, essays were grouped as 'good' examples if they had a mark of 70 percent or above and as 'bad' examples if they had a mark of 65 percent or less. 
This method removed the essays that had a middle-band grade of between 66 and 69 percent. The DEVELOPING THE CLASSIFIER 83 idea was derived from Finn (2002:71), whose pre-classification of restaurant reviews as negative for marks above 15 or positive for marks above 23, precludes the middle-band marks, which are more difficult to classify. In this project, it was hypothesised that the perceptible difference (in terms of the linguistic features used here) of essays that are of 'average' grade, that is less prototypical, is not sufficiently discriminant to determine prototypical ('good') or non-prototypical ('bad'). This is illustrated in Figure 3.1 below. It therefore seems sensible to rather use only essays that are most likely to be definitely 'good' or 'bad', by using only the upper- and lower-band grades. 'bad' examples 'average' examples 'good' examples argumentative essay genre Figure 3.1: Illustrating the genre classification task, 'average' examples The classifier was trained separately, on both these sets, the results of which will be reported in Chapter 4. It should be noted that, for the real application of this classifier, essays with middle-band grades will need to be classified. The removal of essays of 'average' grade is experimented with in this project, in order to determine whether the features used in this project are indicative of essays on either side of the continuum. This experiment does not imply that the classifier will be unable to classify essays falling in the middle-band grade. Secondly, because of multiple (two or three) occurrences of terms in the word lists used for feature extraction, some data cleaning was necessary69. Multiple occurrences were therefore removed from the data. If this was not done these words would be given more weight in classification than other singly occurring words. This is because the frequencies of features lend them weight in classification, rather than multiple counts of their frequencies. The multiple occurrences arise from more general word lists such as The list of terms with multiple occurrences in the word lists can be found in Appendix 3. CHAPTER 3 84 the top fifty words in the BNC and the top function keywords. They do not arise because the words serve different functions. For example, whilst, occurs in the subordinating conjunctions list and in the top function keywords list. Thirdly, before raw frequencies of terms could be given to the classifier, normalisation was required. Normalisation in this sense entails bringing each feature vector to unit- length according to the notions of length in a particular vector space. For this project, the data were normalised to text length, which corresponds to the Li-norm.70 Text- length normalisation provides a way for texts of different lengths to be compared and is common practice in corpus linguistics and genre classification studies (see for example, Dewdney et ah, 2001; also Santini, 2005c). It was hypothesised that for this project text-length normalised frequencies would yield better results than raw frequencies as frequencies normalised to text length are less skewed. Text-length normalisation is given by (Biber, Conrad & Reppen, 2000:265): rwf/twxn, (3.3) where rwf is the raw frequency of a particular word/term in a document, tw is the total number of words in a document, and n is the basis for norming. This basis should be selected based on the average number of words in all the documents being compared. 
For the normalisation of the raw frequency counts for this project n = 1000. For example, if the raw frequency count for variable VBD is 53 in a text of 1483 words, the normalised frequency is given by: 53/1483 x 1000 = 35.74 per 1000 words. Once features were extracted they were each normalised in this way by using an equation in the spreadsheet they were extracted to. This process was not automatic, however, as this equation was re-entered for each feature (but not for each text). Obviously, no normalisation was required for word length, sentence length, or paragraph length as they are not raw frequency counts and represent averages per document.
70 This norm is referred to as the city-block norm and corresponds to the linguistic notion of text length. See Leopold and Kinderman (2002) for more on this, and also on using the L2-norm (Euclidean norm) instead of the L1-norm.
The raw frequency counts were also transformed, using logarithmic transformation and inverse document frequency transformation (idf, as explained in Section 2.3.2), which are both typical transformations used in text classification (Forsbom, 2005:7; see Lee & Myaeng, 2002 for a genre classification study that uses idf with good results; also Leopold & Kinderman, 2002 for an SVM study in text classification that uses idf). Logarithmic transformation is a type of normalisation used to transform data that do not follow a normal distribution, which is an important distributional assumption, essential to many statistical tests and the interpretation of results.71 Normalisation in this sense means that frequency distributions are transformed to follow a normal distribution. Logarithmic transformation is one of three options provided by Rietveld and Van Hout (2005:130-131), in order to overcome situations where the variables do not follow a normal distribution. It is computed, using STATISTICA Text Mining and Document Retrieval (Statsoft, 2006) and is given by:72
f(wf) = 1 + log(wf), for all wf > 0, (3.4)
where wf is the frequency of a particular word/term in a document. Using the previous example, the logarithmic transformation frequency for variable VBD is given by: f(wf) = 1 + log(53). Inverse document frequency is also computed, using STATISTICA Text Mining and Document Retrieval (Statsoft, 2006) and is given by:73
f(wf_ij) = 0, if wf_ij = 0;
f(wf_ij) = (1 + log(wf_ij)) log(N / df_i), if wf_ij >= 1,
where wf_ij is the frequency of a particular word/term w_i in document j, N is the number of documents, and df_i is the number of documents in the collection that contain the word/term w_i. All the raw frequencies of variables were normalised and transformed in this manner.74 This resulted in four sets of data:
1. a set of raw data frequencies;
2. a set of text-length normalised frequencies;
3. a set of log-transformed frequencies; and
4. a set of idf frequencies.
71 See Oakes (1998:3-5) for more on normal distribution.
72 This transformation uses log base e (Statsoft, 2004).
73 Like the logarithmic transformation, the idf uses log base 10.
74 All the raw and normalised frequencies for the actual extracted features can be found on the accompanying CD.
All four data sets were used to train four separate classifiers. The results of testing these classifiers will be discussed in Chapter 4, and the complete results in terms of both training and testing are available on the accompanying CD.
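The arithmetic behind these four data sets can be summarised compactly. The sketch below, in Python with NumPy, assumes a document-by-feature matrix of raw counts and per-document word counts; it only restates Equations 3.3 and 3.4 and the idf transformation for illustration (the project itself computed these values in a spreadsheet and in STATISTICA, and the exact logarithm bases follow that implementation, as noted in the footnotes above).

```python
import numpy as np

def build_data_sets(raw_counts, text_lengths, n=1000):
    """Derive the four data sets described above from raw feature counts.
    Illustrative only; natural logs are used here throughout."""
    raw = np.asarray(raw_counts, dtype=float)
    lengths = np.asarray(text_lengths, dtype=float).reshape(-1, 1)
    mask = raw > 0
    log_raw = np.zeros_like(raw)
    log_raw[mask] = np.log(raw[mask])

    raw_set = raw                                   # 1. raw frequencies
    normalised = raw / lengths * n                  # 2. rwf / tw * n (Eq. 3.3)
    log_set = np.where(mask, 1 + log_raw, 0.0)      # 3. 1 + log(wf) (Eq. 3.4)

    n_docs = raw.shape[0]
    doc_freq = np.maximum(mask.sum(axis=0), 1)      # documents containing each term
    idf_set = np.where(mask, (1 + log_raw) * np.log(n_docs / doc_freq), 0.0)  # 4. idf

    return raw_set, normalised, log_set, idf_set

# Worked example from the text: 53 occurrences of VBD in a 1,483-word essay.
counts, lengths = [[53]], [1483]
print(build_data_sets(counts, lengths)[1])   # -> about 35.74 per 1,000 words
```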
Finally, various feature selection methods were used to select those variables that are better predictors of the dependent variable.75 Feature selection was only conducted on the text-length normalised data as this data set proved to perform the best (this will be discussed in Chapter 4). Feature selection appears not to be as relevant for SVMs as for many other machine learning and statistical classification techniques (such as decision trees); nevertheless, some studies have reported that with many irrelevant features SVMs may perform poorly (see for example, Weston, Mukherjee, Chapelle, Pontil, Poggio & Vapnik, 2001; also Liu & Zheng, 2006). Feature selection was thus done for this project, in order to compare the performance of a classifier built on the full feature set, that is, with no feature selection, to a classifier built on a reduced feature set, that is, with feature selection. Variables were selected, using two kinds of tests:
1. parametric tests, which assume normal distribution; and
2. non-parametric tests, which do not make this assumption.
These two types of tests were required to compare predictive results, as some of the variables used follow a normal distribution and others do not.76 The tests assuming normality were the t-test for independent samples and the STATISTICA Feature Selection and Variable Screening module (Statsoft, 2006). The latter requires specification of the number of cuts (k) into which the values of the predictor variables must be divided. The value selected for the number of cuts detects monotone to complex non-monotone relationships between the predictor variables and the dependent variable. The default value for k is 10, with smaller values detecting simpler relationships. For this research project, k was set equal to 2, 3, 4, 5, 6, 7, 8, 9, and 10.
75 Note that the term feature selection has two meanings; see Section 3.7.3 for more on this.
76 For each variable's descriptive statistics and for a sense of variable distributions see the accompanying CD.
The tests not making use of the normality assumption were the Mann-Whitney U test (see Oakes, 1998:17 for more on this test)77 and the Kolmogorov-Smirnov Two-Sample Test. The non-parametric tests are the second of the three options given by Rietveld and Van Hout (2005:132) for cases in which the normality assumption is violated. Many such tests are available; the two tests used here were selected as they can be used for comparisons between two independent groups (see Siegel & Castellan, 1988 for a discussion of non-parametric tests). The parametric tests and non-parametric tests were only used on normal and non-normal variables respectively. Furthermore, in order for any particular feature to be accepted, both tests for each test type were required to agree on the inclusion of this feature. The best predictor variables according to these feature selection tests, together with the full results of the feature selection techniques described above, are documented in Appendix 5. The results of the classifier built on the set of the best predictor variables will be presented in Chapter 4. Following McClave and Sincich (2000:229-232), normality was determined using four descriptive methods. The first two are visual assessment methods and involve constructing a histogram fitted with a normal curve, as well as a normal probability plot for each variable. The histogram shows that the data are normal if they follow the curve.
Similarly, the normal probability plot shows that the data are normal if they follow the straight line closely. The third method for assessing normality was to determine the interquartile range (IQR) and the standard deviation of each variable.

This function, f(x), is a linear function of x ∈ X and is given by Cristianini and Shawe-Taylor (2000:9):
f(x) = ⟨w · x⟩ + b, (3.8)
where w and b are the parameters controlling f(x); w is the weight vector perpendicular to the hyperplane and b is the threshold (bias) that moves the hyperplane parallel to its former position.80 The decision rule, which governs the estimate of the target function f(x), determines that an instance will be classified as -1 if f(x) < 0 or as +1 if f(x) ≥ 0. This situation can be more easily understood if interpreted geometrically, as the training space is divided into two classes by the hyperplane defined by Equation 3.8, as illustrated in Figure 3.2. The vector w is a line with direction in space and it is said to be normal to the hyperplane.
Figure 3.2: A separating hyperplane (adapted from Cristianini & Shawe-Taylor, 2000:10)
This method of learning by a separating hyperplane is called an ill-posed81 problem because it has many different solutions, not all of which are equally helpful. In order to make this problem well-posed, one solution is to optimise a cost function, which ensures a unique solution (on the condition that a solution exists). This would mean selecting, from a hypothesis space of various hyperplanes, the hyperplane that is a maximum distance from the data points, thus maximising the margin (Cristianini & Shawe-Taylor, 2000:19). This is the maximal margin (optimal) hyperplane that separates the data points (feature vectors) without error and such that the distance between the plane and the nearest vector is maximised. The unique solution is thus characterised by maximum distance and minimum error. The functional margin of a training instance (x_i, y_i) is defined with respect to the hyperplane (w, b) as (Cristianini & Shawe-Taylor, 2000:11):
γ_i = y_i(⟨w · x_i⟩ + b). (3.9)
81 See Vapnik (2000:233-240) for more on ill-posed problems.
This definition encompasses an inherent degree of freedom in that w and b can be rescaled from (w, b) to (λw, λb), λ ∈ R+, without changing the function, while making the functional margin bigger. The solution to this scaling situation implies constraining the value of w. The geometric margin is defined as the functional margin of a normalised82 weight vector; it remains unchanged if w and b are rescaled, because the functional margin is divided by ||w||, where the length of w is defined in terms of the Euclidean notion of length (Cristianini & Shawe-Taylor, 2000:94-95; also Scholkopf & Smola, 2002:192). This norm is termed the Euclidean or L2-norm. In order to determine the maximal margin hyperplane, the geometric margin must be maximised. This can be done by minimising the weight vector's (Euclidean) norm and by making the functional margin equal to 1. The latter ensures a constraint on the value of w and it implies that ⟨w · x+⟩ + b = +1 and ⟨w · x-⟩ + b = -1 for the closest positive and negative training points x+ and x-. Using substitution, the geometric margin can be given by:
γ = 1/2 (⟨(w/||w||) · x+⟩ - ⟨(w/||w||) · x-⟩) = 1/||w||. (3.10)
The above discussion leads to the actual optimisation problem. The maximal margin hyperplane, which will successfully separate the training set by realising the geometric margin given in Equation 3.10, must minimise the objective function:
Φ(w) = 1/2 ⟨w · w⟩, (3.11)
subject to the constraints:
y_i(⟨w · x_i⟩ + b) ≥ 1, i = 1, ..., ℓ. (3.12)
If the resulting decision function takes a value greater than 0, a data point will be classified as +1 (Abe, 2005:20). This is illustrated in Figure 3.3 below.
Figure 3.3: A maximal margin hyperplane that perfectly separates the training set, with its support vectors (SVs)
The optimal hyperplane classifier that has been explained in this section is referred to as the hard margin classifier (Abe, 2005:19; also Hechter, 2004:45). It is the simplest form of the SVM and serves to demonstrate the theory of SVMs. However, for real problems, data are often not linearly separable, or it may be that the data are linearly separable overall except for a few data points. The former problem can be resolved by kernels, which are introduced in the following section, and the latter problem can be solved by an extension of the hard margin classifier called the soft margin classifier, which is discussed in Section 3.7.4. The discussion on soft margin classifiers does not follow the discussion on hard margin classifiers directly as kernels are used in the soft margin equations, and some prior explanation of them is thus required.
3.7.3 Kernels83
As mentioned above, the optimisation problem for non-linearly separable data cannot be solved using the hard margin classifier discussed in the preceding section. Kernels provide a way of transforming non-linearly separable data into a high-dimensional feature space, in order to separate the data points linearly. This project only made use of linear kernel SVM classification but will report on preliminary test results in Chapter 4, using the Radial Basis Function (RBF) kernel, in order to motivate the choice of linear kernel SVM classification for this research project. Therefore, the discussion that follows is brief, presenting essentially the basic concept behind kernels; see Cristianini and Shawe-Taylor (2000:26-49), Scholkopf and Smola (2002:200-204), and Hechter (2004:53-59) for more detailed discussion of kernels and their mathematical derivation.
83 For a detailed discussion of kernels see Scholkopf and Smola (2002:25-54).
In order to understand how kernels work it is necessary to first explain input space and feature space. In machine learning, pre-processing of data involves selecting a particular representation of the data, in order to suit the particular learning task. Changing the representation of the input data into another representation essentially maps the input variables in the input space X into features in the feature space F. This situation can be represented as follows (Cristianini & Shawe-Taylor, 2000:27):
x = (x_1, ..., x_n) → φ(x) = (φ_1(x), ..., φ_N(x)). (3.13)
The variables, which have been described in Section 3.3, are the input variables, and their representations in F are termed features (Cristianini & Shawe-Taylor, 2000:27).84 In this context, there are two uses of the term feature selection. In the first usage, the term refers to selecting the representation of the input variables; in the second usage, the term refers to detecting and eliminating irrelevant attributes (Cristianini & Shawe-Taylor, 2000:29).85 Feature selection of the former usage often seeks to reduce the number of features, in order to reduce the dimensionality of the problem. This is generally considered useful as generalisation and computational performance degrade as the feature set increases; this is commonly referred to as 'the curse of dimensionality' (Cristianini & Shawe-Taylor, 2000:28).86 The problem can be overcome in SVMs by using kernels (Cristianini & Shawe-Taylor, 2000:28). The basic premise of kernels is that non-linearly separable data cannot be separated by a linear function in X but can be separated in F.87 This situation is illustrated in Figure 3.4.
84 These variables have been termed features according to common practice. However, in the context of machine learning this term is not strictly correct. Henceforth, these variables will be termed attributes or input variables.
85 It is this latter term that is meant in Section 3.6.
86 See Bellman (1961).
87 This is mathematically motivated by Cover's theorem (Cover, 1965).
Figure 3.4: Feature mapping φ: X → F (Cristianini & Shawe-Taylor, 2000:28)
This means that non-linearly separable data need to be mapped to a feature space where a linear machine can be used, using a set of non-linear features. The building of non-linear learning machines is thus conducted in two steps. The first step is to transform the data into F, and the second step is to classify the transformed data in this F, using a linear machine (Cristianini & Shawe-Taylor, 2000:30). The dual representation introduced for linear machines in Section 3.7.2 allows each hypothesis to be expressed as a linear combination of the training points. This means that the decision rule can be evaluated, using the inner products (also dot products) of the test and training points (Cristianini & Shawe-Taylor, 2000:30).
A kernel function allows for the direct calculation of the inner product in F, as a function of the original input data points. In this way, it merges the two steps of building non-linear machines. It is defined as (Cristianini & Shawe-Taylor, 2000:27):
K(x, z) = ⟨φ(x) · φ(z)⟩, for all x, z ∈ X. (3.14)
This means that the kernel function will allow for the calculation of the decision function of optimal separating hyperplanes without needing to know the feature map, that is, knowing K means not needing to know φ. This is referred to as the 'kernel trick' (Scholkopf & Smola, 2002:201). It is based on work by Aizerman, Braverman and Rozoner (1964) but was first applied to non-linear SVMs by Boser et al. (1992). As previously mentioned, this project reports on results obtained, using the RBF kernel. These results will be discussed in Chapter 4 and are available for review on the accompanying CD. The next section discusses the extension of hard margin classifiers to soft margin classifiers.
3.7.4 The soft margin classifier
The chief problem with the maximal margin classifier is that its hypothesis solution can be severely affected by any noise, for example an outlier data point. Cortes and Vapnik (1995) introduce a solution to this problem in the form of slack variables. These variables allow the constraints of the margin (3.12) to be violated on condition that there is some increase in the value of the objective function (3.11) at the optimal solution. This allows the training set to be separated with some errors, but without allowing too many misclassification errors (underfitting the data), whilst ensuring that the maximal margin requirements are achieved. Redefinition of the margin in this way results in what is referred to as the soft margin approach (Scholkopf & Smola, 2002:204). The simplest way of including the slack variables in (3.6) is called the C-SV classifier.
The objective function is given by Scholkopf and Smola (2002:205):
Φ(w, ξ) = 1/2 ⟨w · w⟩ + C Σ_{i=1}^{ℓ} ξ_i, with C > 0, (3.15)
subject to:
ξ_i ≥ 0, i = 1, ..., ℓ; and (3.16)
the relaxed separation constraints:
y_i(⟨w · x_i⟩ + b) ≥ 1 - ξ_i, i = 1, ..., ℓ. (3.17)
As with the hard margin classifier, the primal optimisation problem for the C-SV classifier can be solved by using the Lagrangian function. Again, the primal Lagrangian must be minimised with respect to the primal variables and maximised with respect to the dual variables. In this way, the dual representation of the Lagrangian is found. The dual is found in the same way as with the hard margin classifier, by finding a saddle point and eliminating the primal variables. The Lagrangian variables are then used to solve the quadratic optimisation problem, by solving for the value of the weight vector w, the geometric margin γ and the threshold b. The decision function is similar to that of the hard margin classifier. The cost parameter C of the C-SV classifier determines the trade-off between minimising the training error and maximising the margin. The optimal value of C cannot be calculated a priori. Instead, it must be selected with prior knowledge of the amount of noise in the training set S (Hechter, 2004:51). In order to establish this value of C it is common practice to test performance on a separate test set or with v-fold cross-validation, using the training set with a wide range of values for C (Cristianini & Shawe-Taylor, 2000:104; also Abe, 2005:72-74).88 As this value cannot be calculated a priori, a modification was suggested in the form of the v-parameter, which replaces the C-parameter (Scholkopf, Smola, Williamson & Bartlett, 2000). The v-parameter, taking a value between 0 and 1, controls the number of margin errors and SVs. The realisation of a soft margin classifier that uses this parameter is called the v-SV classifier. For this classifier the objective function in (3.11) is given by Scholkopf and Smola (2002:206):
88 V-fold cross-validation and the values used for calculating the optimal C value for this project will be discussed in Chapter 4.
Φ(w, ξ, ρ) = 1/2 ⟨w · w⟩ - vρ + (1/ℓ) Σ_{i=1}^{ℓ} ξ_i, (3.18)
subject to the constraints:
y_i(⟨w · x_i⟩ + b) ≥ ρ - ξ_i, i = 1, ..., ℓ, and (3.19)
ξ_i ≥ 0, ρ ≥ 0. (3.20)
As with the hard margin classifier and the C-SV classifier, the primal optimisation problem for the v-SV classifier can be solved by using the Lagrangian function. Again, the primal Lagrangian must be minimised with respect to the primal variables and maximised with respect to the dual variables. In this way, the dual representation of the Lagrangian is found. The dual is found in the same way as with the hard margin classifier and the C-SV classifier, by finding a saddle point and eliminating the primal variables. The Lagrangian variables are then used to solve the quadratic optimisation problem, by solving for the value of the weight vector w, the geometric margin γ and the threshold b. The decision function is similar to that of the hard margin classifier and the C-SV classifier. This research project used C-SV classification but will report preliminary results on v-SV classification in Chapter 4. For this reason, the v-SV classifier has been briefly described with reference to the C-SV classifier. For a more detailed discussion of the v-SV classifier see Chen, Lin and Scholkopf (2005), and for information on the differences between these two soft margin classifiers, see Chang and Lin (2001).
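To make the contrast between the two parameterisations concrete, the sketch below shows how a C-SV classifier and a v-SV classifier are trained on the same data. It uses scikit-learn purely for illustration (the project itself used the STATISTICA Support Vector Machine module), and the synthetic data set is an assumed stand-in for the essay feature vectors; the value C = 32 echoes the setting reported for the best classifier in this study.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC, NuSVC

# Synthetic stand-in for the 346 essay feature vectors (an assumption).
X, y = make_classification(n_samples=346, n_features=50, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# C-SV classification: C trades off margin width against training errors.
c_svm = SVC(kernel="linear", C=32).fit(X_train, y_train)

# v-SV classification: v (in (0, 1]) bounds the fraction of margin errors
# and support vectors, replacing the C parameter.
nu_svm = NuSVC(kernel="linear", nu=0.4).fit(X_train, y_train)

print("C-SV test accuracy :", c_svm.score(X_test, y_test))
print("v-SV test accuracy :", nu_svm.score(X_test, y_test))
```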
Besides the two variants of SVMs discussed above, there are other SVM variants that are not relevant to this study and, therefore, are not discussed here. One widely used variant for text classification, the transductive SVM, which is based on transductive inference (Vapnik, 2000:293), was first introduced and discussed in Joachims (1999b). Good discussions of further variants of SVMs can be found in Abe (2005:129-154) and Kroon (2003:33^11). DEVELOPING THE CLASSIFIER 101 3.8 Summary This chapter provided detailed background of the data and learning methodology used to develop the genre classifier. The data for this study were derived from a corpus of essays, which represent a section of the BAWE corpus. These essays have already been graded by human evaluators. The task now is to select the right types of features that will allow the classifier to learn to assign a similar evaluation to human evaluators. This chapter discussed all the features deemed potentially relevant as good predictors of prototypical or non-prototypical argumentative essays. The features were selected from those used in previous automatic genre classification studies and research in the field of academic discourse in general. Thereafter the processes of preparing the texts before feature extraction were detailed, and the annotation of features that could not be easily extracted discussed. Then information regarding the standardisation of the dependent variable, the removal of multiple occurrences of terms, data transformation, and feature selection were provided. Lastly, SVMs, the machine learning method that was used to train the genre classifier, was presented for both the linearly separable and the linearly inseparable cases. Three hypotheses arose from the data and were introduced in Chapter 3. The first hypothesis was that text-length normalised frequencies would be likely to yield better results than the raw frequencies as text-length normalised frequencies are less skewed. The second hypothesis was that a classifier trained on the CLAWS7 feature set would outperform the classifier trained on the Penn Treebank feature set because the CLAWS7 tagset extracts finer linguistic detail than the Penn Treebank tagset. The third hypothesis that has been postulated is that essays of an average grade are linguistically harder to distinguish and that the removal of such essays would improve the accuracy of the classifier. These hypotheses will be further examined in Chapter 4. Chapter 4 will detail how the data were used in training the SVM classifier, what training parameters were used, and the manner in which they were selected. It will also discuss the results of the classifier, detail well-known evaluation measures, and provide the most suitable accuracy measure for this project. RESULTS: TRAINING, EVALUATION AND INTERPRETATION 102 CHAPTER 4 Results: Training, evaluation and interpretation SVMs are a rare example of a methodology where geometric intuition, elegant mathematics, theoretical guarantees, and practical algorithms meet (Bennett & Campbell, 2000:9) 4.1 Introduction In Chapter 3, the features and motivation for their selection were discussed. The machine learning technique used to train a LM to classify prototypical ('good') and non-prototypical ('bad') argumentative essays was also explained in the preceding chapter. The results of training such classifiers are required to be interpreted in terms of the objectives of this research project. 
As stated in Chapter 1, the classifier is trained to distinguish between instances of prototypical and non-prototypical examples of the argumentative essay genre, and this is done with a practical aim in mind. The classifier is intended to reduce the amount of marking for a senior marker while ensuring that non-prototypical essays are given the necessary attention. Thus, if the classifier succeeds in classifying essays into prototypical and non-prototypical classes, prototypical essays can be given to a junior marker and non-prototypical essays to a senior marker. The assumption here is that prototypical essays require less specialised attention, while the opposite is the case for non-prototypical essays. Moreover, the workload of a senior marker can be reduced by such a classifier, allowing more attention to be given to non-prototypical essays. In light of these objectives, it can be seen that classifying a 'good' example as 'bad' is undesirable but not detrimental, whereas misclassification of a 'bad' example will mean that a less experienced marker will be assigned the essay. This could prove to be problematic, as 'bad' examples require specialised attention. From this, it is clear that the latter misclassification is the more serious of the two. Therefore, the results presented here are discussed in light of these misclassification errors. This chapter commences with Section 4.2, in which the selection of the learning parameters used for training the classifier, v-fold cross-validation and theoretical concerns of this technique are detailed. In this section, three potential data problems that CHAPTER 4 103 are relevant to this research project, are introduced and, where relevant, their implications and solutions are presented. These problems relate to imbalanced data sets, differing misclassification costs and the normal distribution assumption. Thereafter, in Section 4.3, the most common evaluation measures and indicators are described, and the most relevant measure for this project is presented. Next, in Section 4.4, the results of the classifiers trained and tested are reported and discussed, and the results of the best classifier are determined.89 Furthermore, comparison of the classifiers built on each data set, using different kernels, features and cases is made. The hypotheses introduced in Chapter 3 are addressed in terms of the results of the classifiers built on the data concerned. In addition to this, the features selected, using the feature selection methods reported in Chapter 3, are discussed. Finally, in Section 4.5, the results presented in Section 4.4 are analysed, and potential explanations for the classifier's performance are examined. 4.2 Training the classifier, using support vector machines This section discusses the process of training the SVM classifier in terms of the selection of the learning parameters used for training the classifier. The section also addresses potential data problems such as imbalanced data sets, differing misclassification costs and the normal distribution assumption of the data set. 4.2.1 Parameter selection The parameters that influence the position and orientation of the maximal hyperplane are known as learning parameters (Eitrich & Lang, 2006:427). Selection of these parameters is essential for obtaining an accurate SVM (Hechter, 2004:108). This selection is based on performance measures of various values of the parameters. 
The values are usually based on a grid search of an interval of values for each parameter (Eitrich & Lang, 2006:428).90 Each value within this interval is determined according to the increment selected, that is, the amount by which the previous value increases in the following test. Each of the values within the value interval can be tested, using v-fold cross-validation or a separate training and testing set. The former is the most commonly used method (Bennett & Campbell, 2000; also Scholkopf & Smola, 2002:217).
89 The results reported are those obtained on the test sets, unless otherwise stipulated. Also note that the words train and build are used synonymously.
90 Exhaustive search of all potential parameter values.
For v-fold cross-validation, the training set is first divided into v sets, then the LM is trained on v-1 sets and validated on the one set excluded from training. This is performed over v training runs, moving one set up for each run so that once the process has been completed each set has served as a validation set. The performance over each run is averaged. This determines the accuracy of the particular parameter value being tested. The parameter value with the highest cross-validation accuracy is then selected as the most appropriate value, and this value is used to train the whole training set. Similarly, in cases where separate training and testing sets are used, the parameter with the highest accuracy is used to train the whole training set. Unfortunately, there are some theoretical precautions regarding v-fold cross-validation. Firstly, this approach can lead to overfitting as the parameters are optimised on the same set as the one used for training. Secondly, the optimal parameter settings for the full training set of size ℓ and the reduced set of size ((v-1)/v)ℓ are often not the same; and thirdly, it is possible that there is a phase transition on the learning curve between ℓ and ((v-1)/v)ℓ (Scholkopf & Smola, 2002:217). This could result in a large generalisation error, as a function of the difference in size between both sets (Scholkopf & Smola, 2002:217). Scholkopf and Smola (2002:217) suggest that the process of a grid search, using v-fold cross-validation, could be avoided by making an educated guess at the optimal parameter values. Among others, they present the following methods of making such an educated guess (Scholkopf & Smola, 2002:218):
1. Select parameter settings that have been known to perform accurately for similar problems; and
2. Incorporate prior expectation of the error rate. For v-SV classifiers, the knowledge of the typical error rate for a particular problem can be incorporated by selecting a value for v which is in the test error range. For C-SV classifiers, this knowledge can be incorporated by selecting a large value for C and reducing it until the number of margin errors is in a suitable range (below the error rate).
Unfortunately, both these suggestions proved problematic for this research project. The first suggestion was followed but revealed very little as there is no other research (to the best of my knowledge) that attempts to use genre classification to distinguish between prototypical and non-prototypical student essays, or indeed any kind of academic prose, in the manner that this project seeks to do. The most relevant automatic genre classification studies reviewed in Chapter 2 that use SVMs do not provide details of the steps they followed in selecting their models and parameters.
Dewdney et al. (2001:5) mention that they make use of SVMlight's default settings, but they do not specify any more detail; and Argamon and Dodick (2004a:5) also state that they use the default settings of the SMO algorithm (Platt, 1999) implemented in the Weka package (Witten & Frank, 1999). The second suggestion could not be implemented, as no direct access to the actual equations in the algorithm used was available. Despite their cautions regarding the use of v-fold cross-validation to determine the optimal parameter settings, Scholkopf and Smola (2002:217) state that excellent results can still be obtained, using the same parameter settings as determined during a grid search, using v-fold cross-validation on the whole training set. In following Hsu et al. (2003:5-7), this project makes use of a grid search where the values for C are 2^-5, 2^-3, 2^-1, 2^1, 2^3, 2^5, 2^7, 2^9, 2^11, 2^13, and 2^15; the values for v are 0.08, 0.16, 0.24, 0.32, 0.4, 0.48, 0.56, 0.64, 0.72, and 0.8; and the values for γ are 2^-15, 2^-13, 2^-11, 2^-9, 2^-7, 2^-5, 2^-3, 2^-1, 2^1, 2^3, and 2^5. These values were tested, using a 75 percent training set and a 25 percent testing set for the text-length normalised data, and 5-fold cross-validation for the raw data, log-transformed data and the idf data.91 The most commonly used values for v-fold cross-validation in the research projects reviewed in Chapter 2 are 5, 10, and 20. This project made use of 5-fold cross-validation because the case set is relatively small (it is comprised of 346 texts).
91 Please note that a seed of 2000 was used for all tests, using a 75 percent training set and a 25 percent testing set. The default seed in Statistica (Statsoft, 2006) is 1000, but better performance was achieved, using a seed of 2000.
The C value interval that was used for the raw data, log-transformed data and the idf data was searched, using an increment of 0.09375, that is, the value of the parameter was incremented by 0.09375 for each 5-fold test. This value
Once the individual values that performed best had been determined for all the data sets, a finer grid search was conducted. The values used for this finer search differed for each set; the exact values used can be found on the accompanying CD. These values were once again tested using a 75 percent training set and a 25 percent testing set.

4.2.2 Potential data problems: imbalanced data sets, misclassification costs and the normal assumption

This section introduces three potential data problems that are relevant to this research project. The potential problems raised are those of imbalanced data sets, differing misclassification costs, and the normal distribution assumption.

In addition to the problems put forth by Scholkopf and Smola (2002:217), as discussed in Section 4.2.1, Eitrich and Lang (2006:425) point out that the learning parameters are difficult to determine using a grid search for an imbalanced data set and/or if the cost for a false negative classification is very high while a false positive is acceptable. Both these conditions hold for this project, as the 'bad'/'good' essay ratio is 137/209, and the cost of classifying an essay labelled 'bad' as 'good' is high, whereas classifying an essay labelled 'good' as 'bad' is not desirable but acceptable.

The terms false negative, false positive, true positive and true negative are derived from hypothesis testing, and thus work from the basis of a null hypothesis. As such, these terms are evaluated from the point of view of the null hypothesis or from one class. For this project, evaluation is on the basis of the non-prototypical ('bad') essays. This is illustrated in Figure 4.1 below:

TRUE POSITIVES: 'bad' examples classified 'bad'    |    FALSE POSITIVES: 'good' examples classified 'bad'
FALSE NEGATIVES: 'bad' examples classified 'good'  |    TRUE NEGATIVES: 'good' examples classified 'good'

Figure 4.1: Matrix showing classification and misclassification terms

The first problem put forth by Eitrich and Lang (2006:425) above, that of imbalanced data sets, has a solution in STATISTICA SVM (Statsoft, 2006). This SVM implementation provides a penalty option as a solution for imbalanced data sets (Statsoft, 2006). In the case of this project, 'good' examples were penalised by the ratio of 'good'/'bad' examples, for example, 209/137, which gives a penalty of 1.53 for v-fold cross-validation tests. This means that the LM will not classify a particular essay as 'good' simply because there are more 'good' examples (see Karakoulas & Shawe-Taylor, 1999 for more on imbalanced data sets).

The subject of the second problem put forth by Eitrich and Lang (2006:425) is differing misclassification costs. These costs indicate the relative severity of different misclassifications. The standard SVM learner generally assumes these costs are equal, while learning situations in which they are not can be considered non-standard (Lin, Lee & Wahba, 2002). In order to take the differing misclassification costs into consideration for training, the decision surface of the hyperplane needs to be selected to minimise the expected cost of future misclassifications (Hand, 1997:7).
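As an aside, the penalty idea can be illustrated with the per-class weights available in scikit-learn. This is an analogue offered for illustration only, not the STATISTICA option actually used in the project, and the label encoding (1 = 'bad', 0 = 'good') is assumed.

```python
# A sketch of handling the 137/209 class imbalance with per-class weights,
# as a scikit-learn analogue of the penalty option described above.
# Labels are assumed to be encoded 1 = 'bad', 0 = 'good'.
from sklearn.svm import SVC

n_good, n_bad = 209, 137
penalty = n_good / n_bad        # approximately 1.53

# In scikit-learn, class_weight multiplies the error cost C for each class.
# Up-weighting the minority ('bad') class by the 'good'/'bad' ratio has the
# same balancing effect as penalising the majority class by that ratio.
clf = SVC(kernel='linear', C=1.0, class_weight={1: penalty, 0: 1.0})

# class_weight='balanced' derives equivalent weights automatically from the
# class frequencies observed in the training data.
clf_auto = SVC(kernel='linear', C=1.0, class_weight='balanced')
```

Per-class weights of this kind are also a common, if rough, way of approximating unequal misclassification costs in implementations that otherwise assume the standard, equal-cost formulation.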
The SVM package used for this research project implements SVMs for the standard situation and thus does not allow for the resolution of this problem (see Lin et al., 2002 for a study that addresses SVM learning solutions for the non-standard situation).

The third potential data problem is the assumption of normality. It should be noted that establishing normality, as discussed in Section 3.6, was important for determining the tests to be used for feature selection, but that a normal distribution is not important for the SVM training itself. In Section 3.7, it was stated that the entire data set used for SVMs is assumed to consist of independent and identically distributed observations. This has two important implications for SVM training: first, both training and testing data are assumed to be drawn from the same distribution, and second, this distribution is not assumed to be a normal one (Noble, 2006:1566). The first implication can be a problem for SVMs, as the training and testing data sets may not share the same distribution (see Hand, 1997:9-10). Furthermore, the distribution of the entire data set, which is intended to be a sample of the target population, may not actually share the same distribution as the target population. This is addressed in more detail by Lin et al. (2002:192). For this project, the latter problem is not considered to be relevant, as the classification task is not a population problem. It is concerned with the features distinguishing prototypical and non-prototypical examples of the argumentative essay and the distribution of these features, rather than the distribution of prototypical and non-prototypical examples of the argumentative essay in general. For this project, it was hypothesised that normalisation of the data in the form of the logarithmic transformation would not affect the classifier's results, if compared with the classifier trained and tested on the raw data set.

The next section introduces various commonly used evaluation measures. These measures are discussed in terms of their relevance to this project, and the measure indicating good performance for this research project is introduced.

4.3 Evaluation indicators and metrics

The objective of evaluating performance is to determine how well a classifier will perform in classifying new instances (Hand, 1997:9), that is, how well it will generalise. This is considered a foremost concern for this project. However, performance on the training set is also relevant, as it provides an indication of the good design of the features as well as the fit of the model. For example, the performance on the training set, if compared with the performance on the testing set, can show whether the model is overfitting the data. A variety of evaluation metrics and indicators are used in the automatic genre classification studies reviewed in Chapter 2, the most common of which are error rate, accuracy, the baseline, improved performance, precision, recall, and the F-measure. This section discusses all of these measures and details which measures are most relevant for this research project.

The first evaluation metric that is addressed is error rate or misclassification rate, which gives the proportion of objects that were misclassified. In classification studies, it is desirable to reduce the error rate as much as possible, and this measure is therefore the most popular evaluation measure (Hand, 1997:98).
However, it is problematic because it treats all types of misclassification equally. For example, in the case of this project, error rate will treat misclassification of 'bad' examples and misclassification of 'good' examples as equally undesirable. This is problematic because misclassification of 'bad' examples will result in some essays that require specific attention from a senior marker being neglected and instead passed on to a junior marker. Therefore, misclassification of 'bad' examples is to be regarded as more serious than misclassification of 'good' examples. As discussed in Section 4.2.2, differing misclassification costs need to be incorporated into the selection of the decision surface. The decision surface which simply minimises the error rate arises from the assumption that the misclassification costs are equal (Hand, 1997:7). This measure is therefore not considered to be helpful for this research project and is not reported on. (All the confusion matrices of the tests conducted can be found on the accompanying CD; the error rate can be calculated from these matrices if required.)

The second evaluation metric, that of accuracy, provides a measure of the overall performance of the classifier, that is, of the proportion of correctly classified data, and is given by Weiss and Provost (2003:322) as:

accuracy = (tp + tn) / (tp + fp + tn + fn),    (4.1)

where tp are true positives ('bad' examples correctly classified as 'bad'), fp are false positives ('good' examples incorrectly classified as 'bad'), tn are true negatives ('good' examples correctly classified as 'good') and fn are false negatives ('bad' examples incorrectly classified as 'good'). For this research project, accuracy is reported for the training set and the testing set. However, this measure is not sensitive to false negatives in particular, which is the misclassification type that most concerns this project.

Accuracy is often evaluated in terms of a baseline, which is a useful indicator in determining whether performance is actually good or just appears to be good. It is determined by assigning the class label of the biggest class to each instance in the test data (Manning & Schütze, 1999:234; see also Kessler et al., 1997). In this research project, the testing baseline for the experiments using a 75 percent training set and a 25 percent testing set is 63.2 percent. This is derived from the test set, which consists of fifty-five 'good' and thirty-two 'bad' texts. Should the classifier simply label all the texts 'good', this would yield an accuracy of 63.2 percent. This figure is considered the baseline for this project. The training baseline for the experiments using a 75 percent training set and a 25 percent testing set is 59.5 percent. This is derived from the training set, which consists of 154 'good' and 105 'bad' texts. Should the classifier simply label all the texts 'good', this would yield an accuracy of 59.5 percent. Any accuracy above the training and testing baselines can be considered to indicate an improved performance. However, the baselines are simply reported here for means of comparison with other studies. They are not inherently useful for this project, as they provide no indication of false negatives, that is, 'bad' examples incorrectly classified as 'good'.
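A small worked example may make Equation 4.1 and the two baselines concrete; the helper function below is illustrative only and is not part of the project's toolchain.

```python
# Equation 4.1 and the baselines, as a worked example (illustrative helper
# function only, not project code).
def accuracy(tp, fp, tn, fn):
    """Proportion of correctly classified examples (Equation 4.1)."""
    return (tp + tn) / (tp + fp + tn + fn)

# Testing baseline: label every text in the 25 percent test set (55 'good',
# 32 'bad') as 'good', so tp = 0, fp = 0, tn = 55, fn = 32.
print(round(accuracy(tp=0, fp=0, tn=55, fn=32), 3))    # 0.632 -> 63.2 percent

# Training baseline: label every text in the 75 percent training set
# (154 'good', 105 'bad') as 'good'.
print(round(accuracy(tp=0, fp=0, tn=154, fn=105), 3))  # 0.595 -> 59.5 percent
```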
In addition to baselines, performance that improves on the accuracies obtained by previous projects on a particular task is well established in Natural Language Processing as a means of evaluation (Manning & Schütze, 1999:267). This too, however, does not suffice as a means of evaluation for this research project, as there is no previous work on the exact same classification task to which an adequate comparison can be made.

As mentioned above, accuracy measures are not sensitive to false positives and false negatives. Precision, recall, and the F-measure are metrics that address this issue. Precision is a measure of the proportion of items correctly classified of those that were classified as a particular class and is given by Manning and Schütze (1999:268) as:

precision = tp / (tp + fp).    (4.2)

Recall is a measure of the proportion of items of a particular class that were correctly classified as that class and is given by Manning and Schütze (1999:269) as:

recall = tp / (tp + fn).    (4.3)

Precision and recall can be combined into a measure known as the F-measure, a variant of the E-measure that was introduced by Van Rijsbergen (1975:174-175), where F = 1 - E. It is given by Manning and Schütze (1999:269) as:

F = 1 / (α(1/P) + (1 - α)(1/R)),    (4.4)

where P is precision, R is recall and α is a factor determining the weighting of precision and recall. Depending on the value selected for α, the F-measure can be made more sensitive to either precision or recall.

This project reports on overall recall, but it is rather the recall of 'bad' examples (particularly of the testing set) that is most relevant in evaluating the results of the SVM classifier. This project is most concerned with false negatives because classifying a 'bad' text as 'good' would mean that the incorrectly classified text would not be given sufficient attention by the senior marker, which would defeat the purpose of the classifier. Therefore, good performance for this project is to be determined by high recall of 'bad' texts. Naturally, recall of the 'good' examples is also important, as it would not be helpful for the classifier to simply label every text 'bad'. In summary, errors in labelling 'good' examples are more acceptable (yet still undesirable) than errors in labelling 'bad' examples.

The F-measure combines overall precision and recall and, as stated previously, can be adjusted to favour one above the other. However, as this project is concerned with the recall of 'bad' examples specifically and not just overall recall, as described above, the F-measure is not helpful. Therefore, this evaluation measure is not reported on (for an example of a study that makes use of this measure, see Shepard, Waters & Kennedy, 2004). (As with the error rate, the F-measure can be calculated from the confusion matrices on the accompanying CD if desired. The best value for the weight is usually 0.5; if this value is set to less than 1, recall is favoured; conversely, if set to more than 1, precision is favoured (Jurafsky & Martin, 2000:578).)

In addition to the measures already discussed, this research project reports on the average recall for both 'bad' and 'good' examples. It should be noted that average does not mean overall; rather, the average is calculated by dividing the sum of the recall values (in percent) for the two classes by 2. For this average, the baseline is 50 percent. Similar to the accuracy baselines, the average baseline is derived from the test set, which consists of fifty-five 'good' and thirty-two 'bad' texts. Should the classifier simply label all the texts 'good', this would yield a recall of 100 percent for the 'good' texts and a recall of 0 percent for the 'bad' texts. The sum of these two values divided by 2 yields a baseline of 50 percent.
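The measures in Equations 4.2 to 4.4, and the average recall used in this chapter, can likewise be written out as small helper functions; these are illustrative sketches only, not project code.

```python
# Equations 4.2-4.4 and the average recall, as illustrative helper functions.
# Counts are taken from the point of view of the 'bad' class, as in Figure 4.1.
def precision(tp, fp):
    return tp / (tp + fp)                      # Equation 4.2

def recall(tp, fn):
    return tp / (tp + fn)                      # Equation 4.3

def f_measure(p, r, alpha=0.5):
    """Equation 4.4: F = 1 / (alpha/P + (1 - alpha)/R)."""
    return 1.0 / (alpha / p + (1.0 - alpha) / r)

def average_recall(tp, fp, tn, fn):
    """Mean of the recall of 'bad' and 'good' examples."""
    recall_bad = tp / (tp + fn)
    recall_good = tn / (tn + fp)
    return (recall_bad + recall_good) / 2.0

# Baseline for average recall: labelling every text 'good' gives a recall of
# 0 percent for 'bad' and 100 percent for 'good', so the average is 50 percent.
print(average_recall(tp=0, fp=0, tn=55, fn=32))    # 0.5
```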
The next section presents the results of the SV classifiers built on each data set, using different kernels, features, and cases.

4.4 Results and discussion

This section presents and discusses the results of the SV classifiers. The results for all four data sets are presented in Table 4.1 and are the best results obtained on the recall of 'bad' examples (henceforth RB). Detailed results for all tests using each parameter value, as discussed in Section 4.2.2, are available on the accompanying CD. (These results are given in percent.) The discussion of the results given in Table 4.1 follows below.

Table 4.1: Best results (by recall of 'bad' examples) for the classifiers built on each data set

Data set | Feature set            | Type of SV classification | Type of kernel | Cases     | RB       | RG       | AR       | Training accuracy | Testing accuracy
RAW      | Complete set           | C-SV                      | Linear         | All (346) | 62.50000 | 74.54545 | 68.52273 | 100               | 70.115
LOG      | Complete set           | C-SV                      | Linear         | All (346) | 31.25000 | 89.09091 | 60.17045 | 86.873            | 67.816
IDF      | Complete set           | C-SV                      | Linear         | All (346) | 34.37500 | 60.00000 | 47.18750 | 100               | 50.575
TLN      | Complete set           | C-SV                      | Linear         | All (346) | 46.87500 | 72.72727 | 59.80114 | 100               | 63.218
TLN      | Feature selection      | C-SV                      | Linear         | All (346) | 62.50000 | 74.54545 | 68.52273 | 62.934            | 70.115
TLN      | Penn Treebank POS tags | C-SV                      | Linear         | All (346) | 40.62500 | 76.36364 | 58.49432 | 70.656            | 63.218
TLN      | CLAWS7 POS tags        | C-SV                      | Linear         | All (346) | 50.00000 | 63.63636 | 56.81818 | 76.448            | 58.621
TLN      | Top fifty BNC          | C-SV                      | Linear         | All (346) | 50.00000 | 61.81818 | 55.90909 | 58.301            | 57.471
TLN      | Complete set           | C-SV                      | RBF            | All (346) | 46.87500 | 80.00000 | 63.43750 | 100               | 67.816
TLN      | Complete set           | nu-SV                     | Linear         | All (346) | 46.87500 | 70.90909 | 58.89205 | 100               | 62.069
TLN      | Complete set           | nu-SV                     | RBF            | All (346) | 50.00000 | 43.63636 | 46.81818 | 56.757            | 45.977
TLN      | Complete set           | C-SV                      | Linear         | MGR (263) | 48.27587 | 52.94118 | 50.60852 | 98.446            | 50.794

The results of the SV classifiers using each data set are discussed with reference to the text-length normalised data set. This is the set upon which further training and testing were conducted, as it proved to perform best and is the most linguistically intuitive. Results for the raw data set are reported here in order to determine whether it is necessary to normalise the data to the text length or whether satisfactory results can be obtained using the data as is. This is a worthy pursuit, as the text-length normalised data require substantially more manual effort than the raw data set. In addition to this, the results for the raw data set are reported for comparison with the results obtained on the log-transformed data set, in order to determine whether this transformation affects classification performance.

The results of the C-SV classifier built on the text-length normalised data set, using the complete feature set, a linear kernel and with the parameter C = 0.3125, show a RB of 46.9 percent and a recall of 'good' examples (henceforth RG) of 72.7 percent. This yields an average recall (henceforth AR) of 59.8 percent, which is a 9.8 percent improvement on the average baseline, as discussed in Section 4.3. The results of this classifier can be compared with the results of the classifiers built on the log-transformed data and the idf data.
The results of the C-SV classifier built on the log-transformed data set, using the complete feature set, a linear kernel and with the parameter C = 0.046625, have a RB of 31.3 percent and a RG of 89.1 percent. This yields an AR of 60.2 percent, which is a 10.2 percent improvement on the average baseline. However, as this project is concerned with RB rather then RG, it is this value that shows that the classifier built on the log-transformed data set underperformed the classifier built on the text-length normalised data set. The results of the classifier built on the text-length normalised data set can also be compared with the classifier built on the idf data set. The results of the C-SV classifier built on the idf data set, using the complete feature set, a linear kernel and with the parameter C = 1.953125, have a RB of 34.4 percent and a RG of 60.0 percent. This yields an AR of 47.2 percent, which is 2.8 percent below the average baseline. The RB of this classifier shows a 3.1 percent improvement over the RB of the classifier built on the log-transformed data set. As with the classifier built on the log-transformed data set, CHAPTER 4 115 the classifier built on the text-length normalised data set outperformed the classifier built on the idfdata. set. This is shown by a 12.5 percent gain on the RB value on the text-length normalised data set. From this, it is clear that results on the text-length normalised data are better than on the log-transformed data set and the idf data set. As previously noted in Section 4.2, the distribution assumptions made for SVM learning do not assume that the data is normally distributed. Thus, it would be expected that whether the data is normal or not would not affect performance. As discussed in Section 3.6, logarithmic transformation is intended to transform non-normal data so that they approximate a normal distribution. The AR values of the classifier built on the raw data set (68.5 percent) and the classifier built on the log-transformed data set (60.2 percent) differ by 8.3 percent. This difference is fairly substantial, especially if the separate RB values and RG values of the two classifiers are examined. It can be seen that the classifier built on the raw data set yields a RG of 14.6 percent more and a RB of 31.2 percent more than the classifier built on the log-transformed data set. It thus appears that the log-transformation had a detrimental effect on the classifier, which implies that the normality of the distribution has a negative effect on SVM learning. This means that this experiment cannot provide conclusive evidence for SVM learning not being affected by the normality of the data distribution. As the classifier built on the text-length normalised data set outperformed both the classifiers built on the log-transformed data set and the idf data set, as discussed above, all further tests were conducted, using the text-length normalised data set. Initially, tests were conducted by training on subsets of variables. These tests were conducted separately on selected variables, the Penn Treebank tags, the CLAWS7 tags, and the top fifty words of the BNC. Next, tests were conducted, using only the cases without middle-band marks. Thereafter, tests were conducted, training on the complete feature set but using the RBF kernel instead of the linear kernel. Finally, tests were conducted, using the complete feature set but using v- SV classification (with the linear kernel and the RBF kernel) instead of C- SV classification. 
First, the results on the raw data set and then the results of the tests as described above are presented. RESULTS: TRAINING, EVALUATION AND INTERPRETATION 116 As previously mentioned, the results for the raw data set are reported here for comparison with the results obtained on the text-length normalised data set, in order to determine whether the gain in recall on the 'bad' examples necessitates the extra human effort required for the latter data set. The results of the C- SV classifier built on the raw data set, using the complete feature set, a linear kernel and with the parameter C = 1.953125, have a RB of 62.5 percent and a RG of 74.5 percent. This yields an AR of 68.5 percent, which is an 18.5 percent improvement on the average baseline. These results are favourable and show a 15.6 percent increase in RB over the text-length normalised data set. Furthermore, although both classifiers obtained 100 percent in training accuracy, the classifier trained on the raw data set overfitted the data by a smaller margin than the classifier trained on the text-length normalised data set. It is possible that if the texts are all of a similar length, text-length normalisation would not produce any perceptibly different classification results. However, the length of the essays used in this project range from approximately 800 to 4000 words. Therefore, the results would seem to indicate that text-length normalisation may be an unnecessary step for good classification. The good results on the raw data set equal the best results obtained on the C-SV classifier built on the text-length normalised data set, using feature selection, a linear kernel and with the parameter C = 32, in terms of RB and RG on the testing data set (the feature selection process is discussed in Section 3.6). The major difference between these two classifiers is their training results. As mentioned above, training on the raw data set with the complete feature set resulted in a 100 percent training accuracy. This contrasts greatly with the results of training on the text-length normalised data set with feature selection. The latter classifier resulted in a training accuracy of only 62.9 percent, which may be 3.4 percent above the baseline but which is 37.1 percent below the best training accuracy achieved, indicating substantially poorer overall performance. This provides further evidence for using the raw data set as is. The low training accuracy also has implications for feature selection itself. The training result shows that although feature selection may have helped improve testing accuracy, the removal of the rest of the feature set was detrimental to training. This means that the classifier trained on the reduced feature set underfitted the training data. It may be that CHAPTER 4 117 the features used in this project are all valuable to some extent in discriminating between the essays, and so the removal of the majority of the features resulted in a 37.1 percent loss in discriminatory ability. In Section 3.6, it was stated that feature selection does not appear to be as essential to SVM learning as to other machine learning techniques, such as decision trees, because SVM learning can learn well in high dimensions. That this is the case is not disputed here, but it does appear that there is some evidence in favour of more careful feature selection. 
This is so, not because too many features can produce poor results, but rather because too many irrelevant features can produce poorer results, and it is difficult to determine a priori, with certainty, which features are irrelevant. A good case is provided for more exploratory techniques, such as cluster analysis, before presenting the features to the LM. The higher accuracy results using feature selection thus provide some evidence towards the findings of some studies that have reported poor performance of SVMs during learning with many irrelevant features (such as Weston et al., 2001).

Thirty-eight of the 812 features were selected in the manner described in Section 3.6. These are:
1. seven Penn Treebank tags: JJ, NN, PDT, POS, TO, VBD, and VBG;
2. eleven CLAWS7 tags: GE, PPHO2, PPHS2, RP, VBDR, VBDZ, VDD, VDI, VHD, VHN, and WZ;
3. two quotation tags: quotations non-integrated with text and sentence counts for non-integrated quotations;
4. readability scores: the ARI readability score;
5. number of references;
6. eight from the BNC top fifty: had, they, this, to, up, was, were, and when;
7. key function words of the top 1000 key words: this and to;
8. simple prepositions: to;
9. two-word complex prepositions: as to and on to;
10. private factual verbs: recall and find;
11. subordinating conjunctions: when;
12. factual stance adverbs: actually and never;
13. two-word factual stance adverbs: in fact; and
14. factual stance nouns: result.

It should be noted that seven of the above features have multiple membership and are thus listed more than once in the above list. These are:
1. ' and 's, CLAWS7 tags and Penn Treebank tags;
2. had, CLAWS7 tags and the BNC top fifty;
3. was, CLAWS7 tags and the BNC top fifty;
4. were, CLAWS7 tags and the BNC top fifty;
5. when, subordinating conjunctions and the BNC top fifty;
6. this, key function words of the top 1000 key words and the BNC top fifty; and
7. to, Penn Treebank tags, key function words of the top 1000 key words, the BNC top fifty, and simple prepositions.

It is interesting to note that the features that were selected are, in the main, function words or indicative of function words (POS tags). Words typically associated with academic writing (as discussed in Chapter 3), such as hedges, downtoners, and various kinds of linking adverbials, were not selected as features. Contrary to expectation, other categories linked to academic writing, such as subordinating conjunctions and reporting verbs, are represented by only three features: recall, find, and when. Of course, this does not indicate that other academic words are not useful indicators of prototypical or non-prototypical argumentative essays. They might simply be less useful indicators (by varying margins) than the features selected. It is also possible that these words would have been selected if other kinds of feature selection techniques were used.

In addition to the above remarks, other significant features were also selected. These features are quotations non-integrated with the text, sentence counts for non-integrated quotations, the ARI readability score, and the average number of references. The first two of these features are unique to this research project, and it therefore appears to be promising that they were selected as good discriminators of prototypical and non-prototypical argumentative essays. This means that they can be considered useful for future projects (ideas in this regard will be developed further in Chapter 5).
The ARI readability score indicates that readability measures can provide useful information for CHAPTER 4 119 classifying prototypical and non-prototypical argumentative essays. This also encourages further exploration of readability measures (see Chapter 5). The selection of the average number of references as a good classifying feature is also encouraging. This is because it is expected that prototypical argumentative essays will put forth opposing or collaborative arguments and provide evidence for their argumentation, which would entail references to other research. Therefore, it makes linguistic sense that this feature is considered highly discriminatory. From the feature selection, it can also be deduced that the Penn Treebank tags, the CLAWS7 tags, and the top fifty words of the BNC are very useful for discriminating between prototypical and non-prototypical essays on this data set. The discriminatory ability of POS tags and the top fifty words of the BNC for automatic genre classification (as discussed in Chapter 2) is thus confirmed to some extent by the feature selection. Additional evidence for the usefulness of these two feature sets is presented in the form of results from three C- SV classifiers built on the text-length normalised data set, using the Penn Treebank tags, the CLAWS7 tags, and the top fifty words of the BNC respectively. The first classifier built on the Penn Treebank tags, using a linear kernel, and with the parameter C = 8.09375, yields a RB of 40.6 percent and a RG of 76.4 percent. This results in an AR of 58.5 percent, which is an 8.5 percent improvement on the average baseline. The second classifier built on the CLAWS7 tags, using a linear kernel and with the parameter C = 32, yields a RB of 50.0 percent and a RG of 63.6 percent. This results in an AR of 56.8 percent, which is a 6.8 percent improvement on the average baseline. The third classifier built on the top fifty words of the BNC, using a linear kernel and with the parameter C = 128, yields a RB of 50.0 percent and a RG of 61.8 percent. This results in an AR of 55.9 percent, which is a 5.9 percent improvement on the average baseline. These last two classifiers, built on the CLAWS7 tags and the top fifty words of the BNC, show an improvement on RB in comparison with the results obtained on the text- length normalised data set, using the full feature set. Used on their own, however, these RESULTS: TRAINING, EVALUATION AND INTERPRETATION 120 two feature sets do not have as high a discriminatory ability on the prototypical texts. It can thus be deduced that for this data set the CLAWS7 and the top fifty words of the BNC feature sets are useful for classifying non-prototypical texts, and that the Penn Treebank feature set is more useful for classifying prototypical texts. This is important for this project, as too many false positives would also be unhelpful for the user of such a classifier. This is so because a large number of false positives will result in many essays misclassified as 'bad'. This in turn will mean that more essays will be required to be marked by a senior marker. As the purpose of this classifier is to reduce the amount of unnecessary work for a senior marker, misclassifications that increase the amount of work for a senior marker will defeat the purpose of the classifier. The tests discussed above, which examined the discriminatory power of the Penn Treebank and CLAWS7 tags also provided partial evidence for the hypothesis put forth in Section 3.5.1. 
This hypothesis suggested that the CLAWS7 tagset, which extracts finer linguistic detail than the Penn Treebank tagset, might extract potentially useful discriminatory features. It suggested further that a classifier trained on this tagset would report more accurate results than a classifier trained on the Penn Tree tagset. If the results of the classifier trained on the CLAWS7 tagset are compared with the results of the classifier trained on the Penn Tree tagset, it can be seen that the discrimination ability for the non-prototypical examples is substantially improved by using the CLAWS7 tagset. As discussed above the classifier trained on the CLAWS7 tagset is better at classifying non-prototypical examples than the classifier trained on the Penn Treebank tagset, but worse at classifying prototypical texts. This implies that the hypothesis can be accepted, if better performance is defined by RB only. However, linguistically, it would be expected that a classifier trained on the CLAWS7 tagset would perform better than one trained on the Penn Treebank tagset in terms of both RB and RG. This is the intended interpretation of the hypothesis, which means that as the classifier trained on the CLAWS7 tagset did not perform better than the classifier trained on the Penn Treebank tagset for both classes, it can only be accepted partially. In addition to tests conducted, using different feature sets, tests were conducted, using the complete feature set but with a reduced selection of cases. These cases contain only those essays with upper- and lower-band marks, which were selected according to the CHAPTER 4 121 procedure detailed in Section 3.6, resulting in 263 cases for training and testing. The results of the classifier trained and tested, using this reduced set of cases, using a linear kernel and with the parameter C = 0.5, yields a RB of 48.3 percent and a RG of 55.9 percent. This yields an AR of 52.1 percent, which is 2.1 percent improvement on the average baseline. This can be compared with the results obtained for the classifier trained and tested on the text-length normalised data set, using the full case set (346 instances). Recall from the first results presented in this section, that this classifier has a RB of 46.9 percent and RG of 72.7 percent, yielding an AR of 59.8 percent. This shows that on RB the classifier trained and tested on the upper- and lower-band essays outperformed the full case set classifier on the RB values by 1.4 percent. This result is difficult to interpret, as on the one hand it may mean that the hypothesis presented in Section 3.6, which proposed that the linguistic differences of 'average' grade essays are not sufficiently discriminant to determine prototypical or non- prototypical essays, is to be accepted. One the other hand, the RG on the classifier trained and tested on the upper- and lower-band essays is 16.8 percent worse than the RG on the classifier trained and tested on the full case set. This implies that the hypothesis is to be rejected. Furthermore, the results of the classifier trained and tested on the upper- and lower-band essays are being compared with a classifier trained and tested on more instances. Naturally, the larger the case set the more likely it is that the classifier will classify training and testing instances more accurately. 
This idea was examined by training and testing ten classifiers, using the text-length normalised data set, on the complete feature set and a linear kernel, but with successive classifiers trained and tested on 10 percent more of the case set each time. This meant that the results for classification on 10 percent, 20 percent, 30 percent, 40 percent, 50 percent, 60 percent, 70 percent, 80 percent, 90 percent, and 100 percent of the case set were available for comparison. (These sets were all sampled using stratified random sampling, implemented in STATISTICA (Statsoft, 2006). This technique samples both the 'good' and 'bad' examples randomly, while retaining the same class proportions as those of the complete set; it is never less representative and often more representative than simple random sampling (Biber, 1993b:244).) As previously mentioned, the classifier trained and tested on the upper- and lower-band grade instances was trained and tested on 263 instances. This corresponds to 76.0 percent of the case set and means that these results can be reasonably compared with the results of the classifier trained and tested on 80 percent of the case set. The results of this classifier, using the parameter C = 0.5, yield a RB of 44.8 percent and a RG of 64.4 percent. This yields an AR of 54.6 percent, which is 4.6 percent above the average baseline. The performance shows a 3.5 percent increase in RB for the classifier trained and tested on the upper- and lower-band grade cases. However, this classifier also performed more poorly than the classifier trained and tested on 80 percent of the case set in terms of RG. Again, this provides mixed evidence as to whether the performance of the classifier trained and tested on the upper- and lower-band grade cases was necessarily due to case set size.

It is possible that the results on both these classifiers are due not to overall case size but rather to the number of texts for each class. In comparison, the 80 percent case set contains 125 training and 45 testing examples of 'good' texts, and 94 training and 29 testing examples of 'bad' texts, whereas the data set using the upper- and lower-band grade cases has 96 training and 34 testing examples of 'good' texts, and 97 training and 29 testing examples of 'bad' texts. This means that the second classifier had fewer instances upon which to learn 'good' examples, but more instances upon which to learn 'bad' examples. This may partly explain the results on RG and RB.

Evidence against the hypothesis presented in Section 3.6, that the linguistic differences of 'average' grade essays are not sufficiently discriminant to determine prototypical or non-prototypical essays, is given if the particular types of mistakes made during training and testing on the full case set are considered. Fifteen out of the thirty-two mistakes made on the testing set are not expected according to the hypothesis, that is, the misclassified essays fall decidedly in the top band of prototypical argumentative essays and in the lower band of the non-prototypical argumentative essays. Similarly, forty out of the ninety-six mistakes made on the training set are not expected. That 46.9 percent of the mistakes for the testing set and 41.7 percent for the training set are not expected is additional evidence against the hypothesis that essays with middle-band marks are less easily distinguishable.
This would seem to imply that either the features used do not adequately extract the linguistic differences between the essays, or that more training data are required. This is addressed further in Section 4.5. CHAPTER 4 123 After testing, using different cases, tests were conducted using a RBF kernel rather than the linear kernel. The RBF kernel was used because it is one of the most commonly used kernels for SVM classification (Hechter, 2004:56; also Statsoft, 2004) and is recommended as a first choice by Hsu et al. (2003:4). The results of this C-SV classifier built on the text-length normalised data set, using the complete feature set, with the parameter C = 31.8125 and with the parameter y=0.013139006, have a RB of 46.9 percent and a RG of 80.0 percent. This yields an AR of 63.4 percent, which is a 13.4 percent improvement on the average baseline. These results are better overall than the results obtained on the classifier trained on the text-length normalised data set, using the complete feature set but with a linear kernel because the former classifier achieved a RG of 7.3 percent more than the latter classifier. Finally, tests were conducted, using v-SV classification rather than C-SV classification and using the linear and RBF kernels. The results of the v- SV classifier trained on the text-length normalised data set, using the complete feature set, a linear kernel and with the parameter v = 0.08, have a RB of 46.9 percent and a RG of 70.9 percent. This yields an AR of 58.9 percent, which is 8.9 percent above the average baseline. In comparison with the C- SV classifier trained on the text-length normalised data set, using the complete feature set and a linear kernel, the RB is equal but the RG is less by 1.8 percent. The results of the v- SV classifier trained on the text-length normalised data set, using the complete feature set, a RBF kernel and with the parameter v = 0.08, have a RB of 50.0 percent and a RG of 43.6 percent. This yields an AR of 46.8 percent, which is 3.2 percent below the average baseline. This classifier performed better in RB than either the C- SV classifier trained on the text-length normalised data set, using the complete feature set and a RBF kernel, or the C-SV classifier trained on the text-length normalised data set, using the complete feature set and a linear kernel. However, the RG of the v-SV classifier is very poor: it is 36.4 percent below the RG of the C-SV classifier trained on the text-length normalised data set, using the complete feature set and a RBF kernel. Furthermore, it is also 29.1 percent below the RG of the C-SV classifier trained on the text-length normalised data set, using the complete feature set RESULTS: TRAINING, EVALUATION AND INTERPRETATION 124 and a linear kernel. As previously discussed, such a low RG will defeat the purpose of the classifier. The next section analyses why the results of the best classifier did not reach 100 percent recall of the 'bad' and 'good' examples.100 4.5 Analysis of the classifier's performance From the results and discussion presented in Section 4.4 above, it is apparent that the best results, those of the classifier trained on the text-length normalised data set, using feature selection, a linear kernel and C = 32, are quite good in terms of both RB, as well as the average and testing baselines. Furthermore, if judged in the light of the difficulty of the classification problem, which examines a very subtle genre type, these results are very good indeed. 
(The code for this classifier can be found on the accompanying CD.) However, it is obviously still desirable for the classifier to have 100 percent recall of 'bad' and 'good' examples. This section suggests seven potential explanations for why the classifier did not perform more accurately and proposes several ways to improve on the classifier's performance. The following reasons are suggested:
1. The argumentative essays are not truly representative of non-prototypical examples.
2. The training set size could have been too small.
3. Cross-domain classification could have reduced accuracy.
4. Personal language differences could have confused classification.
5. The features used may not have adequately extracted the linguistic differences between the classes.
6. The feature selection might not have been adequate.
7. More knowledge of the discourse 'struggle' might need to have been considered.

The first reason suggested for the classifier's performance is that the argumentative essays are not truly representative of non-prototypical examples. This reason is listed first because it is the most likely explanation for the classifier not being able to identify non-prototypical examples as easily as prototypical examples. Unfortunately, the essays this project uses are actually considered instances of good student academic writing (and thus argumentative essays) by the BAWE corpus compilers. The lowest grade for the essays in the corpus is 60 percent, which in comparison with a grade of 80 percent can be considered poor but in actuality is average rather than poor. This by no means implies that the whole corpus is average, as the majority of the essays are in the region of 68 percent and above. However, it is likely that the classifier would be better able to distinguish between the two classes, using the current features, if the classes were more readily distinguishable, that is, if essays with fail marks made up the non-prototypical case set. This problem could not be rectified for this study, as very few L1 English argumentative student essays are easily available for study. Furthermore, where such corpora are available, they do not always have the necessary features for training. For example, the LOCNESS (Granger, 1994) has had quotations removed from the texts, an important feature used in this research project.

The second possible reason for the classifier not having achieved 100 percent RB and RG is a concern that was already raised in Section 2.3.2, that of adequate data set size. It is possible that the data set that was used to train the classifier could have been too small to allow optimal performance. In other words, the classifier may require more learning examples in order to generalise well. It is often the case that accuracy is a function of the size of the training set. For example, Stamatatos et al. (2000b:491) show that the classification accuracy of their problem does improve with more training data. The best way of establishing whether accuracy is a function of training size is to train the classifier on an increasing number of texts. However, this approach requires the same testing set throughout, which was not possible with the SVM implementation used for this project, because the training and testing examples are selected randomly by the algorithm.
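If such an implementation-independent check were wanted, it could be sketched along the following lines. The sketch assumes scikit-learn rather than the STATISTICA module used here, placeholder data in place of the real feature matrix and labels (encoded 1 = 'bad', 0 = 'good'), and the parameter settings of the best classifier; the point is simply to keep the test set fixed while the training portion grows.

```python
# A hedged sketch of training on an increasing number of texts while holding
# the test set fixed (something the STATISTICA implementation did not allow).
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)
X = rng.normal(size=(346, 812))        # placeholder feature matrix
y = rng.integers(0, 2, size=346)       # placeholder labels: 1 = 'bad', 0 = 'good'

# Fixed, stratified 25 percent test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=2000)

for fraction in np.arange(0.1, 1.01, 0.1):
    n = int(round(fraction * len(X_train)))
    if n < len(X_train):
        # Stratified subsample of the training portion
        X_sub, _, y_sub, _ = train_test_split(
            X_train, y_train, train_size=n, stratify=y_train, random_state=2000)
    else:
        X_sub, y_sub = X_train, y_train
    clf = SVC(kernel='linear', C=32).fit(X_sub, y_sub)
    rb = recall_score(y_test, clf.predict(X_test), pos_label=1)   # recall of 'bad'
    print(f"{int(round(fraction * 100)):3d}% of training data -> RB = {rb:.3f}")
```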
In Section 4.3, an evaluation indicator was introduced in the form of performance that improves on the accuracies obtained by RESULTS: TRAINING, EVALUATION AND INTERPRETATION 126 previous projects, on a particular task. It was noted that this means of evaluation could not be used directly for this research project as there is no previous work of the exact nature to which to make adequate comparison. However, it is clear that previous studies have provided some guidance in terms of features to be used, data transformation, and machine learning techniques. These studies were selected on the basis of their overall high performance. One study to which more direct comparison can be made in terms of evaluation of results is Finn's (2002) study. This is for two reasons, first because it is a two-class classification problem, as with this study, unlike the majority of all the studies reviewed in Chapter 2, which are concerned with multi-class classification. And second, because it is a more subtle classification task, more like the task of this project, than for example the letters and poems of Kelih et al.'s (2005) study, which are two genres that differ so widely as to render classification errors unlikely. This latter type of genre classification task is fairly representative of many of the other types of genres examined in the studies reviewed in Chapter 2. In addition to suitability because of classes and type of classification, Finn's (2000) study also focuses specifically on cross-domain performance, something that is not an explicit focus of any other study, as discussed in Section 2.3.2. As previously remarked, this focus on cross-domain accuracies is relevant to this research project where all the data are drawn from and classified across domains. It is thus expected that because the texts are drawn from across very different domains, accuracy will be negatively affected (see Finn, 2002:86). The idea that cross-domain classification adversely affects classification is thus put forth as an explanation of why the best SVM classifier did not achieve 100 percent recall of 'bad' and 'good' examples. Further evidence can be provided in this regard by Karlgren and Cutting's (1994:1072) poor results on the learned/humanities subset. The texts in this subset were instances of academic prose written within disciplines in the humanities faculty. As this project made use of texts of this nature, drawn not only from various disciplines in the humanities faculty but also from across all faculties except Engineering (see Section CHAPTER 4 127 3.2), this provides further evidence for cross-domain accuracy reducing classification accuracy. Wolters and Kirsten (1999) provide strong evidence that different word types (function and content words, versus function words only) are not equally discriminatory for texts from the humanities and from science and technology. This too shows that accuracy, particularly in academic discourse, is negatively affected by cross-domain classification. The fourth explanation is that personal language differences could have confused classification. Van Halteren, Baayen, Tweedie, Haverkort and Neijt (2005) argue for the existence of a human stylome, that is, idiosyncratic stylistic differences in writing that varies from individual to individual. 
They prove that a human stylome exists in the framework of authorship attribution studies, and find that "the differences between the 'personal' language versions of even nonspecialist writers are greater than expected so far" (2005:7.3). As the essays used in this research project are written by many different authors, it is also quite possible that the individual stylistic variations, as established by Van Halteren et al. (2005), could play a role in classification. In other words, differences that are not simply indicative of prototypical or non-prototypical examples but also of personal language styles were also extracted. Naturally, this would result in lower classification accuracy in terms of this classification task. The fifth potential explanation for why the classifier did not perform more accurately is that the features used could not have adequately extracted the linguistic differences between the classes. This is the most obvious and simple explanation, because identifying good discriminant features is one of the main concerns of automatic genre classification. Since the choice of highly discriminant features is central to automatic genre classification, solutions to selecting better features are important for future work. Therefore, suggestions in the form of different kinds of features that could be added, as well as different combinations of current features will be made in Chapter 5, as recommendations for future work. RESULTS: TRAINING, EVALUATION AND INTERPRETATION 128 The sixth explanation is related not to overall feature choice, but rather to the selection of a subset of features from the main feature set. It is suggested that the feature selection that was made, might not have been adequate (see Yang & Pedersen, 1997 for more on other feature selection techniques). The seventh explanation is that more knowledge of the discourse 'struggle' might need to have been considered. Recall that in Chapter 1 the stance of this research project towards student academic writing entails viewing academic writing within disciplines as sites of discourse and identity struggle. Nystrand (1987:204) claims that any good text analysis needs to distinguish between "the structure of argument and the structure of communication". He explains these two notions in terms of the metaphor of a 'contract' writers have with readers. This 'contract' requires writers to establish shared knowledge, contextualise new information, and mark text boundaries, in order to indicate conceptual and narrative shifts. This project attempted to extract relevant information relating to the linguistic features of argumentative essays. It was done with the assumption that such features of writing are indicative of successful essays. It is quite possible, however, that in addition to this success in structuring argument, there may be evidence of a communicative role with the reader-marker, which has not been extracted and could provide more accurate classification. Furthermore, Swales (1990:54) cautions that knowledge of the underlying logic of gatekeeping (discourse norms) is also very important and that both surface features and underlying discourse norms are crucial to communicative success (and perhaps entry into the discourse community). As this project examined only surface features, it is possible that further knowledge and incorporation of the discourse norms and context would have allowed for more accurate classification of the essays. 
Attempting to trace and extract the structure of communication, as well as incorporating discourse contexts will require further research; this may prove troublesome, for Nystrand (1987:205) claims that a writer must strike "an effective balance between what needs to be said and what may remain unsaid". The next section provides a summary of the main results of the classifier, as well as the main conclusions reached regarding the classifier's performance. CHAPTER 4 129 4.6 Summary This chapter began by establishing the importance of selecting learning parameters, in order to obtain high generalisability. To this end, an overview of the grid search values used to determine the best learning parameters for this project was presented. As grid searches are often conducted in the context of v-fold cross-validation, this validation technique was elucidated. Furthermore, some theoretical concerns about the validity of this technique were delineated and alternatives to this technique were put forth, following Scholkopf and Smola (2002:217). Counters to why these suggestions could not be followed for this research project were also discussed. In the following section, three potential data concerns were addressed. The first was the problem of imbalanced data sets. A solution to this in the form of a penalty for the 'good' examples was provided, in order to prevent the LM assigning the label of the larger class ('good') to any particular text simply because there are more 'good' examples. The second problem that was addressed was differing misclassification costs. This refers to the relative severity of different misclassifications. It was placed in light of the difference in errors for this project, where the cost for classifying a 'bad' example as 'good' is more serious then classifying a 'good' example as 'bad'. The problem could not be resolved for this research project because the SVM package used offers SVM classification in the standard situation. The third problem that was raised was the normal distribution assumption of the data set. The problem is related to the SVM learning method, which assumes that both training and testing data are drawn from the same distribution, but which does not assume that this distribution is normal. The assumption established that the data are not required to be normally distributed. From this, it was hypothesised that data normalised by way of the logarithmic transformation would affect the classifier's results. Next, the following evaluation measures were presented: error rate, accuracy, the accuracy baselines, improved performance, precision, recall, the average recall baseline, RESULTS: TRAINING, EVALUATION AND INTERPRETATION 130 and the F-measure. In this section, it was shown that the recall of 'bad' examples is the most relevant measure for this research project. The results were then presented in terms of this measure in the following section. The results of the classifiers trained on various data and feature sets, using C- and v- SV classification, and the linear and RBF kernels were compared. The best performance of 62.5 percent RB was reported on a 75 percent training set and 25 percent testing set. This C- SV classifier was trained on the text-length normalised data set, using feature selection and a linear kernel. In the comparisons drawn between the results of the different classifiers, various hypotheses were addressed. These hypotheses were raised in Chapters 3 and 4. 
The first hypothesis was that text-length normalised frequencies would yield better results as frequencies would be less skewed. However, it was found that the text-length normalised frequencies, although linguistically and statistically motivated, appeared to yield no better results than the raw frequencies. The raw data set results were also easier to compute, as no feature selection was required, unlike that of the text-length normalised data set. The second hypothesis was directly assumed from the distribution assumption addressed earlier. It is expected that, because the data set is not assumed to be normal, normalisation of the data should not affect the classifier's results. If evaluated in terms of AR the results of the classifier trained on the text-length normalised data set and the classifier trained on the log-transformed data set are approximately the same. However, RB values of the two classifiers differed, which meant that the hypothesis could not be accepted or rejected. The third hypothesis was that the classifier trained on the CLAWS7 feature set would outperform the classifier trained on the Penn Treebank feature set because the CLAWS7 tagset extracts finer linguistic detail, than the Penn Treebank tagset. This hypothesis was only partially accepted because, on the one hand, the recall of 'bad' examples was much improved by using the CLAWS7 tagset, which is the most relevant measure of CHAPTER 4 131 evaluation for this project. On the other hand, the recall of the 'good' examples was higher when the Penn Treebank tagset was used. The fourth hypothesis put forth was that the middle-grade essays are linguistically harder to distinguish and that the removal of these essays would improve the accuracy of the classifier. This hypothesis could not be rejected or accepted because, although the classifier trained and tested on the set of middle-grade essays outperformed the classifier trained and tested on all the essays on RB, it was outperformed by the classifier trained on all the essays on RG. That sample size could have an affect on learning, was tested. This was disproved by tests conducted on a similar size data set, with instances of upper-, middle- and lower-band essays. Rather, it was found that the number of instances of each class that the classifier was trained and tested on played a role in classification performance. Furthermore, the types of errors made on the classifier trained and tested on all the instances were examined. This showed that errors made on 46.9 percent of the testing set and 41.7 percent of the training set were not expected, that is they were errors made on essays that were clear cases of upper- and lower-band grades. In addition to examining these hypotheses, feature selection was also explored. In Chapter 3, it was stated that feature selection does not appear to be as essential to SVM learning as to other machine learning techniques. This was not invalidated here, but rather, because of the favourable results obtained, using feature selection, evidence in favour of feature selection was presented. In the final section of this chapter, the best classifier's performance was analysed and seven potential explanations were provided to account for the less than 100 percent recall. 
In summary, these reasons were that the 'bad' argumentative essays are not truly representative of non-prototypical examples, the training set size could have been too small, cross-domain classification could have reduced accuracy, personal language differences could have confused classification, the features used may not have adequately extracted the linguistic differences between the classes, the feature selection may not have been adequate, and more knowledge of the discourse 'struggle' might have needed to be considered.

Chapter 5 will provide a conclusion to this dissertation by presenting a review of the findings of this research project. Thereafter, implications of this study for genre classification and academic writing will be discussed, and in light of these, recommendations for future research will be made.

CHAPTER 5
Conclusion and recommendations

A picture says more than a thousand words, and for me the issue is how to listen to the picture the text paints, without being distracted by its words (Karlgren, 2000:131)

5.1 Introduction

Chapter 5 provides a conclusion to this dissertation. This chapter commences with a detailed summary of the preceding chapters in Section 5.2. In this section, the research questions posed in Chapter 1 are discussed, in order to determine whether this study has answered them adequately. Thereafter, in Section 5.3, the results and findings presented in Chapter 4 are discussed, with particular reference to the hypotheses presented in Chapters 1, 3, and 4. The acceptance or rejection of these hypotheses, as discussed in Chapter 4, is also reported in this section. Section 5.4 concludes this chapter with recommendations for further studies on genre classification and academic writing, which are made on the basis of the findings of this study.

5.2 Summary of chapters

In order to determine whether an automatic genre classification approach could be used for the classification task of this study, a detailed review of the state-of-the-art of this approach was provided in Chapter 2. The notions of machine learning and supervised learning, necessary to understanding the perspective of the classification task that this research project takes, were introduced along with the basic notation used for the machine learning process. These concepts were placed in the framework of automatic genre classification and the genre task of this project. Thereafter, the use of genre in this project was elucidated and the genre examined in this project explained. Once the preliminaries to understanding the nature of the classification task of this research project were established, studies in automatic genre classification were reviewed. This review was in three parts. In the first section, Biber's (1988) language variation study, which forms the background to studies in automatic genre classification, and two seminal works in the field of automatic genre classification were reviewed. In the second section, relevant contemporary automatic genre classification studies were reviewed. These studies were deemed relevant to this research project regarding application, corpus, features, or methodology. The last three aspects are essential to any text classification task; therefore, these three aspects were detailed for each study.
This was done for two reasons: first, to determine contemporary practice in the field of automatic genre classification, which partly established whether an automatic genre classification approach would suit the classification task of this research project. Second, to determine the potential value of these studies for informing the features, methodology and interpretation for this research project through critical assessment of the validity of pre-defined genre classes, results, evaluation measures, and the features of each study. This partly answered the second research question, by determining potential linguistic features that could be easily computed and used for this research project. The third section of the review reported on automatic genre classification studies that use SVMs for the purposes of genre classification. Studies using this technique were reviewed in a separate section because this machine learning technique had demonstrated the best results in comparison with other techniques in automatic genre classification tasks. In this manner, the expected performance of SVMs, as a machine learning technique, for this research project was determined and the fourth research question was thus partly addressed. In Chapter 3, the features and machine learning technique used in this research project were reported. It provided the background to the corpus and the data derived from the corpus. The features presented in this chapter were selected from studies reviewed in Chapter 2 and research in the field of academic discourse. They were deemed potentially relevant as good predictors of prototypical or non-prototypical argumentative essays. This provided the foundation for Chapter 4, and thus began to address the first research question. The processes of text preparation, annotation of the features, removal of multiple instances of features, data transformation, and pre-classification exploration of the feature set were also detailed in Chapter 3. The texts in the corpus were prepared before feature extraction by removing formatting, essay questions, essay titles, bibliographies, appendices, headings, footnotes, graphs, illustrations, tables, some punctuation, and equations. The character set of the texts was converted, in order to be compatible with the language models used in the SVMTagger (Gimenez & Marquez, 2004). In addition to this preparation, apostrophes and quotation marks were also standardised, in order to remove any ambiguity of single closing quotation marks and apostrophes, and to separate actual citations from play titles and book titles. Lastly, the texts were tokenised. The corpus was POS tagged, using SVMTagger (Gimenez & Marquez, 2004) and the CLAWS4 tagger (Garside, 1987:30). These two taggers were used to extract POS tags from the Penn Treebank and the CLAWS7 POS sets. This was done to determine whether coarse or fine POS distinctions would be more helpful as genre-revealing features. Next, sentences, paragraphs, quotations, references, punctuation marks, nominalisations, two-word complex prepositions, and three-word complex prepositions were marked up, using XML tags. This was done in order to extract features that could not be directly extracted from the text. Features were extracted, using STATISTICA Text Mining and Document Retrieval (Statsoft, 2006).
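Purely as an illustration of the kind of surface counts involved, and not of the STATISTICA Text Mining and Document Retrieval procedure actually used, a minimal Python sketch follows; the nominalisation suffixes and complex prepositions listed are assumed examples, not the project's full feature definitions.

# Illustrative sketch of simple surface counts of the kind described above
# (not the STATISTICA procedure used in this project). The suffix and
# preposition lists below are assumptions for illustration only.
import re
from collections import Counter

NOMINALISATION_SUFFIXES = ('tion', 'ment', 'ness', 'ity')           # assumed subset
TWO_WORD_PREPOSITIONS = {'according to', 'because of', 'prior to'}   # assumed subset

def surface_counts(text: str) -> Counter:
    counts = Counter()
    words = re.findall(r"[A-Za-z']+", text.lower())
    counts['words'] = len(words)
    counts['sentences'] = len(re.findall(r'[.!?]+', text))
    counts['nominalisations'] = sum(w.endswith(NOMINALISATION_SUFFIXES) for w in words)
    for prep in TWO_WORD_PREPOSITIONS:
        counts['two_word_prepositions'] += text.lower().count(prep)
    return counts

print(surface_counts("According to the results, the classification of essays "
                     "requires careful assessment. Prior to training, texts were cleaned."))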
Thereafter, the value of the dependent variable was standardised to percentage grading, and the data were cleaned by removing multiple occurrences of terms and thus double counts of single features. Next, the data were normalised to text length (as measured in words) and transformed using logarithmic transformation and inverse document frequency transformation. This resulted in four data sets: a set of raw data frequencies, a set of text-length normalised frequencies, a set of log-transformed frequencies, and a set of inverse document frequency transformed frequencies. Chapter 3 also addressed the process of feature selection, which removed potentially irrelevant features by selecting those features that contribute most to the classification of the texts. This feature selection partly addressed the first research question. The chapter concluded by presenting SVMs for the linearly separable and the non-linearly separable case. In this section, it was shown that, in theory, this machine learning technique provides good generalisability and is manually undemanding, thus partially addressing the fourth research question.

The research questions that were only partially addressed in Chapters 2 and 3, as discussed above, were fully attended to in Chapter 4. This chapter established the importance of selecting learning parameters, in order to obtain high generalisability. As grid searches are usually used to select the optimal parameters (Eitrich & Lang, 2006:428), this project too conducted a grid search of different values, in order to determine the learning parameters that provided the best performance. To this end, an overview of the grid search values used to determine the best learning parameters for this project was presented. The performance of each of the values of the grid search can be tested using v-fold cross-validation or a separate training and testing set. The former method is the most usual method of determining the best learning parameters; therefore, this process was elucidated (Bennett & Campbell, 2000; also Scholkopf & Smola, 2002:217). Moreover, theoretical concerns about the validity of v-fold cross-validation were raised and two alternatives to this technique were suggested, following Scholkopf and Smola (2002:217). These suggestions, however, could not be followed for this research project. This chapter also raised and addressed three potential data problems that concern this research project. The first problem, that of imbalanced data sets, is relevant because the 'bad'/'good' essay ratio is 137/209, which could result in the classifier classifying any particular instance as 'good' simply because there are more examples of 'good'. This problem was solved, using a classification penalty on the 'good' examples. The second problem was that although the two kinds of misclassification are not equally severe, the SVM classifier treats the costs of both these misclassifications as equal. This problem could not be solved for the SVM implementation used in this research project. The third data problem was that SVMs assume the training and testing data to be drawn from the same distribution, but that this distribution is not assumed to be normal. This led to a hypothesis, which is discussed in Section 5.3, regarding the normal distribution of the data. The following evaluation measures were presented next: error rate, accuracy, accuracy baselines, improved performance, precision, recall, the average recall baseline, and the F-measure.
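As a concrete illustration of how these measures relate to one another, the following minimal sketch computes the recall of each class, the average recall, precision, and the F-measure from a small set of invented predictions; the label vectors are placeholders and are not the project's results.

# Illustrative computation of the evaluation measures listed above.
# The label vectors are invented and are not the project's actual results.
def recall(y_true, y_pred, label):
    relevant = [p for t, p in zip(y_true, y_pred) if t == label]
    return sum(p == label for p in relevant) / len(relevant)

def precision(y_true, y_pred, label):
    predicted = [t for t, p in zip(y_true, y_pred) if p == label]
    return sum(t == label for t in predicted) / len(predicted)

y_true = ['bad'] * 8 + ['good'] * 12
y_pred = ['bad'] * 5 + ['good'] * 3 + ['bad'] * 3 + ['good'] * 9

rb = recall(y_true, y_pred, 'bad')      # recall of 'bad' examples (RB)
rg = recall(y_true, y_pred, 'good')     # recall of 'good' examples (RG)
ar = (rb + rg) / 2                      # average recall (AR)
pb = precision(y_true, y_pred, 'bad')   # precision on 'bad' examples
f_measure = 2 * pb * rb / (pb + rb)     # F-measure for the 'bad' class

print(f"RB={rb:.3f}  RG={rg:.3f}  AR={ar:.3f}  F(bad)={f_measure:.3f}")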
In light of the purpose of the classification task of this project, the most suitable accuracy measure was discussed. For this project, correct classification of the non-prototypical examples is the most important; therefore, the recall of 'bad' examples was the measure used to evaluate the classifier's performance. Furthermore, Chapter 4 reported on and drew comparisons between the results of the various classifiers trained and tested on different data and feature sets, using C- and v-SV classification, and linear and RBF kernels. In this chapter, it was shown that C-SV classification and the linear kernel showed better results than v-SV classification and the RBF kernel. For this reason, all further tests were conducted using C-SV classification and the linear kernel. In this chapter, the various hypotheses that were raised in Chapters 3 and 4 were discussed. Finally, the best classifier's performance was analysed and seven possible reasons were suggested to explain the results of this classifier. The results of the classifiers, the hypotheses, and explanations for the classifier's performances are presented in Section 5.3, below. In this section, the research questions posed in Chapter 1 are fully answered.

5.3 Summary of results and findings

As mentioned above, Chapter 4 detailed the results of all the classifiers trained, using v- and C-SV classification, linear and RBF kernels, as well as various data, feature and case sets. Tests on the different data, feature, and case sets were motivated by hypotheses postulated in Chapters 3 and 4. The first hypothesis was that normalising the features to text length would remove skewness from the frequencies and would thus yield better results than the raw frequency set. Contrary to this, the classifier trained on the text-length normalised data set yielded the same results as the raw data set, if feature selection was used on the former set. Without feature selection, the text-length normalised data set yielded poorer results than the raw data set. It is possible that text-length normalisation would not produce any perceptibly different classification results from the raw frequencies if the texts were of similar length. This was not the case for this project, as the essays' lengths ranged from approximately 800 to 4000 words. It would thus appear that, although normalisation for text length is linguistically and statistically motivated, raw frequencies could be used without normalisation.

The second hypothesis was drawn directly from the data distribution assumption of SVMs, as mentioned in Section 5.2 above. As discussed in Section 3.6, the log-transformed data set was transformed, using logarithmic transformation, in order to transform non-normal data to approximate a normal distribution. It was expected that, because the data set was not assumed to follow a normal distribution, normalisation of the data would have no effect on the classifier's results (on 'good' and 'bad' examples). The results on the text-length normalised data set and the log-transformed data set showed approximately the same AR values, which seems to imply that normalisation had not affected performance. Nevertheless, the RB values and RG values of the classifiers differed. As a result, the second hypothesis could not be accepted or rejected, because this experiment could not provide conclusive evidence that the normality of the distribution does not affect the results of SVM learning.
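For concreteness, the following sketch illustrates the kinds of transformations compared in these two hypotheses. The counts, essay lengths, and the exact formulas (per-1000-word rates, log(1 + f), and f x log(N / df)) are assumptions for illustration; the project's own transformation settings are not restated here.

# Sketch of the four data sets: raw counts, text-length normalised counts,
# log-transformed counts, and inverse-document-frequency weighted counts.
# The numbers and the specific formulas are illustrative assumptions only.
import numpy as np

raw = np.array([[12.0, 0.0, 3.0],                   # rows: essays, columns: feature counts (invented)
                [40.0, 5.0, 0.0],
                [ 7.0, 2.0, 1.0]])
text_lengths = np.array([800.0, 4000.0, 1200.0])    # essay lengths in words (invented)

normalised = raw / text_lengths[:, None] * 1000     # frequency per 1000 words of text
log_transformed = np.log1p(raw)                     # logarithmic transformation, log(1 + f)
doc_freq = (raw > 0).sum(axis=0)                    # number of essays containing each feature
idf_weighted = raw * np.log(len(raw) / np.maximum(doc_freq, 1))   # f x log(N / df)

for name, data in [('raw', raw), ('normalised', normalised),
                   ('log', log_transformed), ('idf', idf_weighted)]:
    print(name, np.round(data, 2), sep='\n')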
The third hypothesis was that a classifier trained on a tagset that extracts finer linguistic detail (the CLAWS7 tagset) would perform more accurately than one built on the Penn Treebank tagset. Comparison of the RB values of the classifier trained on the CLAWS7 tagset and the classifier trained on the Penn Treebank tagset showed that the classifier trained on the CLAWS7 tagset outperformed the classifier trained on the Penn Treebank tagset. Conversely, comparison of the RG values of the classifier trained on the Penn Treebank tagset and the classifier trained on the CLAWS7 tagset showed that the classifier trained on the Penn Treebank tagset outperformed the classifier trained on the CLAWS7 tagset. Hence, it could be deduced that the CLAWS7 tagset was more discriminant of non-prototypical argumentative essays and the Penn Treebank tagset more discriminant of prototypical argumentative essays. As with the second hypothesis, the third hypothesis was intended to be interpreted in terms of better classification of both classes. Therefore, the third hypothesis could only be partially accepted.

The fourth hypothesis put forth was that middle-grade essays are linguistically harder to distinguish, and that, therefore, the differences in the linguistic features used in this project were not sufficiently discriminant to determine whether texts were instances of prototypical or non-prototypical argumentative essays. Therefore, it was postulated that the removal of middle-grade essays would improve the classifier's performance. The RB values demonstrated that the classifier trained and tested on the reduced cases was not outperformed in RB by the classifier trained and tested on the full case set. It would thus appear that the hypothesis should be accepted. However, the experiment could not be considered conclusive without further evidence, because the results on RG showed that the classifier trained and tested on the full case set outperformed the classifier trained and tested on the reduced cases, and because the two classifiers were trained and tested on different numbers of observations. Training sample size, first raised in Chapter 2, was a concern because a training sample should be linguistically representative of the target population and the classifier should have equal experience of both classes. As a result, the larger the data set, the more likely accurate classification would be. The effect of sample size was examined by training a classifier, using the text-length normalised data set, on the complete feature set and a linear kernel, but with 80 percent of the case set. This classifier could be reasonably compared to the classifier trained and tested on the upper- and lower-band grade instances, as it was trained and tested on 76 percent of the case set. From the RB values it could be seen that the classifier trained and tested on 76 percent of the case set showed a better performance than the classifier trained and tested on 80 percent of the case set. Conversely, from the RG values, it could be seen that the classifier trained and tested on 76 percent of the case set showed a poorer performance than the classifier trained and tested on 80 percent of the case set. This did not suffice to provide evidence that the performance of the middle-grade case set classifier was not due to data set size. The sizes of the two classes that each classifier was trained and tested on were then examined.
The 80 percent case set contained 125 training and 45 testing instances of 'good' examples, and 94 training and 29 testing instances of 'bad' examples. The middle-grade case set had 96 training and 34 testing instances of 'good' examples, and 97 training and 29 testing instances of 'bad' examples. Thus, it was suggested that the results of the two classifiers trained using the different case sets could be explained in terms of the number of examples of each class that they were trained and tested on.

Evidence against the fourth hypothesis was given by the particular types of errors made in the training and testing on the full case set. According to this hypothesis, fifteen out of the thirty-two errors on the testing set were not expected, that is, these misclassified essays fell unequivocally into the top band of prototypical essays and into the lower band of the non-prototypical essays. Similarly, forty out of the ninety-six errors made on the training set were not expected. As 46.9 percent of errors on the test set and 41.7 percent of errors on the training set were not expected, this provided evidence for rejecting the fourth hypothesis.

In addition to these hypotheses, feature selection was also explored. In Chapter 3, it was stated that, as SVMs can learn well in high dimensions, feature selection does not appear to be essential to this type of learning. This was not shown to be false here, but rather, evidence in favour of careful feature selection was presented because of the good results obtained using feature selection. It appeared to be the case that for SVM learning it was not too many features, but rather too many irrelevant features, that could produce poorer results. As previously mentioned, it was difficult to determine a priori which features were irrelevant. A good case, however, was made in this research project for exploratory techniques, such as cluster analysis and decision trees (as used by Wolters & Kirsten, 1999), before attempting final classification tests. The good results on the raw data set equalled the best results obtained on the C-SV classifier trained on the text-length normalised data set, using feature selection, a linear kernel and the parameter C = 1.953125, in terms of RB and RG on the testing data set. The major difference between these two classifiers was their training results. As mentioned above, training on the raw data set with the complete feature set resulted in a 100 percent training accuracy. This contrasted greatly with the results of training on the text-length normalised data set with feature selection. The latter classifier resulted in a training accuracy of only 62.9 percent, which is 3.4 percent above the baseline but 37.1 percent below the training accuracy achieved on the raw data set. This provided further evidence for using the raw data set with no feature selection.

In Chapter 4, it was reported that although the testing accuracy was much improved by using feature selection, training accuracy was reduced (this was judged in comparison with the results on the raw data set, and could also be seen in the results of the classifier trained on the full feature set, using the text-length normalised data set). The training result showed that the removal of the rest of the feature set was detrimental to training, as the classifier trained on the reduced feature set was shown to underfit the training data.
This provided evidence in favour of the feature set, as a whole, that was selected for this classification task, as it appeared that these features were all valuable to some extent in discriminating between the essays. Thus, this research project provided evidence in support of Dewdney et al. (2001:7) and Finn (2002:76), who find that both BOW and linguistic features can be used well together. Moreover, this project confirmed that the BOW approach, if used with more careful word selection, could be useful and did not need to be viewed as opposing a linguistic feature set (this was not presented as such but is evident in Argamon & Dodick, 2004a; and Santini, 2005b). As feature selection and particular feature selection techniques are not guaranteed to select only those features that are relevant to the classification task, the selected features must be interpreted with caution and inferences made tentatively. It was pointed out in Chapter 3 that the feature selection was conducted to test classification performance, not to draw final inferences about any particular feature for this study. The thirty-eight features that were selected were thus reviewed and their significance discussed, but these did not indicate the absolute usefulness of the remaining features, as it had already been shown that the majority of the features had a useful discriminating function. These features were mainly function words or indicative of function words (POS tags). Three features were kinds of subordinating conjunctions and reporting verbs, which are words linked to academic writing. Other features were non-integrated quotations, sentence counts for non-integrated quotations, the ARI readability score, and the number of references. The first two features are unique to this research project; thus, it appears promising that these features were selected as good indicators of prototypical and non-prototypical essays. This could then indicate that they can be considered useful for future automatic genre classification projects. The selection of the average number of references is pleasing because argumentative essays are required to provide evidence for their argumentation, which would entail references to other research. Therefore, this feature is expected to be characteristic of argumentative essays. Contrary to what was expected, words characteristic of academic writing, such as hedges, downtoners, and conjuncts, were not selected during feature selection. As indicated above, this could not be used to infer that these words are not useful features for this classification task. These features may rather be less discriminatory indicators (by varying margins) than the features selected. It is also possible that these features would have been selected if other feature selection techniques had been used.

Thus, from the above discussion it can be seen that the results of the various classifiers, as well as the testing of the hypotheses postulated in Chapters 3 and 4, have fully addressed the first, second and fourth research questions. As reported in Chapter 4, the best classifier's performance was 62.5 percent RB, 74.5 percent RG and 68.5 percent AR on a 75 percent training set and 25 percent testing set. This C-SV classifier (C = 32) was trained on the text-length normalised data set, using feature selection and a linear kernel. In practice, this means that out of 87 essays classified, 34 essays would be given to the senior marker and 53 essays to the junior marker.
Of the essays given to the senior marker, 14 would be incorrectly classified, while 12 essays given to the junior marker would be incorrectly classified. From the discussion in Chapter 4, as well as in light of the difficulty of the classification task, it is apparent that these results are very good. They are judged so in terms of both RB and the average and testing baselines. Nevertheless, it is still desirable to achieve 100 percent recall on 'bad' and 'good' examples. As mentioned in Section 5.2, Chapter 4 analysed the best classifier's performance and provided seven potential explanations to account for the less than 100 percent recall performance of the classifier.

The first reason was that the 'bad' argumentative essays were not truly representative of non-prototypical examples. This reason was postulated as the most likely explanation for the classifier not having been able to identify non-prototypical essays as easily as prototypical essays. The data used in this project are all relatively prototypical to very prototypical argumentative essays, and considered indicative of proficient student writing by the BAWE corpus compilers. The lowest grade for the essays in the corpus is 60 percent. The classifier ought to still generalise well if provided with true instances of prototypical and non-prototypical argumentative essays, because it is very good at identifying prototypical argumentative essays. On the data set used in this project, it would be more likely that the classifier would be able to distinguish better between the two classes if the 'bad' class were truly non-prototypical, that is, if essays with failing marks made up the non-prototypical case set. This problem could not be solved for this research project, as very few L1 English argumentative student essays are available for study. Moreover, where such corpora exist, they do not always have the necessary features for training; for example, the LOCNESS (Granger, 1994) has had quotations removed from the texts, from which five important features were derived for this research project.

The second reason was that the size of the data set may not have been adequate. It is possible that the classifier might have required more learning examples, in order to generalise well and thus have optimal performance. The best way of establishing whether accuracy is a function of training size is to train the classifier on an increasing number of texts, and then test the classifier on the same testing set each time. This was not possible with the SVM package used for this project, because the training and testing examples were selected simultaneously (and randomly) by the algorithm. This meant that the testing set could not be independently selected.

The third reason was that cross-domain classification could have reduced accuracy. Evidence for this was provided in Finn (2002:86), Wolters and Kirsten (1999), and Karlgren and Cutting (1994:1072). Thus, it was expected that for this research project, in which all the data were drawn from and classified across domains, accuracy would be negatively affected.

The fourth explanation is that personal language differences could have confused classification. Van Halteren, Baayen, Tweedie, Haverkort and Neijt (2005) prove that a human stylome exists in the framework of authorship attribution studies, and find that personal language differences play a great role in classification in these studies, even in the case of non-specialist writers (2005:14).
Since the essays used in this research project were written by different authors, it is likely that individual stylistic variations, as established by Van Halteren et al. (2005), could affect classification. Thus, differences that are not only indicative of prototypical or non-prototypical essays but also of personal language styles were also extracted, resulting in a poorer performance in classifying prototypical and non-prototypical essays.

The fifth explanation was the most obvious and simple one: the features used may not have adequately extracted the linguistic differences between the classes. As the identification of good discriminant features is one of the main concerns of automatic genre classification, solutions to choosing better features are important for future work. Therefore, suggestions in this regard are made in the next section.

The sixth explanation was that the feature selection techniques used may not have been adequate to extract the most relevant features from the feature set. This topic is addressed in the form of potential solutions to better feature selection in Section 5.4, below.

The seventh explanation was that more knowledge of the discourse 'struggle' may have needed to be considered. This explanation is derived from the approach this project took towards student academic writing, as stated in Chapter 1. Nystrand (1987:204) claims that any good text analysis needs to distinguish between "the structure of argument and the structure of communication". This project aimed to extract relevant linguistic features of argumentative essays. It is possible that, in addition to these features, there may be evidence of a communicative role with the reader-marker, which had not been extracted, and which could provide more accurate classification.

This section has thus shown that the results of the best classifier, as presented in Chapter 4, answered the fourth research question fully. This study can thus be shown to have achieved all of its aims, by developing a classifier, using an automatic genre classification approach, which categorises prototypical and non-prototypical argumentative essays of student writers into 'good' or 'bad' classes. In order to do this, the most discriminating features between prototypical argumentative essays and non-prototypical argumentative essays were determined in an initial literature review and experiments on classifier results, using different feature sets. Furthermore, only linguistic features that can be easily computed were used. The classifier that was developed was trained using SVM learning, a technique that has been shown to provide good generalisability in automatic genre classification studies (Dewdney et al., 2001; Argamon & Dodick, 2004a; Meyer zu Eissen & Stein, 2004; and Santini, 2005b) and in other classification tasks (Burges, 1998:121; Scholkopf & Smola, 2002:22; Kroon, 2003:55; and Stecking & Schebesch, 2003). This machine learning technique requires human effort only in determining the optimal learning parameters, as discussed in Chapter 4. The next section discusses recommendations for further research in light of the above discussion.

5.4 Recommendations for further research

The recommendations in this section are for furthering a study of the nature of this project.
Therefore, the recommendations could be applied to automatic genre classification studies seeking to classify academic texts and to similar NLP applications where the object is to explore the linguistic features of academic writing. These recommendations are derived from this research project as well as other studies. They are grouped into three areas: features, evaluation, and learning technique, and are listed separately below.

5.4.1 Features

In Section 5.3, it was suggested that it is possible that the linguistic features of this project did not adequately extract the differences between the two classes under examination. A variety of features could potentially improve accuracy for a further study of a similar nature to this research project. However, only a few features are suggested here: additional quotation information, word length, readability scores, feature ratios, POS trigrams, parsing, and stability. The first four features are surface features, which are the most widely used type of feature because they are easy to compute and have been demonstrated to achieve good accuracy in the field of automatic genre classification (this study, and also see the studies reviewed in Chapter 2). The first feature is additional quotation information. This study considered quotations and the verbs associated with them (see reporting verbs in Section 3.3.6.4) as potentially revealing features for distinguishing prototypical and non-prototypical examples of argumentative essays. Quotations could be further explored by examining their location in sentences, and the types of reporting verbs could be grouped according to the type of reporting: direct quotation, paraphrase, summary, and generalisation (Hyland, 2002:116). The second feature is word length in syllables. Kelih et al. (2005), reviewed in Chapter 2, provide compelling evidence for word length being a powerful discriminator of genres, if defined as syllables per word. The third feature is readability measures. In Chapter 4, it is shown that the ARI readability score was a good discriminant feature for classifying prototypical and non-prototypical essays. This encourages further exploration of readability measures, such as those in Boese (2005). The fourth feature is ratios of features, which can be of many types. Two examples are the ratio of determiners to nouns and the ratio of prepositions to nouns. These kinds of ratios are used in Ross and Hunter's (1994) stylistic description tool. The final three features go beyond more typical surface features. The first suggested feature set is the POS trigrams that Santini (2004a) used as indicators of deeper syntactic features. As was seen in that study, reviewed in Chapter 2, good accuracies resulted from POS trigrams. The second feature set is the functional cues and syntactic patterns used by Santini (2005b), reviewed in Chapter 2, which are extracted with a parser. These two feature sets are expected to reveal more information regarding a text's syntax than, for example, simple POS tags. The third feature is the measure of linguistic features introduced by Koppel, Argamon, and Shimoni (2003), which they term stability. Features are considered stable if they have semantic equivalents, which can replace them without changing the meaning of the text. They show that features with low stability can be useful for stylistic text classification.
This feature is expected to yield potentially useful information with regard to a text's semantic structure. Up to the present, automatic genre classification has not made use of semantically-orientated linguistic features. Semantic analysis has already been proven successful in essay grading (Lemaire & Dessus, 2001). In addition to experimenting with the features detailed above, features could be collapsed into smaller categories. This could be done on the basis of analysis by hand and then grouping according to function. Alternatively, features could be combined using, for example, Principal Components Analysis (Hand, 1997:149-151).

5.4.2 Evaluation

A variety of standard evaluation measures were detailed in Chapter 4. According to Hand (1997:100-114), the following four evaluation measures provide a more comprehensive method of evaluation (these measures require access to the probabilities of belonging to the classes being examined):
1. Inaccuracy: judges how ineffective a classification rule is in classifying an object correctly. This is based on a measure of the difference between the true class and the estimated probability of belonging to this class.
2. Imprecision: the difference between estimated probabilities and true probabilities of belonging to a particular class.
3. Inseparability: this determines whether the classification situation is indeed completely separable, by establishing whether the true probabilities of belonging to each class at x, averaged over x, are similar. Similarity will indicate that the problem is inseparable.
4. Resemblance: measures the variation between the true probabilities of belonging to one class above the decision surface and to another class below the decision surface.
Thus, a good classifier is one with high accuracy, high precision, and low resemblance. This means low inaccuracy coefficients, imprecision coefficients and resemblance coefficients (Hand, 1997:100).

5.4.3 Learning technique

Recommendations regarding the learning technique refer either to using a different implementation of SVMs or to combining classifiers (for a comprehensive list of SVM implementations, see http://www.support-vector-machines.org/SVMsoft.html). In the latter case, different classifiers can be trained using the same learning methodology but, for example, different feature sets. The final classifications can then be made using bagging, where the chosen classification is the one most often selected by all the classifiers, or boosting, where all the classifiers contribute to the final vote by weighted voting (Hand, 1997:159-162). An example of such an approach is the ensemble learning approach of the follow-up study to Finn (2002), which was reviewed in Chapter 2. In that study, Finn (2002:76) suggests combining feature sets in classification. This suggestion is taken up in Finn and Kushmerick (2003), in which they compare the accuracies of Finn (2002) with a classifier built using the combined feature sets. They make use of an ensemble learner, which is the result of three classifiers trained on the same data, each using a different feature set. They find that their ensemble learner often performs significantly better in comparison with the performance of individual feature sets.

5.5 Summary

This chapter provided a conclusion to this dissertation. It summarised the preceding chapters, presented the results and findings of Chapter 4, and reviewed the hypotheses postulated in Chapters 3 and 4. This was done to show how the research questions posed in Chapter 1 were adequately addressed by the research project.
Thus, this chapter demonstrated that the research aims of this study have been achieved by developing an automatic genre classifier that categorises prototypical and non-prototypical instances of the argumentative essay genre, written by students. This was achieved by first reviewing automatic genre classification studies, in order to determine the potentially most discriminating features of prototypical and non-prototypical argumentative essays. Thereafter, the features used for this research project were selected from these studies, as well as two well-known English grammar books: Biber et al. (1999) and Quirk et al. (1985). In Chapters 3 and 4, various hypotheses regarding best feature sets, case sets and data transformations were posed. These were examined and then accepted or rejected. This acceptance or rejection, summarised in this chapter, holds relevant implications regarding the most discriminant linguistic features for prototypical and non-prototypical argumentative essays. Furthermore, the fourth research aim of this study was achieved by developing the classifier, using SVM learning. Finally, this chapter concluded by putting forth suggestions for future research regarding features, evaluation, and learning technique.

REFERENCE LIST

ABE, S. 2005. Support vector machines for pattern classification. London: Springer. 343 p.
AIRES, R.; ALUISIO, S.; & SANTOS, D. 2005. User-aware page classification in a search engine. (In the Proceedings of Stylistic Analysis of Text for Information Access, SIGIR Workshop, Salvador, Brazil, August 19). www.linguateca.pt/documentos/AiresetalSIGIR2005.pdf Date of access: 6 Jan. 2008.
AIZERMAN, A.; BRAVERMAN, E.M.; & ROZONER, L.I. 1964. Theoretical foundations of the potential function method in pattern recognition learning. Automation and remote control, 25:821-837.
ALUISIO, S.M.; PINHEIRO, G.M.; FINGER, M.; NUNES, M.G.V.; & TAGNIN, S.E. 2003. The Lacio-Web Project: overview and issues in Brazilian Portuguese corpora creation. (In Archer, D.; Rayson, P.; Wilson, A.; & McEnery, T., eds. The Proceedings of Corpus Linguistics, Lancaster, U.K. UCREL Technical Papers, Vol. 16, p. 14-21.) ftp://ftp.ime.usp.br/pub/mfinger/2003/Lacio.pdf.gz Date of access: 6 Jan. 2008.
ARGAMON, S. & DODICK, J. 2004a. Conjunction and modal assessment in genre classification: a corpus-based study of historical and experimental science writing. (In Qu, Y.; Shanahan, J.; & Wiebe, J., eds. The Proceedings of the AAAI Spring Symposium on Attitude and Affect in Text: Theories and Applications, Palo Alto, California, U.S.A., March. Menlo Park, CA: AAAI Press, p. 1-8.)
ARGAMON, S. & DODICK, J. 2004b. Linking rhetoric and methodology in formal scientific writing. (In Forbus, K.; Gentner, D.; & Regier, T., eds. The Proceedings of the 26th Annual Conference of the Cognitive Science Society, Chicago, Illinois, U.S.A., August, p. 61-65.) http://www.cogsci.rpi.edu/CSJarchive/Proceedings/2004/CogSci04.pdf Date of access: 6 Jan. 2008.
ARGAMON, S.; KOPPEL, M.; & AVNERI, G. 1998. Routing documents according to style. (In Schwartz, D.; Divitini, M. & Brasethvik, T., eds. The Proceedings of IIIS-98: 1st International Workshop on Innovative Internet Information Systems, Pisa, Italy, June 8-9.)
http://citeseer.ist.psu.edu/cache/papers/cs/2891/http:zSzzSzwww.idt.ntnu.nozSz~monicazSziiis-98zSzpaperszSzargamon.pdf/argamon98routing.pdf Date of access: 6 Jan. 2008.
ASTON, G. & BURNARD, L. 1998. The BNC handbook: exploring the British National Corpus with SARA. Edinburgh: Edinburgh University Press. 256 p.
BELLMAN, R.E. 1961. Adaptive control processes: a guided tour. Princeton, NJ: Princeton University Press. 255 p.
BENNETT, K.P. & CAMPBELL, C. 2000. Support vector machines: hype or hallelujah? SIGKDD Explorations, 2(2):1-13, Dec. http://www.acm.org/sigs/sigkdd/explorations/issue2-2/bennett.pdf Date of access: 6 Jan. 2008.
BEUTEL, B. 1998. Malaga User Manual. http://www.linguistik.unierlangen.de/Malaga.de.html Date of access: 6 Jan. 2008.
BIBER, D. 1986. Spoken and written textual dimensions in English: resolving the contradictory findings. Language, 62(2):384-414, Jun.
BIBER, D. 1988. Variation across speech and writing. London: Cambridge University Press. 299 p.
BIBER, D. 1989. A typology of English texts. Linguistics: an interdisciplinary journal of the language sciences, 27(1):3-34.
BIBER, D. 1993a. Using register-diversified corpora for general language studies. Computational linguistics, 19(2):219-242, Jun.
BIBER, D. 1993b. Representativeness in corpus design. Literary and linguistic computing, 8(4):243-257.
BIBER, D.; CONRAD, S.; & REPPEN, R. 2000. Corpus linguistics: investigating language structure and use. Cambridge: Cambridge University Press. 300 p.
BIBER, D.; CONRAD, S.; REPPEN, R.; BYRD, P.; HELT, M.; CLARK, V.; CORTES, V.; CSOMAY, E.; & URZUA, A. 2004. Representing language use in the university: analysis of the TOEFL 2000 Spoken and Written Academic Language corpus. Report: ETS TOEFL Monograph series MS-25, January, RM-04-03. Princeton, NJ: Educational Testing Service. 303 p. http://www.ets.org/Media/Research/pdf/RM-04-03.pdf Date of access: 6 Jan. 2008.
BIBER, D.; JOHANSSON, S.; LEECH, G.; CONRAD, S.; & FINEGAN, E. 1999. The Longman grammar of spoken and written English. Essex: Pearson. 1204 p.
BISANT, D. 2005. An application of neural networks to sequence analysis and genre identification. International journal of pattern recognition and artificial intelligence, 19(2):199-215, Mar.
BLANKENSHIP, J. 1962. A linguistic analysis of oral and written style. Quarterly journal of speech, 48(4):419-422, Dec.
BOESE, E.S. 2005. Stereotyping the web: genre classification of web documents. Fort Collins: Colorado State University. (Thesis — M.Sc.) 68 p. http://www.cs.colostate.edu/~boese/Research/masters.pdf Date of access: 6 Jan. 2008.
BOESE, E.S. & HOWE, A.E. 2005. Effects of web document evolution on genre classification. (In Herzog, O.; Schek, H.; Fuhr, N.; Chowdhury, A.; & Teiken, W., eds. The Proceedings of the 14th ACM International Conference on Information and Knowledge Management (CIKM), Bremen, Germany, October 31 - November 5. New York, NY: ACM Press, p. 632-639.) http://www.cs.colostate.edu/~boese/Research/Publications/CIKM05-boese.pdf Date of access: 6 Jan. 2008.
BOSER, B.E.; GUYON, I.M.; & VAPNIK, V.N. 1992. A training algorithm for optimal margin classifiers. (In D. Haussler, ed. The Proceedings of the 5th Annual Workshop on Computational Learning Theory, Pittsburgh, Pennsylvania, U.S.A. New York, NY: ACM Press, p. 144-152.) http://coblitz.codeen.org:3125/citeseer.ist.psu.edu/cache/papers/cs/11260/http:zSzzSzwww.clopinet.comzSzisabellezSzPaperszSzcolt92.pdf/boser92training.pdf Date of access: 6 Jan. 2008.
BRANTS, T. 2000.
TnT: a statistical part-of-speech tagger. (In the Proceedings of the 6th Conference on Applied Natural Language Processing, Seattle, Washington, U.S.A., April 29 - May 3. San Francisco, CA: Morgan Kaufmann. p. 224-231). http://www.aclweb.org/anthology-new/A/A00/A00-1031.pdf Date of access: 6 Jan. 2008.
BRASLAVSKI, P. & TSELISCHEV, A. 2005. Experiment on style-dependent document ranking. (In the Proceedings of the 7th Russian Conference on Digital Libraries (RCDL'05), Yaroslavl, Russia, October 4-6). http://www.rcdl2005.uniyar.ac.ru/ru/RCDL2005/papers/sek7_l_paper.pdf Date of access: 6 Jan. 2008.
BRILL, E. 1992. A simple rule-based tagger. (In the Proceedings of the 3rd Conference on Applied Natural Language Processing (ANLP'92), Trento, Italy, March 31 - April 3. Morristown, NJ: Association for Computational Linguistics, p. 152-155). http://www.aclweb.org/anthology-new/A/A92/A92-1021.pdf Date of access: 6 Jan. 2008.
BUCKINGHAM SHUM, S.; UREN, V.; LI, G.; DOMINGUE, J.; & MOTTA, E. 2002. Visualizing internetworked argumentation. (In Kirschner, P.A.; Buckingham Shum, S.J. & Carr, C.S., eds. Visualizing argumentation: software tools for collaborative and educational sense-making. London: Springer. 216 p.)
BURGES, C.J.C. 1998. A tutorial on support vector machines for pattern recognition. Data mining and knowledge discovery, 2(2):121-167, Jun.
BURSTEIN, J.; MARCU, D.; ANDREYEV, S.; & CHODOROW, M. 2001. Towards automatic classification of discourse elements in essays. (In the Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics (ACL'01), Toulouse, France, July 9-11. San Francisco, CA: Morgan Kaufmann. p. 98-105.) http://www.isi.edu/~marcu/papers/thesis-statements-acl01.pdf Date of access: 6 Jan. 2008.
CALANDO, A. 2004. Utrac-0.3.0. Alliance MCA. http://utrac.sourceforge.net/download/utrac-0.3.0.tar.gz Date of access: 6 Jan. 2008.
CHAFE, W. 1982. Integration and involvement in speaking, writing and oral literature. (In Tannen, D., ed. Spoken and written language: exploring orality and literacy. Norwood, NJ: Ablex. p. 35-54.)
CHAFE, W. & DANIELEWICZ, J. 1987. Properties of spoken and written language. (In Horowitz, R. & Samuels, S.J., eds. Comprehending oral and written language. New York, NY: Academic Press, p. 83-113.)
CHANG, C.-C. & LIN, C.-J. 2001. Training v-support vector classifiers: theory and algorithms. Neural computation, 13(9):2119-2147, Sept.
CHEN, P.-H.; LIN, C.-J.; & SCHOLKOPF, B. 2005. A tutorial on v-support vector machines. Applied stochastic models in business and industry, 21(2):111-136, Mar./Apr. http://www.kyb.mpg.de/publications/pdfs/pdf3353.pdf Date of access: 6 Jan. 2008.
COVER, T.M. 1965. Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE transactions on electronic computers, 14(3):326-334, Jun.
CORTES, C. & VAPNIK, V.N. 1995. Support vector networks. Machine learning, 20(3):273-297, Sept.
COUTURE, B., ed. 1986. Functional approaches to writing: research perspectives. Norwood, NJ: Ablex. 271 p.
CRISTIANINI, N. & SHAWE-TAYLOR, J. 2000. An introduction to support vector machines and other kernel-based learning methods. Cambridge: Cambridge University Press. 189 p.
CROWSTON, K. & KWASNIK, B. 2004. A framework for creating a facetted classification for genres: addressing issues of multidimensionality. (In the Proceedings of the 37th Hawaii International Conference on System Sciences (HICSS'04), Track 4, Waikoloa, Island of Hawaii, January 5-8.
Los Alamitos, CA: IEEE Computer Society. 9 p.) http://csdl2.computer.org/comp/proceedings/hicss/2004/2056/04/205640100a.pdf Date of access: 6 Jan. 2008.
DAELEMANS, W.; VAN DEN BOSCH, A.; & WEIJTERS, T. 1997. IGTree: using trees for compression and classification in lazy learning algorithms. Artificial intelligence review, 11(1-5):407-423, Feb.
DAVIDSON, C. & TOMIC, A. 1999. (In Jones, C.; Turner, J.; & Street, B., eds. Students writing in the university: cultural and epistemological issues. Philadelphia: John Benjamins, p. 161-169.)
DEWDNEY, N.; VANESS-DYKEMA, C.; & MACMILLAN, R. 2001. The form is the substance: classification of genres in text. (In the Proceedings of the ACL 2001 Workshop on Human Language Technology and Knowledge Management, Toulouse, France, July 6-7. Morristown, NJ: Association for Computational Linguistics, p. 1-8.) http://www.aclweb.org/anthology-new/W/W01/W01-1007.pdf Date of access: 6 Jan. 2008.
EITRICH, T. & LANG, B. 2006. Efficient optimization of support vector machine learning parameters for unbalanced datasets. Journal of computational and applied mathematics, 196(2):425-436, Nov.
ENGLISH, F. 1999. What do students really say in their essays? Towards a descriptive framework for analysing student writing. (In Jones, C.; Turner, J.; & Street, B., eds. Students writing in the university: cultural and epistemological issues. Philadelphia: John Benjamins, p. 17-36.)
EMDE, W. & WETTSCHERECK, D. 1996. Relational instance based learning. (In Saitta, L., ed. The Proceedings of the 13th International Conference on Machine Learning (ICML'96), Bari, Italy, July 3-6. San Francisco, CA: Morgan Kaufmann. p. 122-130).
ERIKSON, F. & SCHULTZ, J. 1982. The counsellor as gatekeeper: social interaction in interviews. New York, NY: Academic Press. 263 p.
EXPERT ADVISORY GROUP ON LANGUAGE ENGINEERING STANDARDS (EAGLES). 1996. Preliminary recommendations on text typology. EAGLES Document EAG-TCWG-TTYP/P, June. http://www.ilc.cnr.it/EAGLES/texttyp/texttyp.html Date of access: 6 Jan. 2008.
FINN, A. 2002. Machine learning for genre classification. Dublin: University College. (Thesis — M.Sc.) 133 p. http://www.aidanf.net/publications/afthesis.ps Date of access: 6 Jan. 2008.
FINN, A. & KUSHMERICK, N. 2003. Learning to classify documents according to genre. (Paper presented at the 18th International Joint Conference on Artificial Intelligence (IJCAI-03) Workshop on Computational Approaches to Style Analysis and Synthesis, Acapulco, Mexico, August 9-15.) (Unpublished.) http://www.aidanf.net/publications/finn031earninggenre.pdf Date of access: 6 Jan. 2008.
FINN, A.; KUSHMERICK, N.; & SMYTH, B. 2002. Genre classification and domain transfer for information filtering. (In Crestani, F.; Girolami, M.; & Van Rijsbergen, C.J., eds. Advances in information retrieval. The Proceedings of the 24th BCS-IRSG European Colloquium on IR Research, Glasgow, U.K., March 25-27. Berlin: Springer, p. 353-362.)
FLESCH, R. 1948. A new readability yardstick. Journal of applied psychology, 32(3):221-233, Jun.
FLOWERDEW, L. 2002. Corpus-based analyses in EAP. (In Flowerdew, J., ed. Academic discourse. Harlow: Pearson, p. 95-114.)
FORSBOM, E. 2005. Feature extraction for genre classification. http://stp.lingfil.uu.se/~evafo/gslt/statmet/statmet05forsbom.pdf Date of access: 6 Jan. 2008.
FRANCIS, W.N. & KUCERA, H. 1982. Frequency analysis of English usage: lexicon and grammar. Boston, MA: Houghton Mifflin. 561 p.
FRIEDL, J. 1997.
Mastering regular expressions: powerful techniques for Perl and other tools. Sebastopol, CA: O'Reilly. 342 p.
GARSIDE, R. 1987. The CLAWS word-tagging system. (In Garside, R.; Leech, G.; & Sampson, G., eds. The computational analysis of English: a corpus-based approach. London: Longman, p. 30-41.)
GARSIDE, R. & SMITH, N. 1997. A hybrid grammatical tagger: CLAWS4. (In Garside, R.; Leech, G.; & McEnery, A., eds. Corpus annotation: linguistic information from computer text corpora. London: Longman, p. 102-121.)
GIMENEZ, J. (jgimenez@lsi.upc.edu) 12 Sept. 2006. Re: SVMTool. E-mail to: Raaff, S. (20226209@puknet.puk.ac.za).
GIMENEZ, J. (jgimenez@lsi.upc.edu) 23 Jan. 2007. Character encoding help. E-mail to: Raaff, S. (20226209@student.nwu.ac.za).
GIMENEZ, J. & MARQUEZ, L. 2004. SVMTool: a general POS tagger generator based on support vector machines. (In Lino, M.T.; Xavier, M.F.; Ferreira, F.; Costa, R.; & Silva, R., eds. The Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC'04), Lisbon, Portugal, May 26-28. Paris: ELDA. p. 43-46). http://www.lsi.upc.edu/~nlp/SVMTool/lrec2004-gm.pdf Date of access: 6 Jan. 2008.
GIMENEZ, J. & MARQUEZ, L. 2006. SVMTool: Technical Manual v 1.3, August. http://www.lsi.upc.edu/~nlp/SVMTool/SVMTool.vl.3.pdf Date of access: 6 Jan. 2008.
GRANGER, S. 1994. The learner corpus: a revolution in applied linguistics. English today, 39(3):25-29.
GRANGER, S. & RAYSON, P. 1998. Automatic profiling of learner texts. (In Granger, S., ed. Learner English on computer. London: Longman, p. 119-131.)
HALLIDAY, M.A.K. 1978. Language as social semiotic: the social interpretation of language and meaning. London: Arnold. 256 p.
HALLIDAY, M.A.K. 1987. Spoken and written modes of meaning. (In Horowitz, R. & Samuels, S.J., eds. Comprehending oral and written language. New York, NY: Academic Press, p. 55-82.)
HALLIDAY, M.A.K. & MATTHIESSEN, C.M.I.M. 2004. An introduction to functional grammar. 3rd ed. London: Arnold. 689 p.
HAND, D.J. 1997. Construction and assessment of classification rules. Chichester: Wiley. 214 p.
HASTIE, T.; TIBSHIRANI, R.; & FRIEDMAN, J. 2001. The elements of statistical learning: data mining, inference and prediction. New York, NY: Springer. 533 p.
HECHTER, T. 2004. A comparison of support vector machines and traditional techniques for statistical regression and classification. Stellenbosch: Universiteit Stellenbosch. (Dissertation — M.Com.) 159 p.
HENNING, J.G. 2006. Linking adverbials in first, second and foreign language English student writing corpora. Potchefstroom: Noordwes-Universiteit. (Dissertation — M.A.) 161 p.
HOLMES, J. (jasper.holmes@gmail.com) 21 Feb. 2007a. Re: Help with BAWE missing data. E-mail to: Raaff, S. (20226209@student.nwu.ac.za).
HOLMES, J. (jasper.holmes@gmail.com) 21 Feb. 2007b. Re: Help with BAWE missing data. E-mail to: Raaff, S. (20226209@student.nwu.ac.za).
HSU, C.-W.; CHANG, C.-C.; & LIN, C.-J. 2003. A practical guide to support vector classification. Technical report, Department of Computer Science, National Taiwan University (Taiwan). http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf Date of access: 6 Jan. 2008.
HYLAND, K. 1998. Hedging in scientific research articles. Amsterdam: John Benjamins. 307 p.
HYLAND, K. 2002a. What do they mean? Questions in academic writing. Text, 22(4):529-557. Available: Academic Search Premier. Date of access: 6 Jan. 2008.
HYLAND, K. 2002b. Activity and evaluation: reporting practices in academic writing. (In Flowerdew, J., ed.
Academic discourse. Harlow: Pearson, p. 115-130).
HYLAND, K. 2004a. Patterns of engagement: dialogic features and L2 undergraduate writing. (In Ravelli, L.J. & Ellis, R.A., eds. Analysing academic writing: contextualised frameworks. London: Continuum, p. 5-23.)
HYLAND, K. 2004b. Genre and second language writing. Ann Arbor, MI: University of Michigan Press. 244 p.
HYLAND, K. & MILTON, J. 1997. Qualifications and certainty in L1 and L2 students' writing. Journal of second language writing, 6(2):183-205, May. Available: ScienceDirect. Date of access: 6 Jan. 2008.
IVANIC, R.; CLARK, R.; & RIMMERSHAW, R. 2000. What am I supposed to make of this? The messages conveyed to students by tutors' written comments. (In Lea, M.R. & Stierer, B., eds. Student writing in higher education. Buckingham: SRHE and Open University Press, p. 47-65.)
JACKSON, P. & MOULINIER, I. 2002. Natural language processing for online applications: text retrieval, extraction and categorization. Amsterdam: John Benjamins. 225 p.
JOACHIMS, T. 1999a. Making large-scale SVM learning practical. (In Scholkopf, B.; Burges, C.; & Smola, A., eds. Advances in kernel methods — support vector learning. Cambridge, MA: MIT. p. 169-184.)
JOACHIMS, T. 1999b. Transductive inference for text classification using support vector machines. (In Bratko, I. & Dzeroski, S., eds. The Proceedings of the 16th International Conference on Machine Learning (ICML'99), Bled, Slovenia, June 27-30. San Francisco, CA: Morgan Kaufmann. p. 200-209.) http://coblitz.codeen.org:3125/citeseer.ist.psu.edu/cache/papers/cs/6683/http:zSzzSzwww-ai.cs.uni-dortmund.dezSzDOKUMENTEzSzJoachims_99c.pdf/joachims99transductive.pdf Date of access: 6 Jan. 2008.
JOHANNESSON, E. & WALLSTROM, C. 1999. Automatic analysis and visualization of stylistic genres. (In Kakola, T.K., ed. The Proceedings of the 22nd Information Systems Research Seminar in Scandinavia (IRIS 22): Enterprise Architectures for Virtual Organisations, Keuruu, Finland, August 7-10. Jyvaskyla: Department of Computer Science and Information Systems, University of Jyvaskyla. p. 149-162.) http://www.cs.jyu.fi/~timokk/iris22/IRIS22_Volume2.pdf Date of access: 6 Jan. 2008.
JOHANSSON, S.; LEECH, G.; & GOODLUCK, H. 1978. Manual of information to accompany the Lancaster-Oslo-Bergen corpus of British English, for use with digital computers. Oslo: Department of English, Oslo University. 147 p.
JURAFSKY, D. & MARTIN, J.H. 2000. Speech and language processing: an introduction to natural language processing, computational linguistics, and speech recognition. Upper Saddle River, NJ: Prentice Hall. 934 p.
KARAKOULAS, G. & SHAWE-TAYLOR, J. 1999. Optimizing classifiers for imbalanced training sets. (In Kearns, M.S.; Solla, S.A.; & Cohn, D.A., eds. Advances in neural information processing systems, Vol. 11. Cambridge, MA: MIT. p. 253-259.)
KARJALAINEN, A.; PAIVARINTA, T.; TYRVAINEN, P.; & RAJALA, J. 2000. Genre-based metadata for enterprise document management. (In the Proceedings of the 33rd Hawaii International Conference on System Sciences (HICSS'33), Vol. 3, Maui, Hawaii, January 4-7. Los Alamitos, CA: IEEE Computer Society. 10 p.) http://csdl2.computer.org/comp/proceedings/hicss/2000/0493/03/04933013.pdf Date of access: 6 Jan. 2008.
KARLGREN, J. 1999. Stylistic experiments in information retrieval. (In Strzalkowski, T., ed. Natural language information retrieval. Dordrecht: Kluwer. p. 147-166.)
KARLGREN, J. 2000. Stylistic experiments for information retrieval.
Stockholm: Universitet Stockholms. (Dissertation — D.Phil.) 147 p. http://www.sics.se/~jussi/Artiklar/2000_PhD/thesis.ps.gz Date of access: 6 Jan. 2008. KARLGREN, J. & CUTTING, D. 1994. Recognizing text genres with simple metrics using discriminant analysis. (In the Proceedings of the 15th International Conference on Computational Linguistics (COLING 1994), Vol. 2, Kyoto, Japan, August 5-9. Morristown, NJ: Association for Computational Linguistics, p. 1071-1075.) http://eprints.sics.se/56/01/cmplglixcol.pdf Date of access: 6 Jan. 2008. KELIH, E.; ANTIC, G.; GRZYBEK, P.; & STADLOBER, E. 2005. Classification of author and/or genre? The impact of word length. (In Weihs, C. & Gaul, W., eds. Classification: the ubiquitous challenge. Berlin: Springer, p. 498-505.) KESSLER, B.; NUNBERG, G.; & SCHUTZE, H. 1997. Automatic detection of text genre. (In the Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics (ACL'97), Madrid, Spain, July 7-12. Morristown, NJ: Association for Computational Linguistics, p. 32-38.) http://www.aclweb.Org/anthology-new/P/P97/P97-1005.pdf Date of access: 6 Jan. 2008. KOHONEN, T.; KANGAS, J.; LAAKSONEN, J.; & TORKKOLA, K. 1995. LVQ PAK: the learning vector quantization package v. 3.1. Technical Report A30, Helsinki University of Technology (Norway). http://citeseer.ist.psu.edU/cache/papers/cs/5716/http:zSzzSzwww.ele.etsmtl.cazSzacad emiquezSzele778zSzarticleszSzlvq_tr96.pdf/kohonen961vq.pdf Date of access: 6 Jan. 2008. KOPPEL, M.; AKIVA, N.; & DAGAN, I. 2003. A corpus-independent feature set for style-based text categorization. (Paper resented at the 18th International Joint Conference on Artificial Intelligence (IJCAI-03) Workshop on Computational Approaches to Style Analysis and Synthesis, Acapulco, Mexico, August 9-15.) (Unpublished.) http://citeseer.ist.psu.edU/cache/papers/cs2/86/http:zSzzSzwww.cs.biu.ac.ilzSz~koppe lzSzpaperszSzstability-workshop.pdf/koppel03corpusindependent.pdf Date of access: 6 Jan. 2008. KOPPEL, M.; ARGAMON, S.; & SHEVIONI, A. 2003. Automatically categorizing written texts by author gender. Literary and linguistic computing, 17(4):401^112. KROON, R. S. 2003. Support vector machines, generalization bounds, and transduction. Stellenbosch: Universiteit van Stellenbosch. (Dissertation - M.Com.) 171 p. REFERENCE LIST 159 KUHN, H.W. & TUCKER, A.W. 1951. Nonlinear programming. (In Neymann, J., ed. The Proceedings of the 2nd Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, California, U.S.A., July 1 - August 12. Berkeley, CA: University of California Press, p. 481-492.) KWASNIK, B.H.; CROWSTON, K.; NILAN, M.; & ROUSSINOV, D. 2000. Identifying document genre to improve web search effectiveness. Bulletin of the American Society for Information Science and Technology, 27(2): 23-26, Dec/Jan. http://www.asis.org/Bulletin/Dec-01/kwasnikartic.html Date of access: 6 Jan. 2008. LAKOFF, G. 1973. Hedges: a study in meaning criteria and the logic of fuzzy concepts. Journal of philosophical logic, 2(4): 458-508, Oct. LEA, M.R & STREET, B.V. 2000. Student writing and staff feedback in higher education: an academic literacies approach. (In Lea, M.R & Stierer, B, eds. Student writing in higher education. Buckingham: SRHE and Open University Press, p. 32- 46.) LEBART, L.; SALEM, A.; & BERRY, L. 1998. Exploring textual data. Dordrecht: Kluwer. 245 p. LEE, D. Y. W. 2001. Genres, registers, text types, domains, and styles: clarifying the concepts and navigating a path through the BNC jungle. 
Language learning and technology, 5(3): 37-72, Sept. http://llt.msu.edu/vol5num3/pdf/lee.pdf Date of access: 6 Jan. 2008. LEE, Y.-B. & MYAENG, S.-H. 2002. Text genre classification with genre-revealing and subject-revealing features. (In the Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, August 11-15, Tampere, Finland. New York, NY: ACM Press, p.145-150.) http://ir.icu.ac.kr/papers/text_genre.pdf Date of access: 6 Jan. 2008. LEE, Y.-B. & MYAENG, S.-H. 2004. Automatic identification of text genres and their roles in subject-based categorization. (In the Proceedings of the 37th Hawaii International Conference on System Sciences (HICSS'04), Track 4, Waikoloa, Island of Hawaii, January 5-8. Los Alamitos, CA: IEEE Computer Society. 10 p.) http://csdl2.computer.org/comp/proceedings/hicss/2004/2056/04/205640100b.pdf Date of access: 6 Jan. 2008. LEMAIRE, B. & DESSU, P. 2001. A system to assess the semantic content of student essays. Journal of educational computing research, 24(3):305-320. LEOPOLD, E. & KINDERMANN, J. 2002. How to represent texts in input space? Machine learning, 46(l-3):423-444, Jan. LILLIEFORS, H.W. 1967. On the Kolmogorov-Smirnov test for normality with mean and variance unknown. Journal of the American Statistical Association, 62(318):399-402,Jun. REFERENCE LIST 160 LIN, Y.; LEE, Y.; & WAHBA, G. 2002. Support vector machines for classification in nonstandard situations. Machine learning, 46(1-3): 191-202, Jan. LIU, Y. & ZHENG, Y.F. 2006. FS_SFS: a novel feature selection method for support vector machines. Pattern recognition, 39(7):1333-1345, Jul. Available: ScienceDirect. Date of access: 6 Jan. 2008. LOUW, H. 2006. Standardising written feedback on L2 student writing. Potchefstroom: Noordwes-Universiteit. (Dissertation — M.A.) 218 p. LUHN, H.P. 1957. A statistical approach to mechanical encoding and searching of literary information. IBM journal of research and development, 1(4):309-317. MANNING, CD. & SCHUTZE, H. 1999. Foundations of statistical natural language processing. Cambridge, MA: MIT. 679 p. MARCUS, M.P.; SANTORINI, B.; & MARCINKIEWICZ, M.A. 1993. Building a large annotated corpus of English: the Penn Treebank. Computational linguistics. 19(2): 313-330. MARTIN, J.R. 1985. Process and text: two aspects of human semiosis. (In Benson, J.D & Greaves, W.S., eds. Systemic perspectives on discourse. Norwood, NJ: Ablex. p. 248-274.) MCCLAVE, J.T. & SINCICH, T. 2000. Statistics. 8th ed. Upper Saddle River, NJ: Prentice-Hall. 484 p. MEYER ZU EISSEN, S. & STEIN, B. 2004. Genre classification of web pages: user study and feasibility analysis. (In Biundo, S.; Friihwirth, T.; & Palm, G., eds. KI 2004: advances in artificial intelligence. Berlin: Springer, p. 256-269.) MITCHELL, T.T. 1997. Machine learning. Boston, MA: McGraw-Hill. 414 p. MOREALE, E. & VARGAS-VERA, M. 2003. Genre analysis and the automated extraction of arguments from student essays. (In Christie, J., ed. The Proceedings of the 7th International Computer Assisted Assessment Conference (CAA-2003), Loughborough, U.K., July 8-9. Loughborough: Loughborough University, p.271- 286.) http://www.caaconference.com/pastConferences/2003/procedings/moreale.pdf Date of access: 6 Jan. 2008. MOSTELLER, F. & WALLACE, D. L. 1964. Inference and disputed authorship: the Federalist papers. Reading, MA.: Addison-Wesley. 287 p. NELSON, G. 2002. International Corpus of English markup manual for written texts. 
http://www.ucl.ac.uk/english-usage/ice/written.pdf Date of access: 6 Jan. 2008. NESI, H.; SHARPLING, G.; & GANOBCSIK-WILLIAMS, L. 2004. Student papers across the curriculum: designing and developing a corpus of British student writing. Computers and composition, 21(4): 439^150, Dec. Available: Science Direct. Date of access: 6 Jan. 2008. REFERENCE LIST 161 NOBLE, W.S. 2006. What is a support vector machine? Nature biotechnology, 24(12): 1565-1567, Dec. NYSTRAND, M. 1987. The role of context in written communication. (In Horowitz, R & Samuels, S.J., eds. Comprehending oral and written language. New York, NY: Academic Press, p. 197-214.) OAKES, M. P. 1998. Statistics for corpus linguistics. Edinburgh: Edinburgh University Press. 287 p. PLATT, J. 1999. Fast training of support vector machines using Sequential Minimal Optimization. (In Scholkopf, B.; Burges, C ; & Smola, A., eds. Advances in kernel methods — support vector learning. Cambridge, MA: MIT. p. 185-208.) QUINLAN, J.R. 1993. C4.5: programs for machine learning. San Mateo, CA: Morgan Kaufmann. 302 p. QUIRK, R. 1968. Essays on the English language: medieval and modern. London: Longman. 201 p. QUIRK, R., GREENBAUM, S., LEECH, G., & SVARTVIK, J. 1985. A comprehensive grammar of the English language. London: Longman. 1779 p. RAUBER, A. & MULLER-KOGLER, A. 2001. Integrating automatic genre analysis into digital libraries. (In Fox, E.A. & Borgman, C.L., eds. The Proceedings of the 1st ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL'01), June 24-28, Roanoke, Virginia, U.S.A. New York, NY: ACM Press, p. 1-10.) http://www.ifs.tuwien.ac.at/ifs/research/pub_pdf/raujcdl01.pdf Date of access: 6 Jan. 2008. REHM, G. 2002. Towards automatic web genre identification. (In the Proceedings of the 35th Hawaii International Conference on System Sciences (HICSS'02), Track 4, Waikoloa, Island of Hawaii, January 7-10. Los Alamitos, CA: IEEE Computer Society. 10 p.) http://csdl2.computer.org/comp/proceedings/hicss/2002/1435/04/14350101.pdf Date of access: 6 Jan. 2008. RIETVELD, T. & VAN HOUT, R. 2005. Statistics in language research: analysis of variance. Berlin: Mouton de Gruyter. 265 p. ROBERTS, P. 1960. The relation of Linguistics to the teaching of English. College English, 22(1): 1-9, Oct. ROSCH, E. 1975. Cognitive representations of semantic categories. Journal of experimental psychology. 104(3): 192-233, Sept. ROSS, D. & HUNTER, D. 1994. //- EYEBALL: an interactive system for producing stylistic descriptions and comparisons. Computers and the humanities. 28(1): 1-11. REFERENCE LIST 162 ROSSO, M.A. 2005. Using genre to improve web search. Chapel Hill, NC: University of North Carolina. (Dissertation - Ph.D.) 275 p. http://ils.unc.edu/~rossm/Rosso_dissertation.pdf Date of access: 6 Jan. 2008. ROUSSINOV, D.; CROWSTON, K.; NILAN, N.; KWASNK, B.; CAI, J.; & LIU, X. 2001. Genre based navigation on the web. (In the Proceedings of the 34th Hawaii International Conference on System Sciences (HICSS'01), Track 4, Maui, Hawaii, January 3-6. Los Alamitos, CA: IEEE Computer Society. 10 p.) http://csdl2.computer.org/comp/proceedings/hicss/2001/0981/04/09814013.pdf Date of access: 6 Jan. 2008. ROYSTON, J. P. 1982. An extension of Shapiro and Wilk's W test for normality to large samples. Applied statistics, 31(2): 115-124. SALKIND, N. 2004. Statistics for people who (think they) hate statistics. 2nd ed. Thousand Oaks, CA: Sage. 401 p. SANTINI, M. 2004a. A shallow approach to syntactic feature extraction for genre classification. (In Lee, M., ed. 
The Proceedings of the 7th Annual Colloquium for the UK Special-Interest Group for Computational Linguistics (CLUK 7), 6-7 January, Birmingham, U.K.) ftp://ftp.itri.bton.ac.uk/reports/ITRI-04-02.pdf Date of access: 6 Jan. 2008. SANTINI, M. 2004b. State-of-the-art on automatic genre identification. Technical report, University of Brighton (U.K.). ftp://ftp.itri.bton.ac.uk/reports/ITRI-04-03.pdf Date of access: 6 Jan. 2008. SANTINI, M. 2005a. Genres in formation? An exploratory study of Web pages using cluster analysis. (In the Proceedings of the 8th Annual Colloquium for the UK Special-Interest Group for Computational Linguistics (CLUK 8), January 11, Manchester, U.K.) http://www.itri.brighton.ac.uk/~Marina.Santini/ITRI-05-01 .pdf Date of access: 6 Jan. 2008. SANTINI, M. 2005b. Linguistic facets for genre and text type identification: a description of linguistically-motivated features. Technical Report, University of Brighton (U.K.). http://www.nltg.brighton.ac.uk/home/Marina.Santini/linguistic_facets_tech_rep.pdf Date of access: 6 Jan. 2008. SANTINI, M. 2005c. Clustering web pages to identify emerging textual patterns. (Poster presented at Rencontre des Etudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RECITAL), June 6-10, Dourdan, France.) (Unpublished.) http://www.nltg.brighton.ac.uk/home/Marina.Santini/_poster_recital_horizontal.pdf Date of access: 6 Jan. 2008. REFERENCE LIST 163 SANTINI, M. 2006a. Common criteria for genre classification: annotation and granularity. (Paper delivered as part of the Workshop on Text-based Information Retrieval (TIR-06), Riva del Garda, Italy, August 29. p. 35^0) (Unpublished) http://www.uni-weimar.de/medien/webis/research/tir/tir-06/ Date of access: 6 Jan. 2008. SANTINI, M. 2006b. Web pages, text types, and linguistic features: some issues. ICAME journal. 30: 67-86, Apr. http://icame.uib.no/ij30/ij30-page67-86.pdf Date of access: 6 Jan. 2008. SAVILLE-TROIKE, M. 1982. The ethnography of communication: an introduction. Oxford: Blackwell. 209 p. SCHOLKOPF, B. & SMOLA, A.J. 2002. Learning with kernels: support vector machines, regularization, optimization, and beyond. Cambridge, MA: MFr. 624 p. SCHOLKOPF, B.; SMOLA, A.J.; WILLIAMSON, R. C ; & BARTLETT, P. L. 2000. New support vector algorithms. Neural computation, 12(5): 1207-1245, May. SCOTT, M. 2004. Oxford WordSmith Tools 4.0, Version 4.0.0.268. Oxford: Oxford University Press. SHAPIRO, S. S.; WILK, M. B.; & CHEN, H. J. 1968. A comparative study of various tests for normality. Journal of the American Statistical Association, 63(324): 1343- 1372, Dec. SHEPHERD, M.; WATERS, C ; & KENNEDY, A. 2004. Cybergenre: automatic identification of home pages on the web. The journal of web engineering, 3(3 & 4): 236-251. http://users.cs.dal.ca/~shepherd/pubs/JWE040722.pdf Date of access: 6 Jan. 2008. SJEGEL, S. & CASTELLAN, N.J. JR. 1988. Nonparametric statistics for the behavioural sciences. 2nd ed. New York, NY: McGraw-Hill. 399 p. SIGLEY, R. 1997. Text categories and where you can stick them: a crude formality index. International journal of corpus linguistics, 2(2): 199-237. SMITH, E.A. & KINCAID, P. 1970. Derivation and validation of the automated readability index for use with technical materials. Human factors, 12:457^-64. SMITH, N. 1997. Improving a tagger. (In Garside, R.; Leech, G.; & McEnery, A., eds. Corpus annotation: linguistic information from computer text corpora. London: Longman, p. 137-150.) SPARCK JONES, K. 1972. 
A statistical interpretation of term specificity and its application in retrieval. Journal of documentation, 28(1): 11-21. REFERENCE LIST 164 STAMATATOS, E.; FAKOTAKIS, N.; & KOKKINAKIS, G. 2000a. Text genre detection using common word frequencies. (In the Proceedings of the 18th International Conference on Computational Linguistics (COLING 2000), Vol. 2, Saarbriicken, Germany, July 31 - August 4. San Francisco, CA: Morgan Kaufmann. p. 808-814.) http://delivery.acm.Org/10.l 145/1000000/992763/p808- stamatatos.pdf ?key 1 =992763&key2=0100279911 &coll=GUIDE&dl=GUIDE&CFID =11357499&CFTOKEN=78694650 Date of access: 6 Jan. 2008. STAMATATOS, E.; FAKOTAKIS, N.; & KOKKINAKIS, G. 2000b. Automatic text categorisation in terms of genre and author. Computational linguistics, 26(4):471- 495, Dec. STATSOFT, Inc. 2006. STATISTICA (data analysis software system), Version 7.1. STATSOFT, Inc. 2004. STATISTICA electronic manual. STATISTICA (data analysis software system), Version 7. STECKTNG, R. & SCHEBESCH, K.B. 2003. Support vector machines for credit scoring: comparing to and combining with some traditional classification methods. (In Schader, M.; Gaul, W.; & Vichi, M., eds. Between data science and applied data analysis. Berlin: Springer, p. 604-612.) SUC. 1997. SUC 1.0 Stockholm Umea Corpus, Version 1.0. Department of Linguistics, Umea University and Department of Linguistics, Stockholm University. http://www.ling.su.se/DaLi/suc/sucl.O_info.html Date of access: 6 Jan. 2008. SVARTVIK, J. & QUIRK, R., eds. 1980. A corpus of English conversation. Lund: C.W.K. Gleerup. 893 p. SWALES, J.M. 1990. Genre analysis: English in academic and research settings. Cambridge: Cambridge University Press. 260 p. TAPANAINEN, P. & JARVINEN, T. 1997. A non-projective dependency parser. (In the Proceedings of the 5th Conference on Applied Natural Language Processing (ANLP'97), Washington, DC, U.S.A., March 31 - April 3. San Francisco, CA: Morgan Kaufmann. p. 64-71.) http://portal.acm.org/ft_gateway.cfm?id=974568&type=pdf Date of access: 6 Jan. 2008. TEUFEL, S. & MOENS, M. 1999. Discourse-level argumentation in scientific articles: human and automatic annotation. (In the Proceedings of the Association for Computational Linguistics (ACL'99) Workshop Towards Standards and Tools for Discourse Tagging, College Park, MD, U.S.A., June 22. Morristown, NJ: Association for Computational Linguistics.) http://www.aclweb.org/anthology- new/W/W99/W99-031 l.pdf Date of access: 6 Jan. 2008. TRUSHKINA, J. 2006. Automatic error detection in second language learners' writing. Language matters, 37(2): 141-159. REFERENCE LIST 165 TRUSS, L. 2003. Eats, shoots & leaves: the zero tolerance approach to punctuation. London: Profile. 209 p. TULDAVA, J. 1995. Methods in quantitative linguistics. Trier: WVT Wissenschaftlicher. 187 p. TURNER, J. 1999. Academic literacy and the discourse of transparency. (In Jones, C ; Turner, J.; & Street, B., eds. Students writing in the university: cultural and epistemological issues. Philadelphia: John Benjamins, p. 150-160.) UCREL. 2006. CLAWS part-of-speech tagger. http://www.comp.lancs.ac.uk/ucrel/claws/ Date of access: 6 Jan. 2008. VAN DE POEL, K. 2006. Scribende: academic writing for students. Acco: Leuven. 175 p. VAN HALTEREN, H.; BAA YEN, R. H.; TWEEDIE, F.; HAVERKORT, M.; & NELTT, A. 2005. New machine learning methods demonstrate the existence of a human stylome. Journal of quantitative linguistics, 12(1): 65-77, Apr. VAN RUSBERGEN, C.J. 1975. Information retrieval. London: Butterworths. 208 p. VAPNIK, V.N. 
1982. Estimation of dependencies based on empirical data. Berlin: Springer. 399 p. VAPNIK, V.N. 2000. The nature of statistical learning theory. 2nd ed. New York: Springer. 314 p. VAPNIK V.N & CHERVONENKIS, A. 1981. The necessary and sufficient conditions for uniform convergence of means to their expectations. Theory of probability and its applications, 26(3):532-553. VAPNIK V.N & LERNER, A. 1963. Pattern recognition using generalized portrait method. Automation and remote control, 24:774-780. WASTHOLM, P.; KUSMA, A.; & MEGYESI, B. B. 2005. Using linguistic data for genre classification. (In Funk P.; Rognvaldsson T.; & Xiong N., eds. The Proceedings of Swedish Artificial Intelligence Society and the Swedish Society for Learning Systems (SAIS-SSLS 2005), Malardalen University, Vasteras, Sweden, April 12-14.) http://stp.ling.uu.se/~bea/wastholm-megyesi-sais05.pdf Date of access: 6 Jan. 2008. WEISS, G.M. & PROVOST, F. 2003. Learning when training data are costly: the effect of class distribution on tree induction. Journal of artificial intelligence research, 19:315-354, July/Dec. WESTON, J.; MUKHERJEE, S.; CHAPELLE, O.; PONNTIL, M.; POGGIO, T.; & VAPNIK, V. 2001. Feature selection for SVMs. (In Solla, S.A.; Leen, T.K.; & Muller, K.-R. Advances in neural information processing systems, Vol. 13. Cambridge, MA: MIT. p. 668-674.) REFERENCE LIST 166 WHITE, M.; CARDIE, C.;&NG,V. 2002. Detecting discrepancies in numeric estimates using multidocument hypertext summaries. (In the Proceedings of the 2nd International Conference on Human Language Technology Research, San Diego, California, U.S.A., March 24-27. San Francisco, CA: Morgan Kaufmann. p. 336- 341.). http://www.hlt.utdallas.edu/~vince/papers/hlt02.pdf Date of access: 6 Jan. 2008. WITTEN, I. & FRANK, E. 2000. Data mining: practical machine learning tools with Java implementations. San Francisco, CA: Morgan Kaufmann. 371 p WITTGENSTEIN, L. 1953. Philosophical investigations. Oxford: Blackwell. 232 p WOLTERS, M. & KIRSTEN, M. 1999. Exploring the use of linguistic features in domain and genre classification. (In the Proceedings of the 9th Conference of the European Chapter of the Association for Computational Linguistics, Bergen, Norway, June 8-12. Morristown, NJ: Association for Computational Linguistics, p. 142- 149.) http://delivery.acm.Org/10.l 145/980000/977055/pl42- wolters.pdf?keyl=977055&key2=2820279911&coll=GUIDE&dl=GUIDE&CFID=l 1357924&CFTOKEN=24954924 Date of access: 6 Jan. 2008. YANG, Y. & PEDERSEN, J. O. 1997. A comparative study on feature selection in text categorization. (In Fisher, D.H., ed. The Proceedings of the 14th International Conference on Machine Learning (ICML'97), Nashville, Tennessee, U.S.A., July 8- 12. San Francisco, CA: Morgan Kaufmann. p. AX2-MQ.) http://citeseer.ist.psu.edU/cache/papers/cs/1982/http:zSzzSzwww.cs.cmu.eduzSz~yim ingzSzpapers.yyzSzml97.pdfZyang97comparative.pdf Date of access: 6 Jan. 2008. CORPUS INFORMATION 167 APPENDIX 1 Corpus information A 1.1. Subjects Biochemistry International Relations Business Law Comparative American Studies Literature Computer Science Management Science Economics Mathematics English and Theatre Medicine European Industrial Relations Philosophy Film and Literature Politics French Psychology German Sociology History Theatre and Performance Studies Table A 1.1.1: List of departments/courses from which essays used are drawn A1.2. 
Essays used in this project Essay number Grade Department/subject 0001a 71 Sociology 0001b 67 Sociology 0001c 68 Sociology OOOld 75 Sociology OOOle 67 Sociology 0002a 70 Sociology 0002b 71 Sociology 0004a 68 Sociology 0004b 75 Sociology 0004c 65 Sociology 0004d 67 Sociology 0005a 69 History APPENDIX 1 168 Essay number Grade Department/subject 0005b 72 History 0005c 74 History 0006a 85 Biological Sciences 0006c 65 Biological Sciences 0009d 85 Biological Sciences 0009e 65 Biological Sciences 0010a 66 History 0010b 65 History 0010c 73 History OOlOd 70 History OOlOe 72 Sociology 0011a 75 Psychology 0011b 75 Psychology 0011c 81 Psychology 001 Id 75 Psychology 001 le 75 Psychology 001 If 85 Psychology 001 lg 65 Psychology 0012a 70 History 0012b 72 History 0012c 70 Comparative American Studies 0012d 65 History 0013a 68 History 0013b 73 History 0013c 67 History 0013d 65 History 0014a 65 Psychology 0014b 65 Psychology 0014c 65 Psychology 0014d 85 Psychology 0014e 65 Psychology 0015a 68 History CORPUS INFORMATION 169 Essay number Grade Department/subject 0015b 66 History 0016a 74 Psychology 0016c 72 Psychology 0017a 65 Psychology 0017b 65 Psychology 0019a 74 History 0019b 71 History 0019c 68 History 0019d 66 History 0019e 72 History 0019f 64 History 0019g 68 History 0019h 72 History 0019i 65 History 0019j 65 History 0020b 65 Psychology 0020c 75 Psychology 0020d 70 Psychology 0020e 68 Psychology 0020f 68 Psychology 0020g 68 Psychology 0020h 68 Psychology 0021c 75 Engineering 0022a 68 Psychology 0022b 65 Psychology 0022c 75 Psychology 0024a 67 Theatre and Performance Studies 0024b 71 Theatre and Performance Studies 0024c 68 Theatre and Performance Studies 0024d 67 Theatre and Performance Studies 0024e 71 Theatre and Performance Studies 0024f 67 Theatre and Performance Studies APPENDIX 1 170 Essay number Grade Department/subject 0024g 65 Theatre and Performance Studies 0024h 65 Theatre and Performance Studies 0026a 68 Philosophy 0026b 68 Philosophy 0029a 67 History 0029b 66 History 0029c 68 Comparative American Studies 0029f 65 History 0029h 65 Comparative American Studies 0029i 65 History 0029J 65 History 0029k 65 History 00291 65 Comparative American Studies 0029m 65 Comparative American Studies 0029n 65 Comparative American Studies 0029o 65 History 0029p 65 Comparative American Studies 0030a 66 History 0030b 66 History 0031a 75 Psychology 0031b 85 Psychology 0031c 75 Psychology 0032a 85 Biological Sciences 0033a 84 Psychology 0033b 85 Psychology 0033c 90 Psychology 0033d 75 Psychology 0033e 75 Psychology 0034a 65 Politics 0034b 65 Politics 0034c 85 Politics 0034e 67 Politics CORPUS INFORMATION 171 Essay number Grade Department/subject 0037a 72 Psychology 0037b 68 Psychology 0038a 69 Philosophy 0038b 65 Philosophy 0038c 69 Philosophy 0039a 67 History 0039b 66 History 0039d 65 History 0039e 64 History 0039f 64 History 0040a 70 History 0040b 67 History 0040d 67 History 0040e 68 History 0044a 70 History 0044b 73 History 0045a 79 Biological Sciences 0048a 85 Medicine 0053a 70 Economics 0053b 70 Economics 0053d 68 Economics 0058f 68 Economics 0063a 68 History and French 0063b 68 History and French 0063c 63 History 0063d 65 History and French 0064a 80 Law 0064b 68 Law 0064e 78 Law 0064f 70 Law 0069c 65 Law 0069d 65 Law APPENDIX 1 172 Essay number Grade Department/subject 0073a 74 Business 0073b 68 Business 0073c 73 Business 0073d 72 Business 0075a 76 Sociology 0075c 70 Politics 0075d 68 Politics 0075e 71 Politics 0075f 73 Politics 0075g 61 Sociology 0075i 63 Sociology 0075j 65 Politics 0075k 68 Politics 
00751 66 Politics 0075m 72 Politics 0082a 68 Psychology 0082b 65 Psychology 0082c 62 Psychology 0082d 68 Psychology 0082e 80 Psychology 0082f 62 Psychology 0082g 68 Psychology 0082i 65 Psychology 0082J 75 Psychology 0084a 85 Psychology 0098a 65 History 0098b 68 History 0098c 68 Theatre 0117a 72 Economics 0118a 65 Economics 0119c 69 Law 0119d 70 Law CORPUS INFORMATION 173 Essay number Grade Department/subject 0119e 65 Law 0119g 67 Law 0126b 68 English and Theatre 0126c 68 English and Theatre 0126d 73 English and Theatre 0126f 72 English and Theatre 0129a 65 Comparative American Studies 0129b 71 Comparative American Studies 0129c 63 History 0129d 70 Comparative American Studies 0129e 64 History 0129f 72 History 0129g 68 History 0129h 63 History 0129i 70 Comparative American Studies 0129J 68 Comparative American Studies 0130a 72 Comparative American Studies 0130b 69 Comparative American Studies 0130c 70 Comparative American Studies 0130d 70 Comparative American Studies 0130e 68 Comparative American Studies 0130f 72 Comparative American Studies 0130g 70 Comparative American Studies 0130h 68 Comparative American Studies 0130i 70 Comparative American Studies 0130j 73 Comparative American Studies 0131a 68 Psychology 0131b 65 Psychology 0135a 80 Politics 0135b 80 Politics 0135c 74 Politics 0135d 70 Politics APPENDIX 1 174 Essay number Grade Department/subject 0135e 64 Politics 0135f 85 Politics 0135g 65 Politics 0135h 70 Politics 0135i 71 Politics 0137a 66 History 0137b 67 History 0137c 60 Politics 0137e 69 Politics 0137f 71 Politics 0137h 65 Politics 0140a 65 Sociology 0140b 65 Sociology 0140c 65 Sociology 0140d 65 Sociology 0140f 65 Sociology 0143b 65 Law 0143c 65 Law 0143d 62 Law 0144a 65 History 0144b 71 History 0144c 67 History 0144d 66 History 0144e 68 History 0150a 65 Theatre 0150b 65 Theatre 0150c 65 Theatre 0169d 70 Business 0171a 60 Psychology 0171c 61 Psychology 0177b 85 Philosophy 0179a 85 Sociology CORPUS INFORMATION 175 Essay number Grade Department/subject 0179b 65 Sociology 0179c 65 Sociology 0179d 65 Sociology 0179e 65 Sociology 0179f 65 Sociology 0179g 65 Sociology 0179h 65 Sociology 0179i 85 Sociology 0179J 65 Sociology 0181b 75 Physics 0190a 68 Psychology 0190b 68 Psychology 0190c 75 Psychology 0190d 68 Psychology 0191b 71 Law 0191c 65 Law 0192a 62 Psychology 0192b 62 Psychology 0192c 62 Psychology 0202a 70 Economics 0202b 65 Economics 0202c 74 Economics 0202d 63 Economics 0202e 70 Economics 0202f 64 Economics 0202g 65 Economics 0202h 62 Economics 0202i 67 Economics 0202m 70 Business 0202n 74 Business 0209a 70 Law 0209b 70 Law APPENDIX 1 176 Essay number Grade Department/subject 0209c 61 Law 0209e 70 Law 0209g 68 Law 0215b 70 Philosophy 0215c 67 Philosophy 0215d 68 Philosophy 0215e 62 Philosophy 0224a 64 Film and Literature 0224c 65 Film and Literature 0224d 72 Film and Literature 0224e 70 Film and Literature 0228c 65 Computer Science 0235a 72 Philosophy 0235b 78 Philosophy 0235c 63 Philosophy 0235d 74 Philosophy 0238a 62 Psychology 0238b 60 Psychology 0238c 62 Psychology 0238d 62 Psychology 0238e 60 Psychology 0240a 62 Psychology 0240b 75 Psychology 0240c 80 Psychology 0240d 65 Psychology 0240e 68 Psychology 0244d 62 Comparative American Studies 0244e 64 History 0244f 65 History 0244g 63 History 0244h 64 Comparative American Studies 0244i 63 Comparative American Studies CORPUS INFORMATION 177 Essay number Grade Department/subject 0244J 64 Politics 0244k 70 Politics 02441 63 Politics 0244m 60 Politics 0244n 61 Politics 0247a 65 Biological Sciences 0252a 67 Comparative American 
Studies 0252b 70 Comparative American Studies 0252c 67 History 0252d 67 Sociology 0252e 70 Sociology 0252f 64 Sociology 0252g 67 Sociology 0252h 69 Sociology 0252J 70 Sociology 0252k 72 Sociology 02521 64 Sociology 0252m 64 Sociology 0252n 73 History 0252o 64 Comparative American Studies 0252p 66 Comparative American Studies 0252q 67 Sociology 0252r 72 History 0252s 70 Comparative American Studies 0252t 67 History 0255a 65 Politics 0255b 65 Politics 0255c 67 Politics 0255d 64 Politics 0255e 64 Politics 0255f 70 History 0255g 64 History APPENDIX 1 178 Essay number Grade Department/subject 0255h 70 History 0259a 70 Philosophy 0259b 70 Philosophy 0259c 80 Philosophy 0259d 73 Philosophy 0260a 65 Medicine 0260c 65 Medicine 0262a 75 Psychology 0262b 75 Psychology 0262c 68 Psychology 0262d 65 Psychology 0262e 67 Psychology 0262f 65 Psychology 0262g 68 Psychology Table Al.2.1: List of essays from the BAWE and their grading in percentage LINGUISTIC FEATURES 179 APPENDIX 2 Linguistic features A2.1. Parts-of-speech Tag Description Examples CC Coordinating conjunction and, but, either or, or CD Cardinal number includes roman numerals and fractions: 1, one, IV, two-thirds DT Determiner includes articles and determiners: this, the, an, any, another EX Existential there there (was a party in progress) FW Foreign word102 in vivo, a priori, et cetera, IN Preposition or subordinating conjunction in, of, if, although JJ Adjective includes ordinal numbers -.first, green, mammary, acidic JJR Adjective, comparative103 with the comparative ending er with a strictly comparative meaning: greener, more, less, greater JJS Adjective, superlative104 with superlative ending est and superlative meaning: greenest, most, least, worst LS105 List item marker L,(a) MD Modal includes possibility, necessity and predictive modals: can, ought, shall, may NN Noun, singular or mass re-enactment, flour, dog, subclass NNP Proper noun, singular John, UN, Brighton, Christmas NNPS Proper noun, plural Labradors NNS Noun, plural dogs, subclasses PDT Pre-determiner pre-qualifiers: quite, rather, such; pre-quantifiers: all, half, many, nary; and both. Foreign words would include e.g. and i.e. but this has not been done for this study. 103 More and less are tagged JJR when alone or used as adjectives. 104 Most and least are tagged JJS when alone or used as adjectives. 105 It was decided to exclude LS as a variable for the final classifier development, as it is not reliably tagged. 
APPENDIX 2 180 Tag Description Examples POS Possessive ending nouns ending in 's: John's, Labradors', UN's PRP Personal pronoun includes subject, object, reflexive pronouns, and the impersonal pronoun: /, me, it, himself PRP$ Possessive pronoun my, your, mine, yours RB Adverb includes most words ending in ly, degree words, post head-modifiers, negative markers and nominal adverbs: quite, enough, never, mildly RBR Adverb, comparative Adverbs with the comparative ending er with a strictly comparative meaning: better RBS Adverb, superlative Adverbs with the comparative ending est with a strictly superlative meaning: best RP Particle {tell) off, {run) up, {break) through SYM 106 Symbol Should be used for mathematical, scientific or technical symbols: >, +, TO To to UH Interjection uh, well, yes, my VB Verb, base form subsumes imperatives, infinitives and subjunctives: sing, be, do, have VBD Verb, past tense includes the conditional form of the verb to be: sang, was, did, had VBG Verb, gerund or present participle singing, being, doing, having VBN Verb, past participle sung, been, done, had VBP Verb, non-3rd person singular present sing, am, are, do, have VBZ Verb, 3rd person singular present sings, is, does, has WDT W7i-determiner when used as a relative pronoun: which, that WP W7i-pronoun what, who, whom WP$ Possessive w/i-pronoun whose It was decided to exclude SYM as a variable for the final classifier development, as it is not reliably tagged. LINGUISTIC FEATURES 181 Tag Description Examples WRB W/i-adverb how, where, why Table A2.1.1: List of Penn Treebank part-of-speech tags Tag Description Examples APPGE Possessive pronoun, pre-nominal my, your, our AT Article the, no ATI Singular article a, an, every BCL Before-clause marker in order (that), in order (to) CC Coordinating conjunction and, or CCB Adversative coordinating conjunction but CS Subordinating conjunction if, because, unless, so, for CSA As (as conjunction) as CSN Than (as conjunction) than CST That (as conjunction) that CSW Whether (as conjunction) whether DA After-determiner or post- determiner capable of pronominal function such, former, same DAI Singular after-determiner little, much DA2 Plural after-determiner few, several, many DAR Comparative after-determiner more, less, fewer DAT Superlative after-determiner most, least, fewest DB Before determiner or pre­ determiner capable of pronominal function all, half DB2 Plural before-determiner both DD Determiner (capable of pronominal function) any, some DD1 Singular determiner this, that, another DD2 Plural determiner these, those DDQ W/j-determiner which, what DDQGE W/j-determiner, genitive whose APPENDIX 2 182 Tag Description Examples DDQV Wft-ever determiner, whichever, whatever EX Existential there there FO Chemical and mathematical formulae IQR/s FU Unclassified word FW Foreign word in vivo, a priori, et cetera GE Germanic genitive marker ','s IF For (as preposition) for n General preposition in, if, although 10 0/(as preposition) of IW With, without (as prepositions) with, without JJ General adjective first, green, mammary, acidic JJR General comparative adjective with the comparative ending er with a strictly comparative meaning: older, better, stronger JJT General superlative adjective with superlative ending est and superlative meaning: oldest, best, strongest JK Catenative adjective able in, be able to, willing in, be willing to MC Cardinal number, neutral for number two, three MC1 Singular cardinal number one MC2 Plural cardinal number sixes, sevens MCGE*107 
Genitive cardinal number, neutral for number two's, 100's MCMC Hyphenated number 40-50,1770-1827 MD Ordinal number first, second, next, last MF Fraction, neutral for number quarters, two-thirds ND1 Singular noun of direction north, southeast NN Common noun, neutral for number sheep, cod, headquarters NN1 Singular common noun book, girl Please note that all terms marked with an asterisk indicate terms that were either not found in the texts or which occurred in less than 1 percent of the texts. Therefore, these terms do not form part of the data used. LINGUISTIC FEATURES 183 Tag Description Examples NN2 Plural common noun books, girls NNA Following noun of title MA. NNB Preceding noun of title Mr., Prof. NNL1 Singular locative noun island, street NNL2* Plural locative noun islands, streets NNO Numeral noun, neutral for number dozen, hundred NN02 Numeral noun, plural hundreds, thousands NNT1 Temporal noun, singular day, week, year NNT2 Temporal noun, plural days, weeks, years NNU Unit of measurement, neutral for number in, cc NNU1 Singular unit of measurement inch, centimetre NNU2 Plural unit of measurement ins., feet NP Proper noun, neutral for number IBM, Andes NP1 Singular proper noun London, Jane, Frederick NP2 Plural proper noun Browns, Reagans, Koreas NPD1 Singular weekday noun Sunday NPD2* Plural weekday noun Sundays NPM1 Singular month noun October NPM2* Plural month noun Octobers PN Indefinite pronoun, neutral for number none PN1 Indefinite pronoun, singular anyone, everything, nobody, one PNQO Objective w/i-pronoun whom PNQS Subjective w/i-pronoun who PNQV W/i-ever pronoun whoever PNX1 Reflexive indefinite pronoun oneself PPGE Nominal possessive personal pronoun mine, yours PPH1 3 rd person singular neuter personal pronoun it APPENDIX 2 184 Tag Description Examples PPH01 3 rd person singular objective personal pronoun him, her PPH02 3 rd person plural objective personal pronoun them PPHS1 3 rd person singular subjective personal pronoun he, she PPHS2 3 rd person plural subjective personal pronoun they PPIOl 1 st person singular objective personal pronoun me PPI02 1 st person plural objective personal pronoun us PPIS1 1 st person singular subjective personal pronoun I PPIS2 1 st person plural subjective personal pronoun we PPX1 Singular reflexive personal pronoun yourself, itself PPX2 Plural reflexive personal pronoun yourselves, themselves PPY 2nd person personal pronoun you RA Adverb, after nominal head else, galore REX Adverb introducing Appositional constructions namely, e.g. 
RG Degree adverb very, so, too RGQ Wh- degree adverb how RGQV W7i-ever degree adverb however RGR Comparative degree adverb more, less RGT Superlative degree adverb most, least RL Locative adverb alongside, forward RP Preposition adverb, particle about, in RPK Preposition adverb, catenative about in, be about to RR General adverb quite, enough, never, mildly RRQ Wh- general adverb where, when, why, how RRQV W7i-ever general adverb wherever, whenever LINGUISTIC FEATURES 185 Tag Description Examples RRR Comparative general adverb Adverbs with the comparative ending er with a strictly comparative meaning: better, longer RRT Superlative general adverb Adverbs with the comparative ending est with a strictly superlative meaning: best, longest RT Quasi-nominal adverb of time now, tomorrow TO Infinitive marker to UH Interjection oh, yes, um VBO* Be, base form subsumes imperatives and subjunctives: be VBDR Were were VBDZ Was was VBG Being being VBI Be, infinitive to be or not, it will be VBM Am am VBN Been been VBR Are are VBZ Is is VDO Do, base form subsumes imperatives and subjunctives: do VDD Did did VDG Doing doing VDI Do, infinitive I may do, to do VDN Done done VDZ Does does VHO Have, base form subsumes imperatives and subjunctives: have VHD Had, past tense had VHG Having having VHI Have, infinitive have VHN Had, past participle had VHZ Has has VM Modal, auxiliary can, will, would APPENDIX 2 186 Tag Description Examples VMK Modal, catenative ought, used W O Base form of lexical verb give, work VVD Past tense of lexical verb gave, worked VVG ing Participle of lexical verb giving, working VVGK ing Participle, catenative going in, be going to VVI Infinitive to give, it will work VVN Past participle of lexical verb given, worked VVNK Past participle, catenative bound in be bound to vvz s Form of lexical verb gives, works XX Analytic negation not, n 't ZZ1 Singular letter of the alphabet A,b ZZ2* Plural letter of the alphabet A's,b's Table A2.1.2: List of UCREL CLAWS7 part-of-speech tags A2.2. Punctuation marks Tag Punctuation mark colon : exclamation mark / full stop. question mark ? semi-colon; Table A2.2.1: List of punctuation tags A2.3. Quotations Tag Description Quotations integrated with text Non-English quotations integrated with text Quotations non-integrated with text LINGUISTIC FEATURES 187 Tag Description Non-English quotations non-integrated with text Sentence counts for non-integrated quotations Table A2.3.1: List of quotation tags A2.4. Nominalisations Tag Nominalisational suffix-singular Tag Nominalisational suffix - plural ism isms ity ities ment merits ness nesses tion tions Table A2.4.1: List of nominalisational suffixes and their respective tags A2.5. Text statistics 1. word count; 2. word length in characters; 3. long words (> 6 characters); 4. type/token ratio; 5. sentence count; 6. sentence length in words; 7. paragraph count; 8. paragraph length in sentences; 9. readability score: the Automated Readability Index and the Flesch Reading Ease scores; and 10. number of references (tagged ). APPENDIX 2 188 A2.6. 
Key function words also fundamentally merely seemingly through although furthermore more significantly thus amongst hence moreover similarly to arguably highly necessarily simply toward as however nonetheless socially towards because importantly not specifically ultimately between in of such upon concerning increasingly often than whether consequently indeed perhaps that which conversely infact previously the whilst despite inherently primarily their within due itself purely themselves yet essentially largely rather therefore established lastly regarding these firstly many secondly this Table A2.6.1: List of the key function words of the top 1000 key words A2.7. Most frequent words in the BNC a by if she was all for in so we an from is that were and had it the when are has more their which as have not there who at he of they will be her on this with been his one to would but I or up you Table A2.7.1: List of the top fifty words in the written section of the BNC LINGUISTIC FEATURES 189 A2.8. Prepositions against by minus* per via amid despite notwithstanding than versus amidst during of through with among except off throughout within amongst for on to without at from onto toward besides in opposite towards between into out upon Table A2.8.1: List of simple prepositions Tag Preposition Tag Preposition according to nearer to* next to along with onto apart from out of as for outside of as of owing to as per* preliminary to* as to preparatory to* away from previous to* because of prior to close to pursuant to* contrary to regardless of devoid of save for* due to subsequent to* except for* thanks to exclusive of* up against inside of* up to instead of upwards of* irrespective of void of APPENDIX 2 190 Tag Preposition Tag Preposition near to Table A2.8.2: List of two-word complex prepositions Tag Preposition Tag Preposition as far as in need of at the expense of in place of at the hands of in process of* at variance with* in quest of* by dint of* in reference to by means of in regard to* by virtue of in relation to by way of in respect of* for sake of* in respect to* for the sake of in return for for want of* in search of from want of* in spite of in accordance with in the face of in addition to in the light of in aid of* in the process of in back of* in view of* in case of* on account of* in common with* on behalf of in comparison to on ground of* in comparison with* on grounds of* in compliance with* on pain of* in conformity with* on the ground of* in consequence of* on the grounds of in contact with* on the matter of* in exchange for on the part of LINGUISTIC FEATURES 191 Tag Preposition Tag Preposition in face of* on the strength of* in favour of on top of in front of with reference to in lieu of* with regard to in light of with respect to in line with with the exception of Table A2.8.3: List of three-word complex prepositions A2.9. 
Reporting verbs acknowledge comment exclaim* proclaim submit add complain* explain promise suggest admit concede forecast* pronounce* swear* affirm confess* foretell* protest testify* agree confide* guarantee remark vow* allege* confirm hint repeat warn announce* contend insist reply write argue convey maintain report assert declare mention retort* boast* deny object say claim disclose predict state Table A2.9.1: List of public factual verbs accept determine forget mean reflect anticipate discern gather note remember ascertain discover guess* notice reveal assume doubt hear observe see believe dream hold perceive show calculate ensure hope presume signify check establish imagine presuppose* suppose* conclude estimate imply pretend suspect APPENDIX 2 192 conjecture* expect indicate prove think consider fancy* infer realise understand decide fear insure* reason deduce feel judge recall deem find know reckon* demonstrate foresee* learn recognise Table A2.9.2: List of private factual verbs allow demand intend prefer resolve arrange* desire move pronounce rule ask enjoin* ordain* propose* stipulate* beg* entreat* order recommend* urge command grant pledge* request vote decree* instruct pray* require Table A2.9.3: List of suasive verbs confirm discuss Table A2.9.4: List of miscellaneous reporting verbs seem appear Table A2.9.5: List of perception verbs A2.10. Conjunctions again eventually instead nonetheless similarly alternatively finally last notwithstanding subsequently altogether first likewise originally then consequently firstly meanwhile otherwise therefore conversely furthermore moreover overall third e.g. hence namely rather thirdly else however nevertheless second thus LINGUISTIC FEATURES 193 equally i.e. next secondly viz.* Table A2.10.1: List of conjunctive adjuncts/conjuncts/linking adverbials Tag Conjunct Tag Conjunct as a consequence in comparison as a result in conclusion by comparison* in consequence* by contrast in contrast first of all* in other words for example in particular for instance in sum* in addition in summary in any case in the first place in any event* in the second place* Table A2.10.2: List of multi-word conjunctive adjuncts/conjuncts/linking adverbials although once unless whereas whilst as since until whereby because though when whereupon* if till whenever while Table A2.10.3: List of subordinating conjunctions and but or Table A2.10.4: List of coordinating conjunctions A2.11. Downtoners almost just partially scarcely virtually barely little partly simply but merely practically slightly enough nearly quite somewhat APPENDIX 2 194 hardly only rather sufficiently Table A2.11.1: List of downtoners A2.12. 
Stance adverbs accordingly frankly* mainly reportedly* typically confidentially* generally speaking technically figuratively* honestly* strictly truthfully* Table A2.12.1: List of non-factual stance adverbs actually certainly indeed never really always definitely inevitably obviously undoubtedly Table A2.12.2: List of factual stance adverbs Tag Adverb Tag Adverb in fact without doubt of course no doubt Table A2.12.3: List of two-word factual stance adverbs apparently most cases* possibly roughly evidently most instances* predictably* sort of kind of perhaps probably Table A2.12.4: List of likelihood stance adverbs amazingly* disturbingly* ironically sensibly wisely* astonishingly* hopefully* regrettably* surprisingly conveniently* fortunately rightly unbelievably* curiously* importantly sadly unfortunately Table A2.12.5: List of attitudinal stance adverbs LINGUISTIC FEATURES 195 A2.13. Stance adjectives acceptable curious hopeful* odd* thankful* adamant* depressed horrible* okay* tragic advisable* desirable hurt paradoxical typical afraid disappointed* imperative peculiar unacceptable alarmed* dissatisfied* incidental* pleased unaware amazed* distressed* inconceivable preferable uncomfortable amazing* disturbed incredible reassured* understandable amused* dreadful indisputable relieved* unfair angry embarrassing interesting ridiculous unfortunate* annoyed* encouraged ironic sad unhappy annoying* essential irritated* satisfied unlucky* anomalous extraordinary lucky sensible unthinkable* appropriate fitting mad shocked untypical* astonished* fortunate natural shocking unusual aware frightened neat silly* upset awful* funny* necessary sorry* upsetting* careful glad* nice* strange vital concerned good notable stupid wonderful* conceivable grateful* noteworthy sufficient worried critical great noticeable surprised* crucial happy obligatory surprising Table A2.13.1: List of attitudinal stance adjectives accepted confident false positive sure apparent convinced impossible proved true certain correct inevitable plain well-known* clear evident obvious right Table A2.13.2: List certainty/factual stance adjectives APPENDIX 2 196 doubtful probable likely unlikely possible Table A2.13.3: List of likelihood stance adjectives apt likely certain prone due sure guaranteed unlikely liable Table A2.13.4: List of certainty stance adjectives able disposed hesitant* prepared sufficient anxious doomed inclined quick unable bound eager insufficient ready unwilling careful eligible keen reluctant welcome competent fit loath* set willing determined greedy* obliged slow Table A2.13.5: List of ability/willingness stance adjectives afraid concerned embarrassed* impatient* puzzled* amazed* content free indignant* relieved* angry curious furious* nervous sorry* annoyed* delighted* glad* perturbed* surprised* ashamed* disappointed* grateful pleased worried astonished* disgusted* happy proud Table A2.13.6: List of personal affective stance adjectives difficult pleasant easier possible easy tough* hard unpleasant LINGUISTIC FEATURES 197 impossible Table A2.13.7: List of ease/difficulty stance adjectives awkward* criminal important reasonable useless appropriate cumbersome improper right unreasonable* bad desirable inappropriate safe unseemly* best dreadful interesting sick unwise* better essential logical silly* vital brave expensive lucky smart* wise careless foolhardy* mad stupid wonderful* convenient fruitless necessary surprising worse crazy* good nice* useful wrong Table A2.13.8: List of evaluation stance adjectives 
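The word lists in this appendix define count features: each essay is scored by how often items from a given list occur in it, and these counts can then be adjusted for essay length. Purely as an illustration, and not as the implementation used in this study, the following minimal Python sketch counts matches for two abbreviated sample lists and normalises them per 1,000 words; the list contents shown here are truncated, the function names are hypothetical, and multi-word items (for example, 'sort of' or 'in fact') would require separate handling.

    import re
    from collections import Counter

    # Abbreviated samples of the downtoner (Table A2.11.1) and attitudinal
    # stance adjective (Table A2.13.1) lists; the full lists would be used in practice.
    DOWNTONERS = {"almost", "barely", "hardly", "merely", "nearly", "partly",
                  "quite", "rather", "slightly", "somewhat"}
    ATTITUDINAL_ADJECTIVES = {"appropriate", "essential", "interesting",
                              "necessary", "surprising", "unfortunate"}

    def tokenise(text):
        # Simple lower-cased word tokeniser; the study worked with POS-tagged text.
        return re.findall(r"[a-z']+", text.lower())

    def wordlist_features(text):
        tokens = tokenise(text)
        counts = Counter(tokens)
        raw = {
            "downtoners": sum(counts[w] for w in DOWNTONERS),
            "attitudinal_adjectives": sum(counts[w] for w in ATTITUDINAL_ADJECTIVES),
            "word_count": len(tokens),
        }
        # Text-length normalised frequencies (per 1,000 words).
        per_1000 = {k: 1000.0 * v / raw["word_count"]
                    for k, v in raw.items() if k != "word_count"}
        return raw, per_1000

    if __name__ == "__main__":
        essay = "It is perhaps somewhat surprising that the claim is merely plausible."
        print(wordlist_features(essay))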
A2.14. Nouns assertion discovery knowledge realization statement conclusion doubt observation realisation conviction fact principle result Table A2.14.1: List of factual stance nouns assumption feeling indication probability thesis belief hypothesis notion rumour* claim idea opinion sign contention implication possibility suggestion expectation impression presumption suspicion Table A2.14.2: List of likelihood stance nouns comment proposition requirement news remark proposal report Table A2.14.3: List of non-factual stance nouns APPENDIX 2 198 ground view hope thought reason Table A2.14.4: List of attitudinal stance nouns agreement determination opportunity reluctance threat authority duty plan responsibility wish commitment failure potential right willingness confidence inclination promise scheme decision intention proposal temptation desire obligation readiness tendency Table A2.14.5: List of controlling nouns DATA PREPARATION AND ANNOTATION 199 APPENDIX 3 Data preparation and annotation A3.1. Data cleaning afraid curious importantly perhaps this amongst desirable impossible possible through angry desire in promise to as despite indeed reason unlikely at doubt likely right vital between dreadful lucky simply which but due mad stupid whilst by essential merely sufficient with careful for more surprising within certain from necessary than worried claim good not that concerned happy of the confirm hope on their Table A3.1.1: List of terms with multiple occurrences in the word lists A3.2. SVMTool The usage for SVMTool is: SVMTagger [options] (Gimenez & Marquez, 2006:29). This research project used: ./bin/SVMTagger -V 1 -S LRL -T 4 SVMT.eng/WSJTP < "file to be tagged" > "tagged file", making use of the following options: - V (verbose) was set to 1, - S (tagging direction) was set to LRL (left-to-right and right-to-left), T (tagging strategy) was set to 4, which is claimed to be robust against unknown words (Gimenez, & Marquez, 2006:30). The choice of these options was informed by Gimenez (2006); however, many other options are available, which can be viewed in Gimenez and Marquez (2006). SUPPORT VECTOR MACHINES 200 APPENDIX 4 Support vector machines A4.1. The optimal hyperplane classifier and the hard margin classifier The simplest form of SVMs make use of a special hyperplane classifier called the maximal margin classifier, which separates only linearly separable data. Consider the case where St is (x1,y1),...,(xe,ye), xeW, and _ye {+1,-1}, where the values of y indicate binary classification. Thus, the input x is assigned to a class y = —l if / ( x ) < 0 , or to y = +l if / ( x ) > 0 . This function, f(x), is a linear function of XE X and is given by Cristianini and Shawe-Taylor (2000:9): f(x) = {w-x)+b « (A4.1) = Z.wixi+b where (w,&)eRnxM are the parameters controlling / ( x ) ; w is the weight vector normal to the hyperplane and b is the threshold, which moves the hyperplane parallel to its former position. The decision rule, which governs the estimate of the target function / ( x ) , is sgn( / (x)) , where sgn(0) = 1. The decision rule, which governs the estimate / \ / / w / w f! 0 ifx<0,l of the target function f{x), is sgn^/(x)J, sgn(/(x)J:=< . >. This situation I -I- .11. -A. — V ' [ can be more easily understood if interpreted geometrically as the training space is divided into two (classes) by the hyperplane defined by Equation A4.1. 
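Purely as an illustration of Equation A4.1 and its decision rule, the short Python sketch below (not part of the original study; the weight vector, bias and input are arbitrary toy values) computes f(x) = ⟨w·x⟩ + b and assigns the class sgn(f(x)), using the convention sgn(0) = 1 described above.

    import numpy as np

    def linear_decision(x, w, b):
        # Linear function f(x) = <w . x> + b of Equation A4.1.
        return float(np.dot(w, x) + b)

    def classify(x, w, b):
        # Decision rule sgn(f(x)): +1 if f(x) >= 0 (with sgn(0) = 1), otherwise -1.
        return 1 if linear_decision(x, w, b) >= 0 else -1

    # Toy two-feature example with an assumed weight vector and threshold.
    w = np.array([0.8, -0.5])
    b = 0.1
    x = np.array([1.2, 0.4])
    print(linear_decision(x, w, b), classify(x, w, b))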
The maximal margin (optimal) hyperplane is the hyperplane that separates the data points (feature vectors) without error and such that the distance between the plane and the nearest vector is maximised. The functional margin of a training instance (x_i, y_i) is defined with respect to the hyperplane (w, b) as (Cristianini & Shawe-Taylor, 2000:11):

\gamma_i = y_i (\langle w \cdot x_i \rangle + b).   (A4.2)

This definition encompasses an inherent degree of freedom, in that w and b can be rescaled from (w, b) to (λw, λb), λ ∈ R+, without changing the function, while making the functional margin bigger. The solution to this situation implies constraining the value of w. The geometric margin is defined as the functional margin of a normalised108 weight vector; it remains unchanged if w and b are rescaled, since it is scaled by 1/‖w‖, where the length of w is defined in terms of the Euclidean norm (Cristianini & Shawe-Taylor, 2000:94-95; also Scholkopf & Smola, 2002:192). In order to determine the maximal margin hyperplane, the geometric margin must be maximised; this can be done by minimising the (Euclidean) norm of the weight vector and by setting the functional margin equal to 1.109 The latter ensures a constraint on the value of w; it implies that \langle w \cdot x^+ \rangle + b = +1 and \langle w \cdot x^- \rangle + b = -1, thus the geometric margin can be given by:

\gamma = \frac{1}{2} \left( \left\langle \frac{w}{\|w\|} \cdot x^+ \right\rangle - \left\langle \frac{w}{\|w\|} \cdot x^- \right\rangle \right) = \frac{1}{\|w\|}.   (A4.3)

Note that if w is made a unit vector, and thus of length equal to 1, then the functional and geometric margins will be the same. The above discussion leads to the actual optimisation problem. The maximal margin hyperplane, which will successfully separate S_t by realising (A4.3), must minimise the objective function

\Phi(w) = \frac{1}{2} \langle w \cdot w \rangle = \frac{1}{2} \|w\|^2,   (A4.4)

subject to the constraints

y_i (\langle w \cdot x_i \rangle + b) \geq 1, \quad i = 1, \ldots, \ell.   (A4.5)

This constrained problem is handled by introducing the primal Lagrangian

L_p(w, b, \alpha) = \frac{1}{2} \langle w \cdot w \rangle - \sum_{i=1}^{\ell} \alpha_i \left[ y_i (\langle w \cdot x_i \rangle + b) - 1 \right],   (A4.6)

where α_i ≥ 0 are the Lagrange multipliers. The Lagrangian must be minimised with respect to w and b (the primal variables) and maximised with respect to α_i ≥ 0 (the dual variables). The dual representation of the Lagrangian has the same optimal solution as the primal (the strong duality theorem; see Cristianini & Shawe-Taylor, 2000:85-86). The dual is relevant for SVMs because its number of variables equals ℓ rather than the number of attributes, consequently reducing dimensionality (Abe, 2005:18). Thus, in order to find the dual Lagrangian, a saddle point must be found and the primal variables eliminated. A saddle point indicates the optimal solution of both the primal and the dual problem and corresponds to a zero duality gap, which designates zero difference between the solution values of the two problems (Cristianini & Shawe-Taylor, 2000:86). According to the Karush-Kuhn-Tucker (KKT) conditions, the saddle point is the point at which the partial derivatives of L with respect to the primal variables vanish. This is a minimum point if the function is convex.112 Such a point will thus ensure the minimisation of the primal variables. Following the KKT conditions, this saddle point is found by setting the partial derivatives with respect to the primal variables to zero (Cristianini & Shawe-Taylor, 2000:95-96):

\frac{\partial L(w, b, \alpha)}{\partial w} = w - \sum_{i=1}^{\ell} y_i \alpha_i x_i = 0, \quad \text{and}   (A4.7)

\frac{\partial L(w, b, \alpha)}{\partial b} = \sum_{i=1}^{\ell} y_i \alpha_i = 0.   (A4.8)

Resubstituting these relations into the primal Lagrangian yields the dual Lagrangian113

L_d(\alpha) = \sum_{i=1}^{\ell} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{\ell} y_i y_j \alpha_i \alpha_j \langle x_i \cdot x_j \rangle.   (A4.11)

In this way, the Lagrangian variables can be used to solve the quadratic optimisation problem: (A4.11) is maximised subject to the constraints \sum_{i=1}^{\ell} y_i \alpha_i = 0 and α_i ≥ 0, i = 1, ..., ℓ, and the weight vector w^* = \sum_{i=1}^{\ell} y_i \alpha_i^* x_i realises the optimal hyperplane with the geometric margin of Equation A4.3, γ = 1/‖w*‖ (Cristianini & Shawe-Taylor, 2000:96). One of the relations in the KKT conditions, the KKT complementarity condition (Abe, 2005:310), is given by:

\alpha_i \left[ y_i (\langle w^* \cdot x_i \rangle + b^*) - 1 \right] = 0, \quad i = 1, \ldots, \ell.   (A4.12)

This relation allows for the calculation of b, which is not solved in the dual problem. Thus, by selecting any i for which α_i ≠ 0, the value of b can be found (Burges, 1998:130-131). However, for precision,114 the mean of the values of b calculated over the positive α_i is usually computed.

113 This dual problem is also known as the Wolfe dual problem (Hastie, Tibshirani, & Friedman, 2001:109). See also Scholkopf and Smola (2002:171-172) for more on this. 114 See Burges (1998:131), Hechter (2004:46), and Abe (2005:20).

Cristianini and Shawe-Taylor (2000:96; also Boser et al., 1992:147) calculate b* using the primal constraints as follows:
w One of the relations in the KKT conditions, the KKT complementarity condition (Abe, 2005:310) is given by: ^[y ;((w*-X l . ) + ^ ) - l ] = 0 , i = l,...,l (A4.12) This relation allows for the calculation of b, which is not solved in the dual problem. Thus by selecting any i for which ai ^ 0 the value of b can be found (Burges, 1998:130-131). However, for precision,114 the mean of the calculated values for b (if at ^ 0 ) for positive or's is usually computed. Cristianini and Shawe-Taylor (2000:96; also Boser et al., 1992:147) calculate b* using the primal constraints as follows: 113 This dual problem is also known as the Wolfe dual problem (Hastie, Tibshirani, & Friedman, 2001:109). See also Scholkopf and Smola (2002:171-172) for more on this. 114 See Burges (1998:131), Hechter (2004:46), and Abe (2005:20). SUPPORT VECTOR MACHINES 204 maxv_ ,((w*-x ;)) + minv_, f(w*-x,\) b*= ^ dl ^ iZZ. (A4.13) 2 In addition to providing the value for b, the KKT condition given in (A4.12) also indicates that for ai >0 the constraint is active and for at =0 it is inactive.115 Support vectors {SVs) refer to training instances (x ;) for which the constraint is active (a, > 0;that is non-zero). Vapnik (1982) shows that if the data are separated using such an optimal hyperplane that the expectation value of the probability of classification error on a test set is bounded by the ratio of expected number of SVs and training vectors (St). The decision function for new data points (with each point e RN) is thus (Scholkopf & Smola, 2002:14): sgn(/(x))=Xx-<(w-x,.)+Z>*. (A4.14) i=\ Thus if sgn( / (x) )>0 , then the data point is classified as y = +l; similarly, if sgn ( / (x)) < 0, the data point is classified as y = -\ (Abe, 2005:20). A4.2. The soft margin classifier Cortes and Vapnik (1995) introduce a solution116 to the problem of noisy data by the introduction of the slack variables: £ >0, i = l,...J; and (A4.15) relaxed separation constraints: ^ . ( ( w - x J + ^ l - S , i = \,...,t (A4.16) The slack variables thus allow the constraints of the margin (A4.5) to be violated on condition that there is some increase in the value of the objective function (A4.4) at the optimal solution. Thus will allow the training set to be separated with some errors, but An equality constraint is active if the solution weight vector that satisfies the inequality constraint is equal to zero and is inactive if it does not (Cristianini & Shawe-Taylor, 2000:80). 1 See Scholkopf and Smola (2002: 204) for more potential solutions and some rebuttals. APPENDIX 4 205 without allowing too many misclassification errors, whilst ensuring that the maximal margin requirements are achieved. If £. is large enough, the constraint in (A4.16) can always be met. But this may result in a trivial solution with all ^. taking on large values. In order to prevent this ^ are penalised in (A4.4). As a particular ^ >1 is required for an error to occur, Y,i£i is an upper bound on the training errors (Scholkopf & Smola, 2002:16). The simplest way of including this in (A4.4) is called the C-SV classifier; the objective function is now given by (Scholkopf & Smola, 2002:205): $ ( W ^ ) 4 ( ^ W ) 4 K - C>0, (A4.17) subject to (A4.15) and (A4.16). For the L2-norm case, the primal Lagrangian for (A4.17) is given by Cristianini and Shawe-Taylor (2000:105): L p ( w , ^ o ) = i ( w w ) + f 2 5 2 - 2 a l [ y , ( ( w - x l ) + * ) - l + 5 ] . 
An alternative to the C-SV classifier is the ν-SV classifier (Schölkopf & Smola, 2002:205-207), in which the cost parameter $C$ is replaced by a parameter $\nu \in (0, 1]$ and an additional margin variable $\rho \geq 0$ is introduced; the objective function

$\tau(\mathbf{w}, \xi, \rho) = \frac{1}{2}\langle \mathbf{w} \cdot \mathbf{w} \rangle - \nu\rho + \frac{1}{\ell}\sum_{i=1}^{\ell}\xi_i$

is minimised subject to the constraints:

$y_i\left(\langle \mathbf{w} \cdot \mathbf{x}_i \rangle + b\right) \geq \rho - \xi_i, \quad i = 1, \ldots, \ell, \quad \text{and}$    (A4.28)

$\xi_i \geq 0, \quad \rho \geq 0.$    (A4.29)

For the ν-SV classifier, the primal Lagrangian is given by Schölkopf and Smola (2002:207):

$L_p(\mathbf{w}, \xi, b, \rho, \alpha, \beta, \delta) = \frac{1}{2}\langle \mathbf{w} \cdot \mathbf{w} \rangle - \nu\rho + \frac{1}{\ell}\sum_{i=1}^{\ell}\xi_i - \sum_{i=1}^{\ell}\left(\alpha_i\left[y_i\left(\langle \mathbf{x}_i \cdot \mathbf{w} \rangle + b\right) - \rho + \xi_i\right] + \beta_i\xi_i\right) - \delta\rho.$    (A4.30)

Again, in order to establish the dual, the Lagrangian is minimised with respect to the primal variables $(\mathbf{w}, \xi, b, \rho)$ and maximised with respect to the dual variables $(\alpha, \beta, \delta)$. In the usual manner, as exemplified above, setting the partial derivative with respect to each primal variable to zero yields the following relations:

$\mathbf{w} = \sum_{i=1}^{\ell}\alpha_i y_i \mathbf{x}_i,$    (A4.31)

$\alpha_i + \beta_i = \frac{1}{\ell},$    (A4.32)

$\sum_{i=1}^{\ell}\alpha_i y_i = 0, \quad \text{and}$    (A4.33)

$\sum_{i=1}^{\ell}\alpha_i - \delta = \nu.$    (A4.34)

The dual is then obtained by substituting these relations into (A4.30), which gives:

$L_d(\alpha) = -\frac{1}{2}\sum_{i,j=1}^{\ell} y_i y_j \alpha_i \alpha_j K(\mathbf{x}_i, \mathbf{x}_j),$    (A4.35)

subject to the constraints $0 \leq \alpha_i \leq \frac{1}{\ell}$, $\sum_{i=1}^{\ell}\alpha_i y_i = 0$, and $\sum_{i=1}^{\ell}\alpha_i \geq \nu$. The threshold $b$ and the margin variable $\rho$ can then be computed by choosing two sets $S_+$ and $S_-$ of identical size $s > 0$, containing SVs whose dual variables satisfy $0 < \alpha_i < \frac{1}{\ell}$ and whose classes are $y_i = +1$ and $y_i = -1$ respectively, which gives:

$b = -\frac{1}{2s}\sum_{\mathbf{x} \in S_+ \cup S_-}\sum_{j=1}^{\ell}\alpha_j y_j K(\mathbf{x}, \mathbf{x}_j), \quad \text{and}$    (A4.40)

$\rho = \frac{1}{2s}\left(\sum_{\mathbf{x} \in S_+}\sum_{j=1}^{\ell}\alpha_j y_j K(\mathbf{x}, \mathbf{x}_j) - \sum_{\mathbf{x} \in S_-}\sum_{j=1}^{\ell}\alpha_j y_j K(\mathbf{x}, \mathbf{x}_j)\right).$    (A4.41)

The decision function is the same as that given in (A4.26).
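A correspondingly brief sketch of the ν-SV classifier, again assuming scikit-learn rather than the software used in the study and using synthetic placeholder data: the parameter ν takes over the role of C, acting as an upper bound on the fraction of margin errors and a lower bound on the fraction of support vectors.

```python
# Illustrative sketch of the nu-SV classifier, assuming scikit-learn's NuSVC.
# Synthetic data only; nu upper-bounds the fraction of margin errors and
# lower-bounds the fraction of support vectors on the training set.
import numpy as np
from sklearn.svm import NuSVC

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 4))
y = np.where(X[:, 0] - X[:, 2] > 0, 1, -1)

clf = NuSVC(nu=0.2, kernel="linear")   # nu in (0, 1] replaces the cost parameter C
clf.fit(X, y)

frac_sv = clf.support_.size / len(y)   # clf.support_ holds the support vector indices
print("fraction of support vectors (typically >= nu):", round(frac_sv, 3))
```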
APPENDIX 5

Feature selection

Normal variables were screened with the t-test (Feature Selection and Variable Screening module) and non-normal variables with the Mann-Whitney U and Kolmogorov-Smirnov two-sample tests. The features selected at each value of k are listed below.

k=2: ARI readability vvz EX PDT Flesch readability vbdz PDT POS mean word length ARI readability POS VBD DT VBZ VBD VBG JJ to VBG find jj TO VBP recall NN JJ agree actually not a suggest never RP this find in fact there when forget result this JJS recall as to TO the reflect on to to NN command had vbdz

k=3: signify their vvz Flesch readability similarly they Words >6 jj themselves up when of thus was JJ were SUCH evident full stop obvious ge mean word length bound io

k=4: insufficient ppho2 jj keen pphs2 when unfortunately rl there readiness rp words >6 easy vbdr RP pleasant vdd has unpleasant vdi VB vital vhd

k=5: actually vhn jj inevitably Flesch readability never to obviously TO in fact a result WP statement ARI readability possibly mean sentence length sort of JJ suggestion when typically there content

k=6: as for to as to TO on to jj by means of this for the sake of mean sentence length in favour of a in line with

k=7: in search of jj had Flesch readability they of up this was

k=8: were this eventually when whenever a appear readiness jj eg to csw TO ex ARI readability fo of ge nnl1 nnl

k=9: npd1 vbdz ppho2 jj pphs2 RP ppio2 Flesch readability ppx2 to rp TO to

k=10: vbdr vbdz vdd when vdi jj vhd is vhn this vv0 of vvd TO vvg to CC

Note that the CLAWS7 tags are written in small letters and the Penn Treebank tags in capitals for easy distinction. Also note that words in capitals are from the keywords list (see Appendix 2).

Table A5.1: List of features selected

NOTATION AND ABBREVIATIONS

(Bold, non-italicised text indicates a vector; italicised text indicates a scalar.)

BOW   bag-of-words
POS   part-of-speech
SVM   support vector machine
G   generator of the data
S   supervisor
LM   learning machine
$y \in Y$   output and output space
$x \in X$   input and input space
$S_t$   training set
$\ell$   number of observations in the training set
$\mathbf{x}_i$   observations
$y_i$   class labels of observations
$\gamma_i$   margin
$\alpha$   Lagrange multipliers/dual variables
$L_p$   primal Lagrangian
$L_d$   dual Lagrangian
$\xi$   slack variables
$F$   feature space
$\langle \mathbf{x} \cdot \mathbf{z} \rangle$   inner (dot) product of x and z
$\phi: X \to F$   mapping to feature space
$K(\mathbf{x}, \mathbf{z})$   kernel $\langle \phi(\mathbf{x}) \cdot \phi(\mathbf{z}) \rangle$
$\mathbf{w}$   weight vector
$b$   bias
$\|\cdot\|_p$   p-norm
$\mathbb{R}$   set of real numbers
$\mathbb{R}^+$   set of positive real numbers
$\mathbb{N}$   set of natural numbers
SV   support vector
$C$   cost parameter
RB   recall of 'bad' examples
RG   recall of 'good' examples
AR   average recall of 'bad' and 'good' examples
TLN   text-length normalised data
RAW   raw data
LOG   logarithm-transformed data
IDF   inverse document frequency data
CV   cross-validation
MGR   middle-band grades removed
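As a closing illustration of the screening step documented in Table A5.1, the following sketch, assuming Python with SciPy rather than the STATISTICA Feature Selection and Variable Screening module used in the study, routes each feature to a t-test if it appears normally distributed and to the Mann-Whitney U and Kolmogorov-Smirnov two-sample tests otherwise; the data, the Shapiro-Wilk normality check and the 0.05 thresholds are placeholder choices.

```python
# Illustrative sketch of normality-dependent feature screening, assuming SciPy.
# Each feature is compared across the two classes with a t-test if it looks
# normal (Shapiro-Wilk) and with Mann-Whitney U / Kolmogorov-Smirnov otherwise.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
X = rng.normal(size=(80, 6))                              # placeholder feature counts
y = (X[:, 0] + 0.5 * rng.normal(size=80) > 0).astype(int)  # placeholder class labels

selected = []
for j in range(X.shape[1]):
    g0, g1 = X[y == 0, j], X[y == 1, j]
    normal = stats.shapiro(X[:, j]).pvalue > 0.05          # crude normality check
    if normal:
        p = stats.ttest_ind(g0, g1).pvalue                 # normal variables: t-test
    else:
        p = min(stats.mannwhitneyu(g0, g1).pvalue,         # non-normal variables
                stats.ks_2samp(g0, g1).pvalue)
    if p < 0.05:                                           # keep discriminative features
        selected.append(j)
print("selected feature indices:", selected)
```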