dc.contributor.advisor: Van Huyssteen, G.B.
dc.contributor.advisor: Van Zaanen, M.M.
dc.contributor.author: Puttkammer, Martin Johannes
dc.date.accessioned: 2009-02-18T06:20:19Z
dc.date.available: 2009-02-18T06:20:19Z
dc.date.issued: 2006
dc.identifier.uri: http://hdl.handle.net/10394/872
dc.description: Thesis (M.A. (Applied Language and Literary Studies))--North-West University, Potchefstroom Campus, 2006.
dc.description.abstract: An important core technology in the development of human language technology applications is an automatic morphological analyser. Such a morphological analyser consists of various modules, one of which is a tokeniser. At present no tokeniser exists for Afrikaans, and it has therefore been impossible to develop a morphological analyser for Afrikaans. This research project therefore develops such a tokeniser, with two objectives: (i) to postulate a tag set for integrated tokenisation, and (ii) to develop an algorithm for integrated tokenisation. To achieve the first objective, a tag set for the tagging of sentences, named entities, words, abbreviations and punctuation is proposed specifically for the annotation of Afrikaans texts. It consists of 51 tags, which can be expanded in future to establish a larger, more specific tag set. The postulated tag set can also be simplified according to the level of specificity required by the user. It is subsequently shown that an effective tokeniser cannot be developed using only linguistic or only statistical methods. This is due to the complexity of the task: rule-based modules should be used for certain processes (for example sentence recognition), while other processes (for example named-entity recognition) can only be executed successfully by means of a machine-learning module. It is argued that a hybrid system (a system in which rule-based and statistical components are integrated) would achieve the best results on Afrikaans tokenisation. Various rule-based and statistical techniques, including a TiMBL-based classifier, are then employed to develop such a hybrid tokeniser for Afrikaans. The final tokeniser achieves an F-score of 97.25% when the complete set of tags is used. For sentence recognition an F-score of 100% is achieved. The tokeniser also recognises 81.39% of named entities.
When a simplified tag set (consisting of only 12 tags) is used to annotate named entities, the F-score rises to 94.74%. The study concludes that a hybrid approach is indeed suitable for Afrikaans sentencisation, named-entity recognition and tokenisation. The tokeniser will improve if it is trained on more data, while expanding the gazetteers and the tag set will also lead to a more accurate system.
dc.publisher: North-West University
dc.subject: Afrikaans [en]
dc.subject: Tokenisation [en]
dc.subject: Sentence recognition [en]
dc.subject: Named-entity recognition [en]
dc.subject: Sentence [en]
dc.subject: Named entity [en]
dc.subject: Word [en]
dc.subject: Morphological analysis [en]
dc.subject: Natural language processing [en]
dc.subject: Computational linguistics [en]
dc.subject: TIMBL [en]
dc.title: Outomatiese Afrikaanse tekseenheididentifisering (Automatic Afrikaans tokenisation) [afr]
dc.type: Thesis [en]
dc.description.thesistype: Masters
dc.contributor.researchID: 10215484 - Van Huyssteen, Gerhardus Beukes (Supervisor)

