Outomatiese Afrikaanse tekseenheididentifisering [Automatic Afrikaans text unit identification]

Puttkammer, Martin Johannes

dc.contributor.advisor	Van Huyssteen, G.B.
dc.contributor.advisor	Van Zaanen, M.M.
dc.contributor.author	Puttkammer, Martin Johannes
dc.date.accessioned	2009-02-18T06:20:19Z
dc.date.available	2009-02-18T06:20:19Z
dc.date.issued	2006
dc.identifier.uri	http://hdl.handle.net/10394/872
dc.description	Thesis (M.A. (Applied Language and Literary Studies))--North-West University, Potchefstroom Campus, 2006.
dc.description.abstract	An important core technology in the development of human language technology applications is an automatic morphological analyser. Such a morphological analyser consists of various modules, one of which is a tokeniser. At present no tokeniser exists for Afrikaans, and it has therefore been impossible to develop a morphological analyser for Afrikaans. This research project therefore develops such a tokeniser, with two objectives: i) to postulate a tag set for integrated tokenisation, and ii) to develop an algorithm for integrated tokenisation. To achieve the first objective, a tag set for tagging sentences, named entities, words, abbreviations and punctuation is proposed specifically for the annotation of Afrikaans texts. It consists of 51 tags, which can be expanded in future to establish a larger, more specific tag set. The postulated tag set can also be simplified according to the level of specificity required by the user. It is subsequently shown that an effective tokeniser cannot be developed using only linguistic or only statistical methods. This is due to the complexity of the task: rule-based modules should be used for certain processes (for example sentence recognition), while other processes (for example named-entity recognition) can only be executed successfully by means of a machine-learning module. It is argued that a hybrid system (a system in which rule-based and statistical components are integrated) would achieve the best results on Afrikaans tokenisation. Various rule-based and statistical techniques, including a TiMBL-based classifier, are then employed to develop such a hybrid tokeniser for Afrikaans. The final tokeniser achieves an F-score of 97.25% when the complete set of tags is used. For sentence recognition an F-score of 100% is achieved. The tokeniser also recognises 81.39% of named entities.
When a simplified tag set (consisting of only 12 tags) is used to annotate named entities, the F-score rises to 94.74%. The conclusion of the study is that a hybrid approach is indeed suitable for Afrikaans sentencisation, named-entity recognition and tokenisation. The tokeniser will improve if it is trained with more data, while the expansion of gazetteers as well as the tag set will also lead to a more accurate system.
dc.publisher	North-West University
dc.subject	Afrikaans	en
dc.subject	Tokenisation	en
dc.subject	Sentence recognition	en
dc.subject	Named-entity recognition	en
dc.subject	Sentence	en
dc.subject	Named entity	en
dc.subject	Word	en
dc.subject	Morphological analysis	en
dc.subject	Natural language processing	en
dc.subject	Computational linguistics	en
dc.subject	TiMBL	en
dc.title	Outomatiese Afrikaanse tekseenheididentifisering	afr
dc.type	Thesis	en
dc.description.thesistype	Masters
dc.contributor.researchID	10215484 - Van Huyssteen, Gerhardus Beukes (Supervisor)

Files in this item

Name: puttkammer_martinj.pdf
Size: 12.81Mb
Format: PDF

This item appears in the following Collection(s)

Humanities [2671]
