
dc.contributor.advisor: Pretorius, R.S.
dc.contributor.advisor: Van Huyssteen, Gerhard B
dc.contributor.author: Brits, Jeanetta Hendrina
dc.date.accessioned: 2009-02-25T13:57:29Z
dc.date.available: 2009-02-25T13:57:29Z
dc.date.issued: 2006
dc.identifier.uri: http://hdl.handle.net/10394/1160
dc.description: Thesis (M.A. (African Languages))--North-West University, Potchefstroom Campus, 2006.
dc.description.abstract: Within the context of natural language processing, a lemmatiser is one of the most important core technology modules that has to be developed for a particular language. A lemmatiser reduces the words in a corpus to their corresponding lemmas in the lexicon. A lemma is defined as the meaningful base form from which other, more complex forms (i.e. variants) are derived. Before a lemmatiser can be developed for a specific language, the concept "lemma" as it applies to that language must first be defined clearly. This study concludes that, in Setswana, only stems (and not roots) can act independently as words; therefore, only stems should be accepted as lemmas in the context of automatic lemmatisation for Setswana. Five of the seven parts of speech in Setswana can be viewed as closed classes, which means that these classes are not extended by means of regular morphological processes. The other two parts of speech (nouns and verbs) require alternation rules to determine the lemma. Such alternation rules were formalised in this study for the purpose of developing a Setswana lemmatiser, using the existing Setswana grammars as the basis for the rules. This made it possible to determine how precisely a formalisation of these existing grammars lemmatises Setswana words. FSA 6, the software developed by Van Noord (2002), is one of the best-known applications for developing finite state automata and transducers. Regular expressions based on the formalised morphological rules were used in FSA 6 to create finite state transducers, and the code subsequently generated by FSA 6 was implemented in the lemmatiser. The lemmatiser was evaluated in terms of precision. On a test corpus of 1 000 words, it obtained a precision of 70,92%. In a separate evaluation on 500 complex nouns and 500 complex verbs, it obtained 70,96% and 70,52% respectively. On 500 complex and simplex nouns the precision was 78,45%, and on complex and simplex verbs 79,59%. These quantitative results give only an indication of the relative precision of the grammars. Nevertheless, they provided analysed data with which the grammars were evaluated qualitatively. The study concludes with an overview of how these results might be improved in future.
dc.publisher: North-West University
dc.subject: Computational linguistics [en]
dc.subject: Setswana grammar [en]
dc.subject: Setswana morphology [en]
dc.subject: Lemmatisation [en]
dc.subject: Stemming [en]
dc.subject: Lemma [en]
dc.subject: Natural language processing [en]
dc.subject: Regular expression [en]
dc.subject: Finite state automata [en]
dc.subject: Finite state transducer [en]
dc.subject: FSA 6 [en]
dc.title: Outomatiese Setswana lemma-identifisering (Automatic Setswana lemma identification) [en]
dc.type: Thesis [en]
dc.description.thesistype: Masters

