Benoemde-entiteitherkenning vir Afrikaans (Named entity recognition for Afrikaans)
Matthew, Gordon Derrac
According to the Constitution of South Africa, the government is required to make all information available to the public in the ten indigenous languages of South Africa (excluding English). For this reason, the government released the information that already existed for these ten languages, and an effort is also being made to increase the amount of resources available in these languages (Groenewald & Du Plooy, 2010). This release of information further helps to implement Krauwer's (2003) idea of an inventory of the minimal set of language-related resources that a language requires to be competitive at the level of research and teaching. This inventory is known as the "Basic Language Resource Kit" (BLARK). Since most of the languages in South Africa are resource-scarce, it is in the best interest of the country's cultural growth that each of the indigenous South African languages develops its own BLARK.
In Chapter 1, the need for an implementable named entity recogniser (NER) for Afrikaans is discussed, first with reference to the language policy in the Constitution of South Africa (Republic of South Africa, 2003). Secondly, the guidelines of BLARK (Krauwer, 2003) are discussed, followed by an audit of the number of resources and the distribution of human language technology across all eleven South African languages (Sharma Grover, Van Huyssteen & Pretorius, 2010). This audit established that there is a shortage of text-based tools for Afrikaans. This study addresses that need by developing an NER for Afrikaans.
In Chapter 2, a description is given of what an entity and a named entity are.
Later in the chapter, the process of technology recycling is explained with reference to other studies in which the idea has been applied successfully (Rayner et al., 1997). Lastly, the differences that may occur between Afrikaans and Dutch named entities are analysed. These differences are divided into three categories: identical cognates, non-identical cognates and unrelated entities.
Chapter 3 begins with a description of Frog (Van den Bosch et al., 2007), the Dutch NER used in this study, and the functions and operation of its NER component. This is followed by a description of the Afrikaans-to-Dutch converter (A2DC) (Van Huyssteen & Pilon, 2009), and finally the various experiments are explained. The study consists of six experiments. The first determined the results of Frog on Dutch data. The second evaluated the effectiveness of Frog on unchanged (raw) Afrikaans data. The next two evaluated the results of Frog on "Dutched" Afrikaans data. The last two evaluated the effectiveness of Frog on raw and "Dutched" Afrikaans data with the addition of gazetteers as part of the pre-processing step.
In conclusion, the NER for Afrikaans developed in this study is compared with the NER component that Puttkammer (2006) used in his tokeniser. Finally, a few suggestions for future research are proposed.