Dataselektering en –manipulering vir statistiese Engels–Afrikaanse masjienvertaling

McKellar, Cindy.

dc.contributor.advisor	Groenewald, H.J.
dc.contributor.advisor	Roux, J.C.
dc.contributor.author	McKellar, Cindy.	en_US
dc.date.accessioned	2012-10-23T13:21:40Z
dc.date.available	2012-10-23T13:21:40Z
dc.date.issued	2011	en_US
dc.identifier.uri	http://hdl.handle.net/10394/7626
dc.description	Thesis (M.A. (Applied Language and Literary Studies))--North-West University, Potchefstroom Campus, 2011.
dc.description.abstract	Die sukses van enige masjienvertaalsisteem hang grootliks van die hoeveelheid en kwaliteit van die beskikbare afrigtingsdata af. 'n Sisteem wat met foutiewe of lae-kwaliteit data afgerig is, sal uiteraard swakker afvoer lewer as 'n sisteem wat met korrekte of hoë-kwaliteit data afgerig is. In die geval van hulpbronarm tale waar daar min data beskikbaar is en data dalk noodgedwonge vertaal moet word vir die skep van parallelle korpora wat as afrigtingsdata kan dien, is dit dus baie belangrik dat die data wat vir vertaling gekies word, so gekies word dat dit teksgedeeltes insluit wat die meeste waarde tot die masjienvertaalsisteem sal bydra. Dit is ook in so 'n geval uiters belangrik om die beskikbare data so goed moontlik aan te wend. Hierdie studie stel ondersoek in na metodes om afrigtingsdata te selekteer met die doel om 'n optimale masjienvertaalsisteem met beperkte hulpbronne af te rig. Daar word ook aandag gegee aan die moontlikheid om die gewigte van sekere gedeeltes van die afrigtingsdata te verhoog om sodoende die data wat die meeste waarde tot die masjienvertaalsisteem bydra te beklemtoon. Alhoewel hierdie studie spesifiek gerig is op metodes vir dataselektering en -manipulering vir die taalpaar Engels-Afrikaans, sou die metodes ook vir toepassing op ander taalpare gebruik kon word. Die evaluasieproses dui aan dat beide die dataselekteringsmetodes, asook die aanpassing van datagewigte, 'n positiewe impak op die kwaliteit van die resulterende masjienvertaalsisteem het. Die uiteindelike sisteem, afgerig deur 'n kombinasie van verskillende metodes, toon 'n 2.0001 styging in die NIST-telling en 'n 0.2039 styging in die BLEU-telling.	en_US
dc.description.abstract	The success of any machine translation system depends largely on the amount and quality of the available training data. A system trained on faulty or low-quality data will necessarily yield poorer output than a system trained on correct, high-quality data. For resource-scarce languages, where very little data is available, data may have to be translated in order to create parallel corpora for use as training data. In such cases it is critically important to choose the data that adds the most information to the machine translation system, and equally important to make the best possible use of the training data that is available. This study therefore investigates methods of training data selection for training the best possible machine translation system with limited resources. The possibility of adapting the weights of certain parts of the training data, thereby emphasising the data that is most valuable to the machine translation system, was also examined. Although this study was conducted specifically for the language pair English-Afrikaans, the methods investigated could also be adapted for other language pairs. The evaluation indicates that both the data selection methods and the adaptation of the data weights have a positive impact on the quality of the resulting machine translation system. The final machine translation system, trained using a combination of the different methods, showed an increase of 2.0001 in the NIST score and 0.2039 in the BLEU score.
dc.publisher	North-West University
dc.subject	Statistiese masjienvertaling	en_US
dc.subject	Masjienvertaling	en_US
dc.subject	Engels	en_US
dc.subject	Afrikaans	en_US
dc.subject	Dataselektering	en_US
dc.subject	Afrigtingsdata	en_US
dc.subject	Statistical machine translation	en_US
dc.subject	Machine translation	en_US
dc.subject	English	en_US
dc.subject	Data selection	en_US
dc.subject	Training data	en_US
dc.title	Dataselektering en –manipulering vir statistiese Engels–Afrikaanse masjienvertaling	afr
dc.type	Thesis	en_US
dc.description.thesistype	Masters	en_US

Files in this item

Name: McKellar_CA.pdf
Size: 1.264Mb
Format: PDF

This item appears in the following Collection(s)

Humanities [2671]