dc.contributor.advisor: Groenewald, H.J.
dc.contributor.advisor: Roux, J.C.
dc.contributor.author: McKellar, Cindy. [en_US]
dc.description: Thesis (M.A. (Applied Language and Literary Studies))--North-West University, Potchefstroom Campus, 2011.
dc.description.abstract: The success of any machine translation system depends largely on the amount and quality of the available training data. A system trained on faulty or low-quality data will naturally produce poorer output than a system trained on correct or high-quality data. In the case of resource-scarce languages, where little data is available and data may have to be translated to create parallel corpora that can serve as training data, it is therefore very important that the data chosen for translation includes the text segments that will add the most value to the machine translation system. In such a case it is also extremely important to use the available data as well as possible. This study investigates methods for selecting training data with the aim of training an optimal machine translation system with limited resources. Attention is also given to the possibility of increasing the weights of certain parts of the training data in order to emphasise the data that adds the most value to the machine translation system. Although this study focuses specifically on methods of data selection and manipulation for the language pair English-Afrikaans, the methods could also be applied to other language pairs. The evaluation process indicates that both the data selection methods and the adjustment of data weights have a positive impact on the quality of the resulting machine translation system. The final system, trained using a combination of different methods, shows a 2.0001 increase in the NIST score and a 0.2039 increase in the BLEU score. [en_US]
dc.description.abstract: The success of any machine translation system depends largely on the amount and quality of the available training data. A system trained with faulty or low-quality data will necessarily yield poorer output than a system trained with correct, high-quality data. For resource-scarce languages, where very little data is available, data must often be translated in order to create parallel corpora for use as training data. In such cases it is critically important to choose the data that adds the most information to the machine translation system, and to make the best possible use of the training data that is available. This study therefore investigates methods of training data selection for training the best possible machine translation system with limited resources. The possibility of adapting the weights of certain parts of the training data, thereby emphasising the data that is most valuable to the machine translation system, was also examined. Although this study was conducted specifically for the language pair English-Afrikaans, the methods investigated could also be adapted for other language pairs. The evaluation process indicates that both the data selection methods and the adaptation of the data weights have a positive impact on the quality of the resulting machine translation system. The final machine translation system showed a 2.0001 increase in the NIST score and a 0.2039 increase in the BLEU score.
dc.publisher: North-West University
dc.subject: Statistiese masjienvertaling [en_US]
dc.subject: Statistical machine translation [en_US]
dc.subject: Machine translation [en_US]
dc.subject: Data selection [en_US]
dc.subject: Training data [en_US]
dc.title: Dataselektering en –manipulering vir statistiese Engels–Afrikaanse masjienvertaling (Data selection and manipulation for statistical English-Afrikaans machine translation) [afr]
