Exploring minimal pronunciation modeling for low resource languages

Authors: Barnard, Etienne; Van Heerden, Charl; Hartmann, William; Karakos, Damianos; Schwartz, Richard; Tsakalidis, Stavros; Davel, Marelie H.
Publisher: IOS Press Inc
Year: 2015
Keywords: Spoken term detection; Graphemic systems; Pronunciation lexicons
Record date: 2018-03-02
Language: en
Type: Presentation
Citation: Marelie Davel, Damianos Karakos, Etienne Barnard, Charl van Heerden, Richard Schwartz and Stavros Tsakalidis, William Hartmann, “Exploring minimal pronunciation modeling for low resource languages”, in Proc. Interspeech, pp 538-542, Dresden, Germany, 2015. [http://engineering.nwu.ac.za/multilingual-speech-technologies-must/publications]
ISBN: 978-1-61499-700-9
URLs:
https://books.google.co.za/books?id=-RGhDQAAQBAJ&pg=PA44&lpg=PA44&dq=Exploring+minimal+pronunciation+modeling+for+low+resource+languages&source=bl&ots=wAYDYAm_Ju&sig=ha5BMCtwoEBjHQTAkyauz2wSSEc&hl=en&sa=X&ved=0ahUKEwjFwPDv1M3ZAhUlKsAKHXrICPkQ6AEIODAC#v=onepage&q=Exploring%20minimal%20pronunciation%20modeling%20for%20low%20resource%20languages&f=false
https://www.lti.cs.cmu.edu/sites/default/files/sitaram%2C%20sunayana.pdf
http://hdl.handle.net/10394/26488

Abstract: Pronunciation lexicons can range from fully graphemic (modeling each word using the orthography directly) to fully phonemic (first mapping each word to a phoneme string). Between these two options lies a continuum of modeling options. We analyze techniques that can improve the accuracy of a graphemic system without requiring significant effort to design or implement. The analysis is performed in the context of the IARPA Babel project, which aims to develop spoken term detection systems for previously unseen languages rapidly, and with minimal human effort. We consider techniques related to letter-to-sound mapping and language-independent syllabification of primarily graphemic systems, and discuss results obtained for six languages: Cebuano, Kazakh, Kurmanji Kurdish, Lithuanian, Telugu and Tok Pisin.
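
To make the "fully graphemic" end of the continuum concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a word list in an alphabetic script and treats each lowercased letter as one modeling unit; the function name `graphemic_lexicon` and the example words are hypothetical. Scripts that rely on combining signs, such as Telugu, would need more careful unit segmentation than the `isalpha` filter used here.

```python
def graphemic_lexicon(words):
    """Build a lexicon whose 'pronunciations' are just the letters of each word.

    Illustrative assumption: every alphabetic character, lowercased,
    is treated as one graphemic unit. Non-letters (hyphens, digits,
    apostrophes) are dropped.
    """
    lexicon = {}
    for word in words:
        units = [ch for ch in word.lower() if ch.isalpha()]
        if units:  # skip entries that contain no letters at all
            lexicon[word] = units
    return lexicon


if __name__ == "__main__":
    # Hypothetical word list, printed in a typical "word unit unit ..." layout.
    for word, units in graphemic_lexicon(["Tok", "Pisin", "tok-pisin"]).items():
        print(word, " ".join(units))
```

A phonemic lexicon would replace the letter sequence with the output of a letter-to-sound mapping; the intermediate options described in the abstract sit between these two extremes.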