Language identification for proper name pronunciation
Abstract
The ability to predict the pronunciation of proper names is important to speech recognition applications that utilise names, such as directory enquiry systems. One of the factors that has been shown to improve the modelling of proper names is the ability to identify the language of origin of a particular name. Proper names present specific challenges, which typically result in poor language identification (LID) accuracy: they are short, can be spelled in idiosyncratic ways and may have multiple language origins. In South Africa, the difficulty of identifying the language of origin of a name is exacerbated by two factors: the co-existence of multiple languages and the scarcity of resources for model training. In this thesis, we first investigate existing LID approaches applicable to words in isolation, focusing specifically on those techniques that have been found to produce high accuracy when resources are limited. We assess the strengths and weaknesses of existing LID techniques when applied to generic words and highlight various factors that influence the accuracy with which the language of individual words can be classified. A novel approach to LID of isolated words is then developed using an existing pronunciation modelling technique. Specifically, the LID task is recast as a pronunciation modelling task, and ‘joint sequence models’ are applied to obtain accurate single-word predictions. We evaluate the algorithm and find that the approach outperforms conventional LID techniques in terms of identification accuracy, with low training data requirements. The results show that this new approach is able to reach identification accuracies greater than 97% on generic words.
Given that suitable corpora for South African names were not available prior to the study, we developed two corpora as part of this work: the ‘Southern African corpus for multilingual name pronunciation’ (Multipron corpus) contains names in four languages (Afrikaans, English, Sesotho and isiZulu) as produced by speakers of the particular languages; the ‘South African directory enquiry’ (SADE) corpus contains a wide variety of names collected in a directory enquiries system and produced by speakers of the same four languages. Applying the proposed technique to these corpora shows that LID of proper names is a difficult task, although identification accuracy of over 80% is still obtained. In practice, there are cases where words belong to more than one language of origin. This has not been studied extensively (for either generic words or proper names), even though it is of practical importance. We investigate the ability of the proposed technique to perform LID of multilingual words, specifically for under-resourced languages. This thesis concludes by investigating the implications of LID of proper names for pronunciation prediction, by analysing the grapheme-to-phoneme (G2P) accuracy of dictionaries developed using the auto-generated LID information, as well as the recognition accuracy of an automatic speech recognition (ASR) system developed using these dictionaries. We define a new G2P performance metric, bilateral VPA, which deals with variants in a way that is conceptually more consistent than existing performance metrics. We show that the new G2P accuracy measure correlates well with the ASR results observed. Based on an analysis of different approaches to dictionary creation, we provide guidelines for incorporating LID information during pronunciation modelling of proper names.
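As a purely illustrative aside, the task of identifying the language of a single word can be sketched with a character n-gram classifier, one of the conventional kinds of LID technique for isolated words. This is not the joint sequence model approach developed in the thesis. The class name `NgramLID`, the helper `char_ngrams`, the add-one smoothing and the tiny word lists are all assumptions made for the example, and the word lists are far too small to give meaningful accuracy.

```python
import math
from collections import Counter


def char_ngrams(word, n=3):
    """Return the character n-grams of a word, padded with '#' boundary marks."""
    padded = "#" * (n - 1) + word.lower() + "#"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramLID:
    """Toy single-word language identifier (hypothetical, for illustration only).

    Each language is modelled by its character n-gram counts; a word is
    assigned to the language with the highest add-one-smoothed log-likelihood.
    """

    def __init__(self, n=3):
        self.n = n
        self.counts = {}   # language -> Counter of n-grams
        self.totals = {}   # language -> total number of n-gram tokens
        self.vocab = set() # all n-grams seen in any language

    def fit(self, words_by_language):
        for lang, words in words_by_language.items():
            counter = Counter(g for w in words for g in char_ngrams(w, self.n))
            self.counts[lang] = counter
            self.totals[lang] = sum(counter.values())
            self.vocab.update(counter)
        return self

    def score(self, word, lang):
        # Add-one smoothing; the extra 1 reserves mass for unseen n-grams.
        vocab_size = len(self.vocab) + 1
        counter, total = self.counts[lang], self.totals[lang]
        return sum(
            math.log((counter[g] + 1) / (total + vocab_size))
            for g in char_ngrams(word, self.n)
        )

    def predict(self, word):
        return max(self.counts, key=lambda lang: self.score(word, lang))


# Assumed toy training data: a handful of generic words per language.
toy_training_words = {
    "english": ["water", "house", "thing", "school", "which"],
    "afrikaans": ["huis", "skool", "vrou", "goeie", "kwaai"],
}

model = NgramLID(n=3).fit(toy_training_words)
prediction_english = model.predict("things")
prediction_afrikaans = model.predict("huise")
```

With so little evidence per word, a model of this kind depends heavily on a few distinctive letter sequences, which gives some intuition for why short, idiosyncratically spelled proper names are harder to classify than generic words.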