The effects of part–of–speech tagging on text–to–speech synthesis for resource–scarce languages

Schlünz, Georg Isaac

The effects of part–of–speech tagging on text–to–speech synthesis for resource–scarce languages

Files

Schlunz_GI.pdf (983.35 KB)

Date

2010

Authors

Schlünz, Georg Isaac

Supervisors

Barnard, E.
Van Huyssteen, Gerhard B
Van Niekerk, D.R.

Publisher

North-West University

Abstract

In the world of human language technology, resource-scarce languages (RSLs) suffer from the problem of little available electronic data and linguistic expertise. The Lwazi project in South Africa is a large-scale endeavour to collect and apply such resources for all eleven of the official South African languages. One of the deliverables of the project is more natural text-to-speech (TTS) voices. Naturalness is primarily determined by prosody and it is shown that many aspects of prosodic modelling is, in turn, dependent on part-of-speech (POS) information. Solving the POS problem is, therefore, a prudent first step towards meeting the goal of natural TTS voices. In a resource-scarce environment, obtaining and applying the POS information are not trivial. Firstly, an automatic tagger is required to tag the text to be synthesised with POS categories, but state-of-the-art POS taggers are data-driven and thus require large amounts of labelled training data. Secondly, the subsequent processes in TTS that are used to apply the POS information towards prosodic modelling are resource-intensive themselves: some require non-trivial linguistic knowledge; others require labelled data as well. The first problem asks the question of which available POS tagging algorithm will be the most accurate on little training data. This research sets out to answer the question by reviewing the most popular supervised data-driven algorithms. Since literature to date consists mostly of isolated papers discussing one algorithm, the aim of the review is to consolidate the research into a single point of reference. A subsequent experimental investigation compares the tagging algorithms on small training data sets of English and Afrikaans, and it is shown that the hidden Markov model (HMM) tagger outperforms the rest when using both a comprehensive and a reduced POS tagset. Regarding the second problem, the question arises whether it is perhaps possible to circumvent the traditional approaches to prosodic modelling by learning the latter directly from the speech data using POS information. In other words, does the addition of POS features to the HTS context labels improve the naturalness of a TTS voice? Towards answering this question, HTS voices are trained from English and Afrikaans prosodically rich speech. The voices are compared with and without POS features incorporated into the HTS context labels, analytically and perceptually. For the analytical experiments, measures of prosody to quantify the comparisons are explored. It is then also noted whether the results of the perceptual experiments correlate with their analytical counterparts. It is found that, when a minimal feature set is used for the HTS context labels, the addition of POS tags does improve the naturalness of the voice. However, the same effect can be accomplished by including segmental counting and positional information instead of the POS tags.

Description

Thesis (M.Sc. Engineering Sciences (Electrical and Electronic Engineering))--North-West University, Potchefstroom Campus, 2011.

Keywords

part-of-speech tagging, text-to-speech synthesis, resource-scarce language, naturalness, prosody, HTS context labels

URI

http://hdl.handle.net/10394/4944

Collections

Engineering

Full item page

The effects of part–of–speech tagging on text–to–speech synthesis for resource–scarce languages

Files

Date

Authors

Researcher ID

Supervisors

Journal Title

Journal ISSN

Volume Title

Publisher

Record Identifier

Abstract

Sustainable Development Goals

Description

Keywords

Citation

URI

Collections

Endorsement

Review

Supplemented By

Referenced By