Embedding recognized speech in a multilingual environment
Abstract
Word embeddings are widely used in natural language processing tasks. Most work
on word embeddings focuses on single languages with large available datasets.
For embeddings to be useful in a multilingual environment, as in South Africa, the
training techniques have to be adjusted to cater for a) multiple languages, b) smaller
datasets and c) the occurrence of code-switching. One of the biggest roadblocks is
obtaining datasets that include examples of natural code-switching, since code-switching
is generally avoided in written material. A solution to this problem is to use
speech-recognised data. Embedding packages such as Word2Vec and GloVe have default
hyper-parameter settings that are usually optimised for training on large datasets
and evaluation on analogy tasks. When using embeddings for problems such as text
classification in our multilingual environment, the hyper-parameters have to be optimised
for the specific data and task. We investigate the importance of optimising relevant
hyper-parameters for training word embeddings with speech-recognised data,
where code-switching occurs, and evaluate against the real-world problem of classifying
radio and television recordings that contain code-switching. In this dissertation we present
findings on the application of word embeddings to recognised speech in a multilingual
environment.