Show simple item record

dc.contributor.advisor  Van Vuuren, P.A.
dc.contributor.author  Tembo, Francis Anesu Phineas
dc.date.accessioned  2023-10-16T07:55:41Z
dc.date.available  2023-10-16T07:55:41Z
dc.date.issued  2023
dc.identifier.uri  https://orcid.org/0000-0003-1454-3164
dc.identifier.uri  http://hdl.handle.net/10394/42244
dc.description  MSc (Engineering Sciences with Computer and Electronic Engineering), North-West University, Potchefstroom Campus  en_US
dc.description.abstract  Speaker diarisation is an essential upstream processing step in many speech processing applications. In child natural language environment analysis, it is used to partition long-format recordings into segments corresponding to each speaker in the audio stream. After the daylong recording is partitioned into speaker segments, speech recognition algorithms can be used to determine the quality and quantity of speech input the child has experienced from their mother, father, siblings, or other caregivers. This information can be used to track a child's language understanding, allowing researchers to investigate various topics related to their development. Long-format recordings are the least invasive way to capture speech data without affecting people's behaviour, making them suitable for picking up interactions in all potential language use scenarios. Although the importance of this technology is apparent, the accessibility of these tools has been a major hurdle to widespread adoption, because speech technology involving child speech is hard to develop: infants and children do not follow adult speaking patterns, and the data is far more difficult and expensive to collect and annotate. In light of these limitations, this study investigates how interpretability, the set of techniques for explaining machine learning models, can be used to develop a diarisation model. This is accomplished by first exploring the literature on interpretability, analysing the broad desiderata related to it, and developing a taxonomy that consolidates the categorisation of interpretability methodologies. Since speaker diarisation is a speech processing task, interpretability is explored in the context of speech processing systems.
The current literature suggests that the choice of interpretability technique is dictated by the class of the underlying machine learning model and the nature of the input speech on which the model is trained. The study explores gender perception through the classification of speech samples in order to establish a foundation for how interpretability may be investigated for the diarisation model. This is investigated through two classes of deep neural network architectures: one network accepts input speech as spectrograms in the time-frequency domain, while the other receives input speech as raw waveforms in the time domain. Interpreting these models requires that the explanations convey that the model has latched on to gender-discriminative information, such as differences in pitch and formant frequencies. Explanations for the spectrogram-trained model are generated using the expected gradients method; the attributions on the spectrogram show that the model focuses on the formants and the distances between them. The raw waveform model is interpreted by determining the cumulative frequency response of the filters in the first convolutional layer; this response is shown to favour spectral information related to speaker discrimination. The diarisation problem is investigated in a voice type classification setting, taking into consideration the classes of speaker generally expected in the child's natural language environment. Both the proposed VGG16 model trained on spectrograms and the SampleCNN model trained on raw waveforms are competitive with the existing state-of-the-art benchmark. The interpretations of these models are generated as in the gender perception problem, and both models give an understanding of the information used to differentiate between speakers.
Although the VGG16 model performed better than the SampleCNN model, and its explanations are more human-readable, the interpretations of the VGG16 model via the expected gradients method did not yield any information on how the model might be improved. The SampleCNN model, on the other hand, performs similarly to the current benchmark, and the cumulative frequency response of the filters in its first convolutional layer can be manipulated, by varying the filter sizes, to favour speaker-related spectral information. With respect to the input space, the expected gradients interpretations of the VGG16 provide local to semi-local interpretability, whereas the SampleCNN's cumulative response provides global interpretability and is a direct interpretation of the internal representations learned by the network. When the diarisation model's decisions and/or learnt representations can be explained and confirmed using domain-specific speaker-discriminative information, the model can be regarded as interpretable. Depending on the nature of the diarisation model, these interpretations can then be used to actively influence the model's architecture or to suggest changes to the nature of the input that improve performance.  en_US
dc.language.iso  en  en_US
dc.publisher  North-West University (South Africa).  en_US
dc.subject  Diarisation  en_US
dc.subject  Interpretability  en_US
dc.subject  Deep neural network  en_US
dc.subject  Convolutional neural network  en_US
dc.subject  Bidirectional long short-term memory  en_US
dc.subject  Fundamental frequency  en_US
dc.subject  Formant  en_US
dc.title  Interpretable speaker diarisation for child language environment analysis  en_US
dc.type  Thesis  en_US
dc.description.thesistype  Masters  en_US
dc.contributor.researchID  10732926 - Van Vuuren, Pieter Andries (Supervisor)
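The expected gradients method referred to in the abstract averages integrated-gradients-style attributions over baselines drawn from a reference distribution. A minimal Monte-Carlo sketch of the estimator follows; the function names and the toy linear model (chosen so the gradient is available in closed form) are illustrative assumptions, not material from the thesis:

```python
import numpy as np

def expected_gradients(grad_f, x, baselines, n_samples=200, seed=0):
    """Monte-Carlo estimate of expected gradients attributions:
    EG_i(x) = E_{x'~baselines, a~U(0,1)} [(x_i - x'_i) * df/dx_i(x' + a(x - x'))]."""
    rng = np.random.default_rng(seed)
    attr = np.zeros_like(x, dtype=float)
    for _ in range(n_samples):
        xp = baselines[rng.integers(len(baselines))]  # sample a baseline input
        a = rng.uniform()                             # interpolation coefficient
        attr += (x - xp) * grad_f(xp + a * (x - xp))  # gradient along the path
    return attr / n_samples

# Toy differentiable "model": f(x) = w . x, so grad f(x) = w everywhere.
w = np.array([2.0, -1.0, 0.5])
f = lambda x: w @ x
grad_f = lambda x: w

x = np.array([1.0, 1.0, 1.0])
baselines = np.zeros((4, 3))  # all-zero reference inputs

attr = expected_gradients(grad_f, x, baselines)
# For a linear model with zero baselines the attributions are exactly w * x,
# and they sum to f(x) - f(baseline) (the completeness property).
```

For the spectrogram-trained model in the thesis, `x` would be a spectrogram and `grad_f` the network's gradient, giving a per-pixel attribution map.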
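The cumulative frequency response used to interpret the waveform model's first convolutional layer can be estimated by summing the normalised magnitude spectra of the learned 1-D filters. A sketch with stand-in filters (random noise smoothed by a moving average to mimic low-pass behaviour; the thesis's actual SampleCNN weights are not reproduced here):

```python
import numpy as np

def cumulative_frequency_response(filters, n_fft=512):
    """Average of the peak-normalised magnitude spectra of 1-D conv filters."""
    resp = np.zeros(n_fft // 2 + 1)
    for h in filters:
        H = np.abs(np.fft.rfft(h, n=n_fft))  # magnitude spectrum of one filter
        resp += H / (H.max() + 1e-12)        # normalise each filter's peak to 1
    return resp / len(filters)

# Hypothetical stand-in for a trained first layer: 64 random filters,
# low-pass smoothed with a length-9 moving average.
rng = np.random.default_rng(1)
raw = rng.standard_normal((64, 101))
kernel = np.ones(9) / 9
filters = np.array([np.convolve(h, kernel, mode="same") for h in raw])

resp = cumulative_frequency_response(filters)
# Smoothed filters concentrate energy in the lower frequency bins,
# so the bottom quarter of the band outweighs the top quarter.
low = resp[: len(resp) // 4].mean()
high = resp[3 * len(resp) // 4 :].mean()
```

Plotting `resp` against frequency is what reveals whether the learned filter bank favours speaker-discriminative bands, and varying the filter length shifts where that response concentrates.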

