Influence of input matrix representation on topic modelling performance
Abstract
Topic models explain a collection of documents with
a small set of distributions over terms; these term
distributions define the topics. Topic models ignore the
structure of documents and use a bag-of-words approach
that relies solely on the frequency of words in the corpus.
We challenge the bag-of-words assumption and propose a
method to structure single words into concepts. In this way,
the inherent meaning of the feature space is enriched by more
descriptive concepts rather than single words. We draw on
techniques from natural language processing to structure
words into concepts.
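The abstract does not specify which NLP process is used to form concepts. As one illustrative sketch (an assumption, not the paper's method), frequent bigrams can be merged into single concept features before the document-term matrix is built, using only the Python standard library:

```python
from collections import Counter
from itertools import tee

# Hypothetical mini-corpus for illustration; the paper's corpora are not given here.
docs = [
    "topic models learn latent topics from word counts",
    "topic models ignore word order in the bag of words",
    "latent topics are distributions over words",
]

def bigrams(tokens):
    """Yield adjacent word pairs from a token list."""
    a, b = tee(tokens)
    next(b, None)
    return zip(a, b)

# Count bigrams across the corpus; bigrams seen at least twice
# stand in for "concepts" in this toy setting.
tokens = [d.split() for d in docs]
counts = Counter(bg for t in tokens for bg in bigrams(t))
concepts = {bg for bg, c in counts.items() if c >= 2}

def to_concept_features(toks):
    """Merge adjacent words that form a known concept into one feature."""
    out, i = [], 0
    while i < len(toks):
        if i + 1 < len(toks) and (toks[i], toks[i + 1]) in concepts:
            out.append(toks[i] + "_" + toks[i + 1])
            i += 2
        else:
            out.append(toks[i])
            i += 1
    return out

features = [to_concept_features(t) for t in tokens]
```

Here the enriched feature space contains tokens such as "topic_models" in place of the two separate words, so the input matrix to the topic model has concept-level columns.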
To compare the performance of structured features
with the bag-of-words approach, we sketch an evaluation framework
that accommodates different feature dimension sizes. This
is in contrast with existing measures such as perplexity, which
depend on the size of the vocabulary being modelled and
therefore cannot be used to compare models built on different
input feature sets. We use a stability-based validation index to
measure a model's ability to replicate similar solutions on
independent data sets generated from the same probabilistic
source. Stability-based validation behaves more consistently
across feature dimensions than perplexity or
information-theoretic measures.
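The abstract does not name the specific stability index. As an illustration only (an assumed choice, not necessarily the paper's), the adjusted Rand index is a common way to score agreement between the topic assignments produced by two independent fits; it is invariant to topic relabelling, which matters because topic identities are arbitrary across runs:

```python
from collections import Counter
from math import comb

def adjusted_rand(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same items.

    Returns 1.0 for identical partitions (up to label permutation)
    and values near 0 for independent partitions.
    """
    n = len(labels_a)
    pairs = Counter(zip(labels_a, labels_b))          # contingency cells
    a = Counter(labels_a)                             # row sums
    b = Counter(labels_b)                             # column sums
    sum_comb = sum(comb(c, 2) for c in pairs.values())
    sum_a = sum(comb(c, 2) for c in a.values())
    sum_b = sum(comb(c, 2) for c in b.values())
    total = comb(n, 2)
    expected = sum_a * sum_b / total                  # chance agreement
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:                         # degenerate partition
        return 1.0
    return (sum_comb - expected) / (max_index - expected)
```

In a stability-based evaluation, one would fit the topic model on two samples drawn from the same source, assign each shared document its dominant topic in each fit, and report the index; because the score does not depend on vocabulary size, it can compare a bag-of-words model against a concept-feature model.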
URI
https://researchspace.csir.co.za/dspace/bitstream/handle/10204/4712/de%20Waal_2010.pdf?sequence=1&isAllowed=y
http://hdl.handle.net/10394/26554
Collections
- Faculty of Engineering [1136]