Markus Konrad markus.konrad@wzb.eu
January 2018
Material will be available at: http://dsspace.wzb.eu/pyug/topicmodeling1/
Topic modeling is a method to discover abstract topics within a collection of documents.
Each collection of documents (corpus) contains a "latent" or "hidden" structure of topics. Some topics are more prominent in the whole corpus than others. Each document covers multiple topics, each to a different degree.
The latent variable $z$ describes the topic structure, as each word of each document is thought to be implicitly assigned to a topic.
*Latent Dirichlet Allocation (LDA)* is a topic model that assumes:
* each topic is a distribution over words
* each document is a mixture (distribution) of topics
* word order within a document does not matter (bag-of-words)
Classic approach:
Has been used successfully for hypothesis testing (Fligstein et al. 2017)
Original: "In the first World Cup game, he **sat** only on the bank."
Tokens: first, world, cup, game, he, sit, bank
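Such preprocessing (tokenization, lemmatization, stop word removal) can be done with several NLP libraries; the following is only a minimal sketch using spaCy and its small English model, which is an assumption here. The exact token list depends on the stop word list and lemmatizer used:

```python
import spacy

# assumes the small English model is installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("In the first World Cup game, he sat only on the bank.")

# keep lemmatized, lowercased tokens; drop punctuation and stop words
tokens = [tok.lemma_.lower() for tok in doc
          if not tok.is_punct and not tok.is_stop]

print(tokens)  # roughly: ['world', 'cup', 'game', 'sit', 'bank']
```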
document | russia | putin | soccer | bank |
---|---|---|---|---|
$D_1$ | 3 | 1 | 0 | 1 |
$D_2$ | 0 | 0 | 2 | 1 |
$D_3$ | 0 | 0 | 0 | 2 |
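A document-term matrix like the one above can be built, for example, with scikit-learn's `CountVectorizer` (the library choice and the toy documents below are assumptions for illustration, not the actual example corpus):

```python
from sklearn.feature_extraction.text import CountVectorizer

# toy documents whose term counts match the table above
docs = [
    "russia russia russia putin bank",   # D1
    "soccer soccer bank",                # D2
    "bank bank",                         # D3
]

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)       # sparse matrix: rows = documents, columns = terms

print(vectorizer.get_feature_names_out())  # the vocabulary (column labels)
print(dtm.toarray())                       # cells = term counts per document
```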
An LDA topic model can be described by two distributions: the topic-word distribution $\phi$ and the document-topic distribution $\theta$.
Given this model, we can generate words (see the sketch after the two tables below):
Each topic has a distribution over all words in the corpus:
topic | russia | putin | soccer | bank | possible interpretation |
---|---|---|---|---|---|
topic 1 | 0.5 | 0.4 | 0.0 | 0.1 | russian politics |
topic 2 | 0.1 | 0.0 | 0.7 | 0.2 | soccer in russia |
topic 3 | 0.3 | 0.0 | 0.0 | 0.7 | russian economy |
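This distribution describes the topics, i.e. which words are important for each of them. The "possible interpretation" comes from reading off the most probable words per topic; a small NumPy sketch with the values from the table above (the library choice is an assumption):

```python
import numpy as np

vocab = np.array(["russia", "putin", "soccer", "bank"])
phi_topic3 = np.array([0.3, 0.0, 0.0, 0.7])   # row for topic 3 from the table

# indices of the two most probable words of this topic
top = np.argsort(phi_topic3)[::-1][:2]
print(vocab[top])                             # ['bank' 'russia'] -> "russian economy"
```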
Each document has a different distribution over all topics:
→ describes the documents (which topics are important for them)
document | topic 1 | topic 2 | topic 3 | possible interpretation |
---|---|---|---|---|
doc. 1 | 0.0 | 0.3 | 0.7 | mostly about soccer and russian economy |
doc. 2 | 0.9 | 0.0 | 0.1 | russian politics and a bit of economy |
doc. 3 | 0.3 | 0.5 | 0.2 | all three topics |
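A minimal sketch of the generative view mentioned above, using NumPy and the illustrative numbers from the two tables (all values are taken from the tables, not from a fitted model): first draw a topic $z$ from the document's topic distribution $\theta$, then draw a word from that topic's word distribution $\phi_z$.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["russia", "putin", "soccer", "bank"]

# topic-word distribution phi: one row per topic (values from the first table)
phi = np.array([[0.5, 0.4, 0.0, 0.1],    # topic 1: "russian politics"
                [0.1, 0.0, 0.7, 0.2],    # topic 2: "soccer in russia"
                [0.3, 0.0, 0.0, 0.7]])   # topic 3: "russian economy"

# document-topic distribution theta for document 2 (values from the second table)
theta_doc2 = np.array([0.9, 0.0, 0.1])

# generate 10 words for document 2:
# 1. draw a topic z from theta, 2. draw a word from phi[z]
words = []
for _ in range(10):
    z = rng.choice(3, p=theta_doc2)
    w = rng.choice(4, p=phi[z])
    words.append(vocab[w])

print(words)   # mostly "russia"/"putin", occasionally "bank"
```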
Direct calculation is not possible (it involves $K^n$ terms, with $n$ being the number of tokens in the corpus) (Griffiths & Steyvers 2004).
Estimation is instead done with either:
* variational inference (variational Bayes)
* Gibbs sampling (Markov chain Monte Carlo)
both are rather complicated algorithms
The Gibbs sampling algorithm uses iterative resampling (see the sketch below):
First: initialize a random $Z$ – each word is randomly assigned to a topic
Then: for many iterations*, go through all words and re-sample each word's topic assignment from its conditional distribution, given all other current assignments
Finally: determine $\phi$ and $\theta$ from the current $Z$.
* depends on corpus size (~1000 iterations is usually fine – log-likelihood must converge)
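To make these steps more concrete, here is a heavily simplified collapsed Gibbs sampler in plain NumPy. It is only a sketch using the usual textbook update rule; the function name and default hyperparameters are made up for illustration, and the supplementary notebook mentioned below contains the proper step-by-step version.

```python
import numpy as np

def toy_gibbs_lda(docs, vocab_size, n_topics, alpha=0.1, beta=0.01,
                  n_iter=200, seed=1):
    """Very small collapsed Gibbs sampler for LDA.
    `docs` is a list of documents, each a list of word indices."""
    rng = np.random.default_rng(seed)
    K, V = n_topics, vocab_size

    n_dk = np.zeros((len(docs), K))   # topic counts per document
    n_kw = np.zeros((K, V))           # word counts per topic
    n_k = np.zeros(K)                 # total word count per topic

    # step 1: initialize a random Z (each word gets a random topic)
    z = [rng.integers(K, size=len(doc)) for doc in docs]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

    # step 2: iteratively re-sample the topic assignment of every word
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # remove the current assignment from the counts
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                # conditional distribution over topics for this word
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
                k = rng.choice(K, p=p / p.sum())
                # record the new assignment
                z[d][i] = k
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

    # step 3: determine phi and theta from the current Z (via the counts)
    phi = (n_kw + beta) / (n_k[:, None] + V * beta)
    theta = (n_dk + alpha) / (n_dk.sum(axis=1, keepdims=True) + K * alpha)
    return phi, theta

# toy corpus: word indices into a vocabulary of size 4
docs = [[0, 0, 0, 1, 3], [2, 2, 3], [3, 3]]
phi, theta = toy_gibbs_lda(docs, vocab_size=4, n_topics=3)
```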
The supplementary tm_toy_lda notebook shows a step-by-step implementation.
* unless your random number generator is manually set to a certain state and your data and hyperparameters are the same
Note: x-axis is number of topics, y-axis is normalized scale for different "model quality" metrics
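Such metric values can be computed by fitting one model per candidate number of topics and recording a quality measure for each. A hedged sketch with scikit-learn (the library, the metric choice, and `dtm`, a document-term matrix as built earlier, are assumptions; evaluating on the training data only is a simplification):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

candidate_k = [5, 10, 20, 30, 50]
scores = []
for k in candidate_k:
    model = LatentDirichletAllocation(n_components=k, random_state=1)
    model.fit(dtm)                     # dtm: document-term matrix from preprocessing
    scores.append(model.score(dtm))    # approximate log-likelihood (higher is better)

# normalize to [0, 1] so different metrics can share one y-axis
scores = np.array(scores)
normalized = (scores - scores.min()) / (scores.max() - scores.min())
print(dict(zip(candidate_k, normalized.round(2))))
```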
Problem: Your data does not meet the assumptions for LDA, e.g.:
Solutions:
Problem: You do not know which priors to set (num. topics $K$, $\alpha$, $\beta$)
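One common approach is to treat them as hyperparameters and compare quality metrics for several candidate values (as in the sketch above for $K$). For illustration, in scikit-learn the two priors can be set explicitly (these parameter names are sklearn-specific; other libraries call them `alpha` and `eta`):

```python
from sklearn.decomposition import LatentDirichletAllocation

# doc_topic_prior corresponds to alpha, topic_word_prior to beta
model = LatentDirichletAllocation(n_components=20,
                                  doc_topic_prior=0.1,     # alpha
                                  topic_word_prior=0.01,   # beta
                                  random_state=1)
```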
Problem: A lot of uninformative words appear in my topics
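A common remedy is stricter preprocessing before building the document-term matrix: remove stop words and terms that occur in almost all or in very few documents. A sketch with `CountVectorizer` (the library and the thresholds are assumptions):

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(
    stop_words="english",   # drop common English stop words
    max_df=0.9,             # drop terms occurring in more than 90% of documents
    min_df=5,               # drop terms occurring in fewer than 5 documents
)
dtm = vectorizer.fit_transform(docs)
```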
Problem: Some topics are very general and uninformative
Problem: Computing the models takes very long
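One option is to use an implementation that runs on several CPU cores, e.g. gensim's `LdaMulticore` (the library choice and the parameter values are assumptions; reducing the number of iterations/passes is another option):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaMulticore

# `texts` is assumed to be a list of tokenized documents (lists of strings)
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(doc) for doc in texts]

model = LdaMulticore(corpus, num_topics=20, id2word=dictionary,
                     workers=4,    # number of worker processes
                     passes=10)
```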
Problem: Out of memory errors (corpus is too big for memory)
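A common remedy is to keep the document-term data sparse and to stream documents from disk instead of holding the whole corpus in memory. Gensim models accept any iterable of bag-of-words documents, so a streaming corpus class works; a sketch (the file `corpus.txt` with one document per line is hypothetical):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

class StreamedCorpus:
    """Yields one bag-of-words document at a time instead of
    loading the whole corpus into memory."""
    def __init__(self, path, dictionary):
        self.path = path
        self.dictionary = dictionary

    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for line in f:    # one document per line
                yield self.dictionary.doc2bow(line.split())

dictionary = Dictionary(line.split() for line in open("corpus.txt", encoding="utf-8"))
model = LdaModel(StreamedCorpus("corpus.txt", dictionary),
                 num_topics=20, id2word=dictionary)
```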