Topic modeling is a method to discover abstract topics within a collection of documents.
Each collection of documents (corpus) contains a "latent" or "hidden" structure of topics. Some topics are more prominent in the whole corpus than others. Each document covers multiple topics, each to a different extent.
The latent variable $z$ describes the topic structure, as each word of each document is thought to be implicitly assigned to a topic.
*Latent Dirichlet Allocation (LDA)* is a topic model that assumes:
* each topic is a distribution over the words in the corpus
* each document is a mixture of topics
* the order of words within a document is irrelevant (bag-of-words assumption)
LDA has been used successfully for hypothesis testing (Fligstein et al. 2017).
Original: "In the first World Cup game, he **sat** only on the bank."
Tokens (after lowercasing, lemmatization and stop word removal): first, world, cup, game, he, sit, bank
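A minimal sketch of such a preprocessing pipeline. The lemma dictionary and stop word list below are toy stand-ins chosen to reproduce the example; a real pipeline would use a proper lemmatizer (e.g. spaCy or NLTK) and a full stop word list:

```python
import re

# Toy stand-ins for illustration only
LEMMAS = {"sat": "sit"}
STOP_WORDS = {"in", "the", "only", "on"}

def preprocess(text):
    """Lowercase, tokenize, lemmatize and remove stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [LEMMAS.get(t, t) for t in tokens if t not in STOP_WORDS]

print(preprocess("In the first World Cup game, he sat only on the bank."))
# → ['first', 'world', 'cup', 'game', 'he', 'sit', 'bank']
```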
An LDA topic model can be described by two distributions:
Given this model, we can generate words:
Each topic has a distribution over all words in the corpus:
| topic | word 1 | word 2 | word 3 | word 4 | possible interpretation |
|---|---|---|---|---|---|
| topic 1 | 0.5 | 0.4 | 0.0 | 0.1 | russian politics |
| topic 2 | 0.1 | 0.0 | 0.7 | 0.2 | soccer in russia |
| topic 3 | 0.3 | 0.0 | 0.0 | 0.7 | russian economy |
Each document has a different distribution over all topics:
→ describes the documents (which topics are important for them)
| document | topic 1 | topic 2 | topic 3 | possible interpretation |
|---|---|---|---|---|
| doc. 1 | 0.0 | 0.3 | 0.7 | mostly about soccer and russian economy |
| doc. 2 | 0.9 | 0.0 | 0.1 | russian politics and a bit of economy |
| doc. 3 | 0.3 | 0.5 | 0.2 | all three topics |
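Using the two toy tables above as the topic–word distribution $\phi$ and the document–topic distribution $\theta$, the generative process can be sketched as follows (word and topic indices stand in for the actual vocabulary):

```python
import numpy as np

rng = np.random.default_rng(42)

# phi: one row per topic, one column per vocabulary word (rows sum to 1)
phi = np.array([[0.5, 0.4, 0.0, 0.1],   # topic 1
                [0.1, 0.0, 0.7, 0.2],   # topic 2
                [0.3, 0.0, 0.0, 0.7]])  # topic 3

# theta: one row per document, one column per topic (rows sum to 1)
theta = np.array([[0.0, 0.3, 0.7],   # doc. 1
                  [0.9, 0.0, 0.1],   # doc. 2
                  [0.3, 0.5, 0.2]])  # doc. 3

def generate_doc(doc_idx, n_words):
    """For each word: draw a topic z from theta, then a word w from phi[z]."""
    words = []
    for _ in range(n_words):
        z = rng.choice(len(phi), p=theta[doc_idx])       # pick a topic
        w = rng.choice(phi.shape[1], p=phi[z])           # pick a word from it
        words.append(w)
    return words

print(generate_doc(1, 10))  # ten word indices generated for doc. 2
```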
Direct calculation is not possible – it involves $K^n$ terms, with $n$ being the number of tokens in the corpus (Griffiths & Steyvers 2004).
Estimation is done approximately instead, either with:
* variational inference (variational EM), or
* (collapsed) Gibbs sampling (MCMC)
Both are rather complicated algorithms.
The Gibbs sampling algorithm uses iterative resampling:
* First: initialize a random $Z$ – each word is randomly assigned to a topic
* Then: iterate repeatedly over all words, resampling each word's topic assignment conditioned on all other current assignments
* Finally: determine $\phi$ and $\theta$ from the current $Z$
* the required number of iterations depends on corpus size (~1000 iterations is usually fine – the log-likelihood must converge)
The tm_toy_lda notebook shows a step-by-step implementation.
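The steps above can be sketched as a toy collapsed Gibbs sampler. This is a sketch only; the function name, hyperparameter values and mini-corpus are my own and not taken from the tm_toy_lda notebook:

```python
import numpy as np

def toy_gibbs_lda(docs, K, V, iters=100, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of word ids in [0, V)."""
    rng = np.random.default_rng(seed)
    # First: initialize a random Z – each word randomly assigned to a topic
    z = [rng.integers(K, size=len(doc)) for doc in docs]
    ndk = np.zeros((len(docs), K))  # topic counts per document
    nkw = np.zeros((K, V))          # word counts per topic
    nk = np.zeros(K)                # total word count per topic
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            ndk[d, z[d][i]] += 1; nkw[z[d][i], w] += 1; nk[z[d][i]] += 1
    # Then: iteratively resample each word's topic given all other assignments
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    # Finally: determine theta and phi from the current Z
    theta = (ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True)
    phi = (nkw + beta) / (nkw + beta).sum(axis=1, keepdims=True)
    return theta, phi

# Tiny corpus: three documents as word-id lists over a vocabulary of size 4
docs = [[0, 0, 1, 0], [2, 3, 2, 2], [3, 3, 0, 3]]
theta, phi = toy_gibbs_lda(docs, K=2, V=4)
```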
* unless your random number generator is manually set to a certain state and your data and hyperparameters are the same
Note: the x-axis is the number of topics, the y-axis is a normalized scale for the different "model quality" metrics
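Such a comparison can be produced by fitting one model per candidate number of topics and recording a quality metric. A sketch with scikit-learn's LDA and its perplexity score (scikit-learn, the metric and the toy corpus are my assumptions, not necessarily what was used for the plot):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus for illustration only
docs = ["putin election russia government",
        "world cup game goal russia",
        "ruble economy inflation bank",
        "election government ruble economy"]
X = CountVectorizer().fit_transform(docs)

# Fit one model per candidate K and record its perplexity (lower is better)
perplexity_by_k = {}
for K in range(2, 5):
    lda = LatentDirichletAllocation(n_components=K, random_state=0)
    lda.fit(X)
    perplexity_by_k[K] = lda.perplexity(X)

print(perplexity_by_k)
```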
Problem: Your data does not meet the assumptions for LDA, e.g.:
Problem: You do not know which priors to set (num. topics $K$, $\alpha$, $\beta$)
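If no informed choice is available, a small grid search over the priors is one option. In scikit-learn (an assumption here) the priors correspond to the `doc_topic_prior` ($\alpha$) and `topic_word_prior` ($\beta$) parameters; the grid values below are arbitrary:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["putin election russia", "world cup game", "ruble economy bank"]
X = CountVectorizer().fit_transform(docs)

best = None
for alpha in (0.01, 0.1, 1.0):      # doc_topic_prior, "alpha"
    for beta in (0.01, 0.1, 1.0):   # topic_word_prior, "beta"
        lda = LatentDirichletAllocation(n_components=2, doc_topic_prior=alpha,
                                        topic_word_prior=beta, random_state=0)
        lda.fit(X)
        score = lda.score(X)  # approximate log-likelihood, higher is better
        if best is None or score > best[0]:
            best = (score, alpha, beta)

print(best)
```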
Problem: A lot of uninformative words appear in the topics
Problem: Some topics are very general and uninformative
Problem: Computing the models takes a very long time
Problem: Out of memory errors (corpus is too big for memory)