# Probabilistic Topic Modeling with LDA

## Practical topic modeling: Preparation, evaluation, visualization

### Python User Group Workshop

May 17, 2018

Material will be available at: http://dsspace.wzb.eu/pyug/topicmodeling2/

## Outline

• Recap:
  • Topic Modeling in a nutshell
  • Hyperparameters $K$, $\alpha$ and $\beta$
• A topic model for the parliamentary debates of the 18th German Bundestag
  • Data overview
  • Data preparation
  • Model evaluation and selection with model quality metrics
  • Visualization
  • Some results from the topic model

## Recap

### What is Topic Modeling?

Topic modeling is an unsupervised machine learning method to discover abstract topics within a collection of unlabelled documents.

Each collection of documents (corpus) contains a "latent" or "hidden" structure of topics. Some topics are more prominent in the corpus than others, and each document covers multiple topics, each to a different degree.

The latent variable $z$ describes the topic structure, as each word of each document is thought to be implicitly assigned to a topic.

### The LDA topic model

General idea: each document is generated from a mixture of topics and each of those topics is a mixture of words

LDA stands for Latent Dirichlet Allocation, which can be read as (Tufts 2018):

• Latent: the topic structures in a document are hidden structures in the text
• Dirichlet: the Dirichlet distribution determines the mixture proportions of the topics in the documents and of the words in each topic
• Allocation: words are allocated (assigned) to topics

### The LDA topic model – Assumptions

• order of words in documents does not matter → "bag of words" model
• order of documents* in a corpus does not matter
• number of topics $K$ is known (has to be set in advance)

* documents can be anything (news articles, scientific articles, books, chapters of books, paragraphs, etc.)

### The LDA topic model

An LDA topic model (i.e. its "mixtures") can be described by two distributions:

• a topic-word distribution $\phi$: each topic has a distribution over a fixed vocabulary of $W$ words
• a document-topic distribution $\theta$: each document has a distribution over a fixed number of topics $K$

### topic-word distribution $\phi$

What are the topics that appear in the corpus? Which words are prominent in which topics?

Each topic has a distribution over all words in the corpus (vocabulary):

| topic | russia | putin | soccer | bank | finance | possible interpretation |
|---|---|---|---|---|---|---|
| topic 1 | 0.4 | 0.4 | 0.0 | 0.1 | 0.1 | russian politics |
| topic 2 | 0.3 | 0.0 | 0.6 | 0.1 | 0.0 | soccer in russia |
| topic 3 | 0.2 | 0.0 | 0.0 | 0.4 | 0.4 | russian economy |

• $K$ (num. of topics) distributions across $W$ unique words
• topics are a mixture of words → have different weights on words
• topics are abstract – interpretation by examining the distribution
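Interpreting a topic means looking at its highest-weighted words. A minimal numpy sketch, using the toy values from the table above (function and variable names are my own):

```python
import numpy as np

vocab = ["russia", "putin", "soccer", "bank", "finance"]

# topic-word distribution phi: K=3 topics over W=5 vocabulary words
# (rows are the toy values from the table above; each row sums to 1)
phi = np.array([
    [0.4, 0.4, 0.0, 0.1, 0.1],   # "russian politics"
    [0.3, 0.0, 0.6, 0.1, 0.0],   # "soccer in russia"
    [0.2, 0.0, 0.0, 0.4, 0.4],   # "russian economy"
])

def top_words(phi_row, vocab, n=2):
    """Return the n highest-weighted words of one topic."""
    order = np.argsort(phi_row)[::-1][:n]
    return [vocab[i] for i in order]

for k, row in enumerate(phi):
    print(f"topic {k + 1}: {top_words(row, vocab)}")
```

Examining such word lists per topic is exactly how the "possible interpretation" column above is produced by a human reader.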

### document-topic distribution $\theta$

Which topics appear in which documents?

Each document has a different distribution over all topics:

| document | topic 1 | topic 2 | topic 3 | possible interpretation |
|---|---|---|---|---|
| doc. 1 | 0.0 | 0.3 | 0.7 | mostly about soccer and russian economy |
| doc. 2 | 0.9 | 0.0 | 0.1 | russian politics and a bit of economy |
| doc. 3 | 0.3 | 0.5 | 0.2 | all three topics |

• $D$ (num. of documents) distributions across $K$ topics
• documents are a mixture of topics, each to a different proportion
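The same toy values can be inspected programmatically, e.g. to find each document's most prominent topic (numpy assumed; names are illustrative):

```python
import numpy as np

# document-topic distribution theta: D=3 documents over K=3 topics
# (rows are the toy values from the table above; each row sums to 1)
theta = np.array([
    [0.0, 0.3, 0.7],   # doc. 1
    [0.9, 0.0, 0.1],   # doc. 2
    [0.3, 0.5, 0.2],   # doc. 3
])

# most prominent topic per document (0-based topic indices)
dominant = theta.argmax(axis=1)
print(dominant)  # -> [2 0 1]
```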

## How do we estimate $\phi$ and $\theta$?

Either: a variational Expectation-Maximization (EM) algorithm – an optimization approach (Blei, Ng & Jordan 2003)

Or: Gibbs sampling algorithm – a "random walk" algorithm (Griffiths & Steyvers 2004) with iterative resampling
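The Gibbs sampling approach fits in a few lines. This is a minimal, unoptimized collapsed Gibbs sampler written for illustration (numpy assumed; function and variable names are my own, not from any library):

```python
import numpy as np

def gibbs_lda(docs, K, W, alpha=0.1, beta=0.01, iters=100, seed=0):
    """Collapsed Gibbs sampling for LDA (illustrative sketch, not optimized).

    docs: list of documents, each a list of word ids in [0, W).
    Returns point estimates of (theta, phi) from the final assignments.
    """
    rng = np.random.default_rng(seed)
    D = len(docs)
    ndk = np.zeros((D, K))   # topic counts per document
    nkw = np.zeros((K, W))   # word counts per topic
    nk = np.zeros(K)         # total word count per topic
    z = []                   # topic assignment of every word token
    for d, doc in enumerate(docs):          # random initialization
        zd = rng.integers(0, K, size=len(doc))
        z.append(zd)
        for w, t in zip(doc, zd):
            ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    for _ in range(iters):                  # iterative resampling
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                ndk[d, t] -= 1; nkw[t, w] -= 1; nk[t] -= 1   # remove token
                # full conditional p(z_i = k | all other assignments)
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + W * beta)
                t = rng.choice(K, p=p / p.sum())             # resample topic
                z[d][i] = t
                ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1   # add back
    theta = (ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True)
    phi = (nkw + beta) / (nkw + beta).sum(axis=1, keepdims=True)
    return theta, phi

# tiny toy corpus over a vocabulary of W=4 word ids
theta, phi = gibbs_lda([[0, 0, 1], [2, 2, 3]], K=2, W=4)
print(theta.round(2))
```

In practice you would use a library implementation (e.g. gensim or the `lda` package) rather than this sketch, but the count matrices and resampling step are the core of the "random walk" idea.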

## Hyperparameters in LDA

Three hyperparameters specify prior beliefs about the data:

• number of topics $K$ – can be determined with model quality metrics
• concentration parameters $\alpha$ and $\beta$ – control the sparsity of topics ($\alpha$) and words ($\beta$)

There is no single "correct" set of hyperparameters; you choose whether you want few (but more general) topics or many (but more specific) topics.

### $\alpha$ as prior belief on sparsity of topics in the documents

• when using **high $\alpha$**: each document covers many topics (lower impact of topic sparsity)
• when using **low $\alpha$**: each document covers only few topics (higher impact of topic sparsity)
• $\alpha$ is often set to a fraction of the number of topics $K$, e.g. $\alpha=1/K$
→ with increasing $K$, we expect that each document covers fewer, but more specific topics

### $\beta$ as prior belief on sparsity of words in the topics

• when using **high $\beta$**: each topic consists of many words (lower impact of word sparsity) → more general topics
• when using **low $\beta$**: each topic consists of few words (higher impact of word sparsity) → more specific topics
• $\beta$ can be used to control "granularity" of a topic model
• high $\beta$: fewer topics, more general
• low $\beta$: more topics, more specific
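The sparsity effect of these priors can be illustrated by sampling proportions directly from symmetric Dirichlet distributions (numpy assumed; the 10% threshold below is an arbitrary choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
K = 10  # number of topics

# document-topic proportions drawn from symmetric Dirichlet priors
sparse = rng.dirichlet([0.1] * K, size=1000)   # low alpha
dense = rng.dirichlet([5.0] * K, size=1000)    # high alpha

# how many topics exceed 10% probability in an average document?
print((sparse > 0.1).sum(axis=1).mean())   # low alpha: few topics per doc
print((dense > 0.1).sum(axis=1).mean())    # high alpha: many topics per doc
```

The same reasoning applies to $\beta$ and the topic-word proportions: a lower concentration parameter pushes each distribution's mass onto fewer entries.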

# A topic model for the parliamentary debates of the 18th German Bundestag

## The data

### Further notes on the data

• data was chosen to act as an example (i.e. not driven by a research question)
• selected as example because:
• data is not trivial to prepare for topic modeling (we'll see why)
• it's in German (more difficult to preprocess than English)
• amount of data is neither too small nor too big (i.e. does not take ages to compute)
• results can be compared with analyses from offenesparlament.de

## Characteristics of the data

• CSV files for each plenary session (UTF-8 encoded) with variables:
  • sequence: chronological order
  • speaker: linked to speaker metadata like age, party, etc.
  • top ("Tagesordnungspunkt", i.e. agenda item): ranges from very specific ("Bundeswehreinsatz in Südsudan" – Bundeswehr deployment in South Sudan) to very general ("Fragestunde" – question time)
  • type: categorical "chair", "poi" or "speech" – we only need "speech"
  • text: the speaker's statement
• missing data:
  • session #191 was not split into individual speeches (i.e. it is a single huge entry)
• amount:
  • 243 sessions (excl. #191) with 136,932 speech records in total

## Examine your raw data closely!

• speeches are divided into several entries (split each time applause, shouts or other calls interrupt the speaker) → should be merged together
• consecutive speech entries with the same speaker / same TOP form a speech
• good side effect: avoids problems with very short speech entries*
• group data by speaker and TOP → concatenate each group's text fields

* the length of your individual documents should be neither too imbalanced nor too short for topic modeling
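The merging step above can be sketched with the standard library; the rows and field names below are hypothetical stand-ins for the actual CSV columns:

```python
from itertools import groupby

# hypothetical rows from one session CSV, already in chronological order
rows = [
    {"speaker": "A", "top": "Fragestunde", "type": "speech",
     "text": "First part."},
    {"speaker": "A", "top": "Fragestunde", "type": "speech",
     "text": "Continued after applause."},
    {"speaker": "B", "top": "Fragestunde", "type": "speech",
     "text": "Reply."},
]

# keep only real speech entries, then merge *consecutive* entries with the
# same speaker and agenda item (TOP) into one document
speeches = [r for r in rows if r["type"] == "speech"]
merged = [
    " ".join(r["text"] for r in group)
    for _, group in groupby(speeches, key=lambda r: (r["speaker"], r["top"]))
]
print(merged)
```

With a DataFrame library the same grouping would be a groupby-and-join over consecutive runs; `itertools.groupby` is used here only to keep the sketch dependency-free.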

## Data preparation for Topic Modeling

• LDA works under the bag-of-words assumption; each document is just a vector of word counts → word order does not matter
• textual data must be transformed to Document-Term-Matrix (DTM):

$D_1$: "Regarding the financial situation of Russia, President Putin said ..."
$D_2$: "In the first soccer game, he only sat on the bank ..."
$D_3$: "The conference on banking and finance ..."

| document | russia | putin | soccer | bank | finance | ... |
|---|---|---|---|---|---|---|
| $D_1$ | 3 | 1 | 0 | 1 | 2 | ... |
| $D_2$ | 0 | 0 | 2 | 1 | 0 | ... |
| $D_3$ | 0 | 0 | 0 | 2 | 4 | ... |
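Building a DTM can be sketched with the standard library alone (the counts in the table above are illustrative; this sketch counts actual occurrences in simplified, lowercased versions of the three example sentences):

```python
from collections import Counter

docs = [
    "regarding the financial situation of russia president putin said",
    "in the first soccer game he only sat on the bank",
    "the conference on banking and finance",
]

# build a fixed vocabulary, then count term frequencies per document
tokens = [d.split() for d in docs]
vocab = sorted(set(w for doc in tokens for w in doc))
dtm = [[Counter(doc)[w] for w in vocab] for doc in tokens]

print(dtm[0][vocab.index("russia")])  # 1
```

In practice you would use a vectorizer from a library such as scikit-learn or gensim, which also handles tokenization, normalization and vocabulary pruning.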

## Text preprocessing pipeline

Example: Herr Schröder, Sie hatten das Stichwort „Sportgroßveranstaltungen“ bemüht. Dazu sage ich... ("Mr. Schröder, you invoked the term 'major sporting events'. To that I say...")

| Step | Method | Output |
|---|---|---|
| 1 | tokenize | [Herr / Schröder / , / Sie / hatten / das / Stichwort / „Sportgroßveranstaltungen / “ / bemüht / . / Dazu / sage / ich / ...] |
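A naive regex tokenizer roughly reproduces step 1 (unlike the slide's tokenizer, this simple pattern splits the opening quote off as its own token; a real German pipeline would use a proper tokenizer):

```python
import re

sentence = ("Herr Schröder, Sie hatten das Stichwort "
            "„Sportgroßveranstaltungen“ bemüht. Dazu sage ich...")

# naive tokenizer: runs of word characters, or single punctuation marks;
# \w matches Unicode letters in Python 3, so ö and ß are handled
tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(tokens[:4])  # ['Herr', 'Schröder', ',', 'Sie']
```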

### Model inspection results

• fewer overly general topics and words
• still ~10–20 uninformative topics remain (incoherent and/or too general)

→ either tune further or ignore the topics identified as uninformative
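One simple heuristic for flagging overly general topics is to look for word distributions that are close to uniform, i.e. have near-maximal entropy. This is only one of several possible metrics, and the 95% threshold below is an arbitrary illustration (numpy assumed):

```python
import numpy as np

# toy topic-word matrix: rows are topics over a 5-word vocabulary
phi = np.array([
    [0.20, 0.20, 0.20, 0.20, 0.20],   # near-uniform -> likely uninformative
    [0.70, 0.15, 0.05, 0.05, 0.05],   # peaked -> likely interpretable
])

# flag topics whose entropy is close to the maximum log(W)
entropy = -(phi * np.log(phi)).sum(axis=1)
max_entropy = np.log(phi.shape[1])
uninformative = entropy > 0.95 * max_entropy
print(uninformative)  # -> [ True False]
```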