# Probabilistic Topic Modeling with LDA

## Practical topic modeling: Preparation, evaluation, visualization

### Python User Group Workshop

May 17, 2018

Material will be available at: http://dsspace.wzb.eu/pyug/topicmodeling2/

## Outline

• Recap:
  • Topic Modeling in a nutshell
  • Hyperparameters $K$, $\alpha$ and $\beta$
• A topic model for the parliamentary debates of the 18th German Bundestag
  • Data overview
  • Data preparation
  • Model evaluation and selection with model quality metrics
  • Visualization
  • Some results from the topic model

## Recap

### What is Topic Modeling?

Topic modeling is an unsupervised machine learning method to discover abstract topics within a collection of unlabelled documents.

Each collection of documents (corpus) contains a "latent" or "hidden" structure of topics. Some topics are more prominent in the corpus than others, and each document covers multiple topics, each to a different degree.

The latent variable $z$ describes the topic structure, as each word of each document is thought to be implicitly assigned to a topic.

### The LDA topic model

General idea: each document is generated from a mixture of topics and each of those topics is a mixture of words

LDA stands for Latent Dirichlet Allocation, which can be read as (Tufts 2018):

• Latent: the topic structures in a document are hidden structures in the text
• Dirichlet: the Dirichlet distribution determines the mixture proportions of the topics in the documents and of the words in each topic
• Allocation: words are allocated (assigned) to topics

### The LDA topic model – Assumptions

• order of words in documents does not matter → "bag of words" model
• order of documents* in a corpus does not matter
• number of topics $K$ is known (has to be set in advance)

* documents can be anything (news articles, scientific articles, books, chapters of books, paragraphs, etc.)

### The LDA topic model

An LDA topic model (i.e. its "mixtures") can be described by two distributions:

• a topic-word distribution $\phi$: each topic has a distribution over a fixed vocabulary of $W$ words
• a document-topic distribution $\theta$: each document has a distribution over a fixed number of topics $K$

### topic-word distribution $\phi$

What are the topics that appear in the corpus? Which words are prominent in which topics?

Each topic has a distribution over all words in the corpus (vocabulary):

| topic | russia | putin | soccer | bank | finance | possible interpretation |
|---|---|---|---|---|---|---|
| topic 1 | 0.4 | 0.4 | 0.0 | 0.1 | 0.1 | russian politics |
| topic 2 | 0.3 | 0.0 | 0.6 | 0.1 | 0.0 | soccer in russia |
| topic 3 | 0.2 | 0.0 | 0.0 | 0.4 | 0.4 | russian economy |

• $K$ (num. of topics) distributions across $W$ unique words
• topics are a mixture of words → have different weights on words
• topics are abstract – interpretation by examining the distribution
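Interpreting a topic means looking at its highest-weighted words. A minimal numpy sketch, using the toy values from the table above (function and variable names are my own):

```python
import numpy as np

vocab = ["russia", "putin", "soccer", "bank", "finance"]

# topic-word distribution phi: K=3 topics over W=5 vocabulary words
# (rows are the toy values from the table above; each row sums to 1)
phi = np.array([
    [0.4, 0.4, 0.0, 0.1, 0.1],   # "russian politics"
    [0.3, 0.0, 0.6, 0.1, 0.0],   # "soccer in russia"
    [0.2, 0.0, 0.0, 0.4, 0.4],   # "russian economy"
])

def top_words(phi_row, vocab, n=2):
    """Return the n highest-weighted words of one topic."""
    order = np.argsort(phi_row)[::-1][:n]
    return [vocab[i] for i in order]

for k, row in enumerate(phi):
    print(f"topic {k + 1}: {top_words(row, vocab)}")
```

Examining such word lists per topic is exactly how the "possible interpretation" column above is produced by a human reader.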

### document-topic distribution $\theta$

Which topics appear in which documents?

Each document has a different distribution over all topics:

| document | topic 1 | topic 2 | topic 3 | possible interpretation |
|---|---|---|---|---|
| doc. 1 | 0.0 | 0.3 | 0.7 | mostly about soccer and russian economy |
| doc. 2 | 0.9 | 0.0 | 0.1 | russian politics and a bit of economy |
| doc. 3 | 0.3 | 0.5 | 0.2 | all three topics |

• $D$ (num. of documents) distributions across $K$ topics
• documents are a mixture of topics, each to a different proportion
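The same toy values can be inspected programmatically, e.g. to find each document's most prominent topic (numpy assumed; names are illustrative):

```python
import numpy as np

# document-topic distribution theta: D=3 documents over K=3 topics
# (rows are the toy values from the table above; each row sums to 1)
theta = np.array([
    [0.0, 0.3, 0.7],   # doc. 1
    [0.9, 0.0, 0.1],   # doc. 2
    [0.3, 0.5, 0.2],   # doc. 3
])

# most prominent topic per document (0-based topic indices)
dominant = theta.argmax(axis=1)
print(dominant)  # -> [2 0 1]
```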

## How do we estimate $\phi$ and $\theta$?

Either: a variational Expectation-Maximization (EM) algorithm – an optimization approach (Blei, Ng & Jordan 2003)

Or: Gibbs sampling algorithm – a "random walk" algorithm (Griffiths & Steyvers 2004) with iterative resampling
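The Gibbs sampling approach fits in a few lines. This is a minimal, unoptimized collapsed Gibbs sampler written for illustration (numpy assumed; function and variable names are my own, not from any library):

```python
import numpy as np

def gibbs_lda(docs, K, W, alpha=0.1, beta=0.01, iters=100, seed=0):
    """Collapsed Gibbs sampling for LDA (illustrative sketch, not optimized).

    docs: list of documents, each a list of word ids in [0, W).
    Returns point estimates of (theta, phi) from the final assignments.
    """
    rng = np.random.default_rng(seed)
    D = len(docs)
    ndk = np.zeros((D, K))   # topic counts per document
    nkw = np.zeros((K, W))   # word counts per topic
    nk = np.zeros(K)         # total word count per topic
    z = []                   # topic assignment of every word token
    for d, doc in enumerate(docs):          # random initialization
        zd = rng.integers(0, K, size=len(doc))
        z.append(zd)
        for w, t in zip(doc, zd):
            ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    for _ in range(iters):                  # iterative resampling
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                ndk[d, t] -= 1; nkw[t, w] -= 1; nk[t] -= 1   # remove token
                # full conditional p(z_i = k | all other assignments)
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + W * beta)
                t = rng.choice(K, p=p / p.sum())             # resample topic
                z[d][i] = t
                ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1   # add back
    theta = (ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True)
    phi = (nkw + beta) / (nkw + beta).sum(axis=1, keepdims=True)
    return theta, phi

# tiny toy corpus over a vocabulary of W=4 word ids
theta, phi = gibbs_lda([[0, 0, 1], [2, 2, 3]], K=2, W=4)
print(theta.round(2))
```

In practice you would use a library implementation (e.g. gensim or the `lda` package) rather than this sketch, but the count matrices and resampling step are the core of the "random walk" idea.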

## Hyperparameters in LDA

Three hyperparameters specify prior beliefs about the data:

• number of topics $K$ – can be determined with model quality metrics
• concentration parameters $\alpha$ and $\beta$ – control the sparsity of topics ($\alpha$) and words ($\beta$)

There is no single "correct" set of hyperparameters; you choose whether you want few (but more general) topics or many (but more specific) topics.

### $\alpha$ as prior belief on sparsity of topics in the documents

• when using **high $\alpha$**: each document covers many topics (lower impact of topic sparsity)
• when using **low $\alpha$**: each document covers only few topics (higher impact of topic sparsity)
• $\alpha$ is often set to a fraction of the number of topics $K$, e.g. $\alpha=1/K$
→ with increasing $K$, we expect that each document covers fewer, but more specific topics

### $\beta$ as prior belief on sparsity of words in the topics

• when using **high $\beta$**: each topic consists of many words (lower impact of word sparsity) → more general topics
• when using **low $\beta$**: each topic consists of few words (higher impact of word sparsity) → more specific topics
• $\beta$ can be used to control "granularity" of a topic model
• high $\beta$: fewer topics, more general
• low $\beta$: more topics, more specific
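The sparsity effect of these priors can be illustrated by sampling proportions directly from symmetric Dirichlet distributions (numpy assumed; the 10% threshold below is an arbitrary choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
K = 10  # number of topics

# document-topic proportions drawn from symmetric Dirichlet priors
sparse = rng.dirichlet([0.1] * K, size=1000)   # low alpha
dense = rng.dirichlet([5.0] * K, size=1000)    # high alpha

# how many topics exceed 10% probability in an average document?
print((sparse > 0.1).sum(axis=1).mean())   # low alpha: few topics per doc
print((dense > 0.1).sum(axis=1).mean())    # high alpha: many topics per doc
```

The same reasoning applies to $\beta$ and the topic-word proportions: a lower concentration parameter pushes each distribution's mass onto fewer entries.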

# A topic model for the parliamentary debates of the 18th German Bundestag

## The data

### Further notes on the data

• data was chosen to act as an example (i.e. not driven by a research question)
• selected as example because:
• data is not trivial to prepare for topic modeling (we'll see why)
• it's in German (more difficult to preprocess than English)
• amount of data is neither too small nor too big (i.e. does not take ages to compute)
• results can be compared with analyses from offenesparlament.de

## Characteristics of the data

• CSV files for each plenary session (UTF-8 encoded) with variables:
  • sequence: chronological order
  • speaker: linked to speaker metadata like age, party, etc.
  • top ("Tagesordnungspunkt", i.e. agenda item): ranges from very specific ("Bundeswehreinsatz in Südsudan" – Bundeswehr deployment in South Sudan) to very general ("Fragestunde" – question time)
  • type: categorical "chair", "poi" or "speech" – we only need "speech"
  • text: the speaker's statement
• missing data:
  • session #191 was not split into individual speeches (i.e. it is a single huge entry)
• amount:
  • 243 sessions (excl. #191) with 136,932 speech records in total

## Examine your raw data closely!

• speeches are divided into several entries (split each time applause, shouts or other calls interrupt the speaker) → should be merged together
• consecutive speech entries with the same speaker / same TOP form a speech
• good side effect: avoids problems with very short speech entries*
• group data by speaker and TOP → concatenate each group's text fields

* the length of your individual documents should be neither too imbalanced nor too short for topic modeling
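The merging step above can be sketched with the standard library; the rows and field names below are hypothetical stand-ins for the actual CSV columns:

```python
from itertools import groupby

# hypothetical rows from one session CSV, already in chronological order
rows = [
    {"speaker": "A", "top": "Fragestunde", "type": "speech",
     "text": "First part."},
    {"speaker": "A", "top": "Fragestunde", "type": "speech",
     "text": "Continued after applause."},
    {"speaker": "B", "top": "Fragestunde", "type": "speech",
     "text": "Reply."},
]

# keep only real speech entries, then merge *consecutive* entries with the
# same speaker and agenda item (TOP) into one document
speeches = [r for r in rows if r["type"] == "speech"]
merged = [
    " ".join(r["text"] for r in group)
    for _, group in groupby(speeches, key=lambda r: (r["speaker"], r["top"]))
]
print(merged)
```

With a DataFrame library the same grouping would be a groupby-and-join over consecutive runs; `itertools.groupby` is used here only to keep the sketch dependency-free.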

## Data preparation for Topic Modeling

• LDA works under the bag-of-words assumption; each document is just a vector of word counts → word order does not matter
• textual data must be transformed to Document-Term-Matrix (DTM):

$D_1$: "Regarding the financial situation of Russia, President Putin said ..."
$D_2$: "In the first soccer game, he only sat on the bank ..."
$D_3$: "The conference on banking and finance ..."

| document | russia | putin | soccer | bank | finance | ... |
|---|---|---|---|---|---|---|
| $D_1$ | 3 | 1 | 0 | 1 | 2 | ... |
| $D_2$ | 0 | 0 | 2 | 1 | 0 | ... |
| $D_3$ | 0 | 0 | 0 | 2 | 4 | ... |
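Building a DTM can be sketched with the standard library alone (the counts in the table above are illustrative; this sketch counts actual occurrences in simplified, lowercased versions of the three example sentences):

```python
from collections import Counter

docs = [
    "regarding the financial situation of russia president putin said",
    "in the first soccer game he only sat on the bank",
    "the conference on banking and finance",
]

# build a fixed vocabulary, then count term frequencies per document
tokens = [d.split() for d in docs]
vocab = sorted(set(w for doc in tokens for w in doc))
dtm = [[Counter(doc)[w] for w in vocab] for doc in tokens]

print(dtm[0][vocab.index("russia")])  # 1
```

In practice you would use a vectorizer from a library such as scikit-learn or gensim, which also handles tokenization, normalization and vocabulary pruning.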

## Text preprocessing pipeline

Example: Herr Schröder, Sie hatten das Stichwort „Sportgroßveranstaltungen“ bemüht. Dazu sage ich... ("Mr. Schröder, you invoked the term 'major sporting events'. To that I say...")

| Step | Method | Output |
|---|---|---|
| 1 | tokenize | [Herr / Schröder / , / Sie / hatten / das / Stichwort / „Sportgroßveranstaltungen / “ / bemüht / . / Dazu / sage / ich / ...] |
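A naive regex tokenizer roughly reproduces step 1 (unlike the slide's tokenizer, this simple pattern splits the opening quote off as its own token; a real German pipeline would use a proper tokenizer):

```python
import re

sentence = ("Herr Schröder, Sie hatten das Stichwort "
            "„Sportgroßveranstaltungen“ bemüht. Dazu sage ich...")

# naive tokenizer: runs of word characters, or single punctuation marks;
# \w matches Unicode letters in Python 3, so ö and ß are handled
tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(tokens[:4])  # ['Herr', 'Schröder', ',', 'Sie']
```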

### Model inspection results

• fewer overly general topics and words
• still ~10–20 uninformative topics remain (incoherent and/or too general)

→ either tune further or ignore the topics identified as uninformative
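One simple heuristic for flagging overly general topics is to look for word distributions that are close to uniform, i.e. have near-maximal entropy. This is only one of several possible metrics, and the 95% threshold below is an arbitrary illustration (numpy assumed):

```python
import numpy as np

# toy topic-word matrix: rows are topics over a 5-word vocabulary
phi = np.array([
    [0.20, 0.20, 0.20, 0.20, 0.20],   # near-uniform -> likely uninformative
    [0.70, 0.15, 0.05, 0.05, 0.05],   # peaked -> likely interpretable
])

# flag topics whose entropy is close to the maximum log(W)
entropy = -(phi * np.log(phi)).sum(axis=1)
max_entropy = np.log(phi.shape[1])
uninformative = entropy > 0.95 * max_entropy
print(uninformative)  # -> [ True False]
```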