Text Processing and Feature Extraction for Quantitative Text Analysis

Python User Group Workshop

Markus Konrad markus.konrad@wzb.eu

June 2017


  • Introduction
  • Text preprocessing
    • Text tokenization
    • Text normalization
    • Text parsing and filtering
  • Feature Extraction
    • Bag-of-Words model
    • tf-idf model


Text analysis pipeline:

Collected text files → processed/normalized text data → extracted features → model

Book: D. Sarkar, Text Analytics with Python (Apress, 2016)

Please note: The code examples are only provided to show the basic concepts. You should use the recommended Python packages for real applications!

Text preprocessing

Goal: Transform raw text input into normalized sequence of tokens. Prepare for feature extraction.

"Hi. This is an example sentence in an Example Document." → [hi, example, sentence, example, document] → [1, 2, 1, 1]

Text processing includes many steps and hence many decisions that have a big effect on your results. Several possibilities will be shown here. Whether and how to apply them depends heavily on your data and your later analysis.

The document corpus

A corpus contains the documents that we want to process. Each document can be accessed by a unique document label or document ID. The document itself is usually a (very long) character string (Python type: str) that may contain line breaks.

You normally load a corpus from files, a database or other sources.

In [1]:
# a small toy corpus with some (adapted) German newspaper headlines from June 20th
corpus = {   # document label: document text
    'spon1': 'Mehr Zustimmung zur EU auch wegen Trump – Danke May, danke Trump',
    'spon2': 'Nach Tod von US-Student Warmbier: Trump beschuldigt Nordkorea',
    'focus': 'Tod von US-Student Warmbier – Trump beschuldigt Nordkorea-Regime',
    'xyz': 'EU bleibt EU, aber EU-US-Beziehungen unter Trump weiter angespannt',   # stupid made up headline
}
In [2]:
# access by document label
corpus['spon1']
'Mehr Zustimmung zur EU auch wegen Trump – Danke May, danke Trump'


Text tokenization

Goal: Break down document text into smaller, meaningful components (paragraphs, sentences, words) → from a document, form a list of tokens

In our case: We apply word tokenization, so token = word

With plain Python: calling split() on a string splits it by whitespace:

In [3]:
corpus['xyz'].split()
['EU', 'bleibt', 'EU,', 'aber', 'EU-US-Beziehungen', 'unter', 'Trump', 'weiter', 'angespannt']
In [4]:
corpus['spon2'].split()
['Nach', 'Tod', 'von', 'US-Student', 'Warmbier:', 'Trump', 'beschuldigt', 'Nordkorea']

Tokenization is not trivial.

  • how to handle punctuation, quotes, hyphens?
  • how to handle contractions? ("don't" or "wasn't")

→ depends on your text (language, source/medium)
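
The punctuation question can be made concrete with a small regular-expression tokenizer. This is only a sketch, assuming that words may contain internal hyphens and that every other non-space character becomes its own token. The name `simple_tokenize` is made up for this illustration and is not part of NLTK.

```python
import re

# a word (optionally with internal hyphens) OR one single non-word, non-space character
TOKEN_PATTERN = re.compile(r"\w+(?:-\w+)*|[^\w\s]")

def simple_tokenize(text):
    """Split text into word tokens and separate punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

print(simple_tokenize('Nach Tod von US-Student Warmbier: Trump beschuldigt Nordkorea'))
print(simple_tokenize("I wasn't there."))
```

With this pattern the colon after "Warmbier" becomes a separate token, but "wasn't" is cut at the apostrophe into three pieces. Handling such cases well is one reason to prefer a tested tokenizer for real applications.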

In [5]:
import nltk

# word_tokenize uses TreebankWordTokenizer by default
# set language to "german" to use German punctuation
print(nltk.word_tokenize(corpus['spon2'], language="german"))
['Nach', 'Tod', 'von', 'US-Student', 'Warmbier', ':', 'Trump', 'beschuldigt', 'Nordkorea']
In [6]:
nltk.word_tokenize("I wasn't there.")   # default language is English
['I', 'was', "n't", 'there', '.']
In [7]:
# tokenize whole corpus
tokens = {doc_label: nltk.word_tokenize(text, language="german")
          for doc_label, text in corpus.items()}
tokens.keys()
dict_keys(['spon1', 'xyz', 'focus', 'spon2'])
In [8]:
tokens['spon1']
['Mehr', 'Zustimmung', 'zur', 'EU', 'auch', 'wegen', 'Trump', '–', 'Danke', 'May', ',', 'danke', 'Trump']

Text normalization

Can involve:

  • expanding contractions
  • expanding hyphenated compound words
  • removing special characters
  • case conversion
  • removing stopwords
  • correct spelling
  • stemming / lemmatization

The order is important!

Expanding contractions

  • strategy: make list of all possible contractions and their expanded replacement
  • search & replace with Python using regular expressions
  • not relevant here
  • see "correct spelling" later
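
The list-plus-regular-expression strategy could look roughly like the following sketch. The mapping `CONTRACTIONS` is a deliberately tiny, hypothetical example and `expand_contractions` is a name chosen for this illustration; a real list would be much longer and would have to deal with ambiguous cases ("it's" can mean "it is" or "it has").

```python
import re

# hypothetical mini mapping: contraction -> expanded replacement
CONTRACTIONS = {
    "don't": "do not",
    "wasn't": "was not",
    "it's": "it is",   # assumption: ignores the "it has" reading
}

# one alternation of all known contractions, matched at word boundaries
CONTRACTION_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions(text):
    """Replace each known contraction by its expanded form (the replacement is always lowercase)."""
    return CONTRACTION_PATTERN.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

print(expand_contractions("I wasn't there and don't know."))
```

Expansion should happen before tokenization; otherwise the tokenizer has already split the contraction into pieces.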

Expanding hyphenated compound words

  • how to handle words like "US-Student"?
    • leave as is
    • strip hyphens (see "removing special characters" later)
    • split by hyphens
In [9]:
# example to split by hyphen
split_tokens = []
for t in tokens['focus']:
    split_tokens.extend(t.split('-'))   # a token without a hyphen stays as it is
split_tokens
['Tod', 'von', 'US', 'Student', 'Warmbier', '–', 'Trump', 'beschuldigt', 'Nordkorea', 'Regime']

Problem: Would also split "e-mail" → ["e", "mail"]!

Removing special characters

  • decide which special characters are not of interest → list of special characters that should be removed
  • decision: remove any special characters in tokens/words ("US-Student" → "USStudent") or only sole characters?
  • big effect on later steps, especially Part-of-Speech tagging!

Several ways, e.g. with regular expressions or str.translate:

In [10]:
import string
In [11]:
del_chars = str.maketrans('', '', string.punctuation + '–')   # add another character "–"
print([t.translate(del_chars) for t in tokens['focus']])   # apply table "del_chars"
['Tod', 'von', 'USStudent', 'Warmbier', '', 'Trump', 'beschuldigt', 'NordkoreaRegime']
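
The other option from the list above, removing only tokens that consist solely of special characters, can be sketched as follows. The function name `drop_sole_special_tokens` is made up for this illustration; the token list is copied from the example above.

```python
import string

SPECIAL_CHARS = set(string.punctuation + '–')   # same characters as above, incl. the en dash

def drop_sole_special_tokens(token_list):
    """Keep a token unless every character in it is a special character."""
    return [t for t in token_list if not all(c in SPECIAL_CHARS for c in t)]

focus_tokens = ['Tod', 'von', 'US-Student', 'Warmbier', '–', 'Trump', 'beschuldigt', 'Nordkorea-Regime']
print(drop_sole_special_tokens(focus_tokens))
```

Here "US-Student" and "Nordkorea-Regime" keep their hyphens, and only the stand-alone dash disappears.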

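Our strategy for compound words: Split at a hyphen only if the part before the hyphen is longer than one character. A single-character part (like the "E" in "E-Mail") is merged with the part that follows it.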

In [12]:
def expand_compound_token(t, split_chars="-"):
    parts = []
    add = False   # signals if current part should be appended to previous part
    for p in t.split(split_chars):  # for each part p in compound token t
        if not p: continue  # skip empty part
        if add and parts:   # append current part p to previous part
            parts[-1] += p
        else:               # add p as separate token
            parts.append(p)
        add = len(p) <= 1   # if p only consists of a single character -> append the next p to it
        #add = p.isupper()   # alt. strategy: if p is all uppercase ("US", "E", etc.) -> append the next p to it

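    return parts

print(expand_compound_token('US-Student'))
print(expand_compound_token('Nordkorea-Regime'))
print(expand_compound_token('E-Mail-Provider'))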
['US', 'Student']
['Nordkorea', 'Regime']
['EMail', 'Provider']
In [13]:
tmp_tokens = {}
for doc_label, doc_tok in tokens.items():
    tmp_tokens[doc_label] = []
    for t in doc_tok:
        t_parts = expand_compound_token(t)
        tmp_tokens[doc_label].extend(t_parts)   # add all parts as separate tokens

print('Old:', tokens['focus'])
print('New:', tmp_tokens['focus'])
tokens = tmp_tokens
Old: ['Tod', 'von', 'US-Student', 'Warmbier', '–', 'Trump', 'beschuldigt', 'Nordkorea-Regime']
New: ['Tod', 'von', 'US', 'Student', 'Warmbier', '–', 'Trump', 'beschuldigt', 'Nordkorea', 'Regime']

Case conversion

Usually: convert all words to lowercase.

Can be problematic because of "capitonyms":

  • e.g. in English: "May" ≠ "may", "Pole" ≠ "pole"
  • or in German (much more frequent): "Morgen" ≠ "morgen", "Laut" ≠ "laut"

Proper Part-of-Speech tagging might not be possible afterwards!

Methods in Python: str.lower(), str.upper()

In [14]:
print([t.lower() for t in tokens['focus']])
['tod', 'von', 'us', 'student', 'warmbier', '–', 'trump', 'beschuldigt', 'nordkorea', 'regime']
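
For German text, one detail of case conversion is worth a quick check: `str.lower()` keeps the "ß", so two spellings of the same word can still end up as different tokens. The built-in `str.casefold()` (not mentioned above, added here as a suggestion) is more aggressive. A minimal sketch with made-up example words:

```python
words = ['Straße', 'STRASSE', 'Morgen', 'morgen']

lowered = [w.lower() for w in words]       # 'ß' is kept by lower()
folded = [w.casefold() for w in words]     # casefold() maps 'ß' to 'ss'

print(lowered)
print(folded)
```

Neither method solves the capitonym problem: "Morgen" and "morgen" are merged in both cases.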

Removing stopwords

Stopwords are words that are removed before doing further text analysis. Usually these are very common words of a certain language that carry little information.

Stopword list depends on:

  • language
  • your data / research scenario (filter out too common words)
  • later text analysis method, e.g.:
    • tf-idf automatically reduces importance of very common words (as opposed to Bag-of-Words)
    • sentiment analysis: bad idea to have words like "not" in the stopword list!
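
The sentiment analysis caveat can be illustrated with a small sketch. The stopword set below is a hypothetical mini list made up for this example (not the NLTK list), and `remove_stopwords` is a helper name chosen here.

```python
# hypothetical mini stopword list, for illustration only
base_stopwords = {'the', 'is', 'a', 'not', 'this', 'and'}
keep_for_sentiment = {'not'}   # negations carry the sentiment signal

sentiment_stopwords = base_stopwords - keep_for_sentiment

def remove_stopwords(token_list, stopword_set):
    """Drop tokens whose lowercase form is in the stopword set."""
    return [t for t in token_list if t.lower() not in stopword_set]

tokens_en = ['This', 'movie', 'is', 'not', 'good']
print(remove_stopwords(tokens_en, base_stopwords))        # negation is lost
print(remove_stopwords(tokens_en, sentiment_stopwords))   # negation is kept
```

Using a set instead of a list for the stopwords also makes the membership test faster on large corpora.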

NLTK has a list of stopwords for some languages:

In [15]:
print('English:', nltk.corpus.stopwords.words('english')[:5], '...')
print('German:', nltk.corpus.stopwords.words('german')[:5], '...')
English: ['i', 'me', 'my', 'myself', 'we'] ...
German: ['aber', 'alle', 'allem', 'allen', 'aller'] ...
In [16]:
# usage example (will remove "von" tokens):
stopwords = nltk.corpus.stopwords.words('german')
[t for t in tokens['focus'] if t.lower() not in stopwords]

Correct spelling

Depends on your data → especially necessary when working with social media data, surveys, etc.

Available packages for automatic spell correction:

Stemming or Lemmatization

Goal: Reduce inflected words to a common form so that they're counted as one.


Stemming

Remove affixes from a word to get the base form (stem) of a word → the stem might not be a lexicographically correct word

  • books → book
  • booked → book
  • employees → employ
  • argued → argu

NLTK implements several stemming algorithms:

  • PorterStemmer, LancasterStemmer (English only)
  • SnowballStemmer (supports 13 languages)
In [17]:
stemmer = nltk.stem.LancasterStemmer()
In [18]:
stemmer = nltk.stem.SnowballStemmer('german')
print('Bücher →', stemmer.stem("Bücher"))
print('gebuchte →', stemmer.stem("gebuchte"))
print('sahen →', stemmer.stem("sahen"))
Bücher → buch
gebuchte → gebucht
sahen → sah


Lemmatization

Find the lemma (dictionary form) of an inflected word → a lemma is always a lexicographically correct word

Implemented for English in NLTK with WordNetLemmatizer.

In [19]:
lemmatizer = nltk.stem.WordNetLemmatizer()
# lemmatize(): first argument is word, second is Part-of-Speech tag
print('books →', lemmatizer.lemmatize('books', 'n'))    # n stands for noun
print('booked →', lemmatizer.lemmatize('booked', 'v'))  # v stands for verb
print('employees →', lemmatizer.lemmatize('employees', 'n'))
print('argued →', lemmatizer.lemmatize('argued', 'v'))
books → book
booked → book
employees → employee
argued → argue

Lemmatization ...

  • ... requires Part-of-Speech tags (noun, verb, adjective, etc.)
  • ... is hard for certain languages, and there are almost no freely available lemmatizers for languages other than English
    • pattern (partly) supports: de, fr, es, it, nl
    • germalemma achieves 74% to 84% accuracy for German text
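
Because the WordNet lemmatizer expects single-letter POS codes ('n', 'v', 'a', 'r') while English taggers such as `nltk.pos_tag()` return Penn Treebank tags by default, a small mapping is usually needed between the two. The following is a sketch; `penn_to_wordnet` is a made-up helper name, and falling back to noun for all other tags is an assumption.

```python
def penn_to_wordnet(penn_tag, default='n'):
    """Map a Penn Treebank tag to a WordNet POS letter: n(oun), v(erb), a(djective), r (adverb)."""
    if penn_tag.startswith('JJ'):
        return 'a'
    if penn_tag.startswith('VB'):
        return 'v'
    if penn_tag.startswith('NN'):
        return 'n'
    if penn_tag.startswith('RB'):
        return 'r'
    return default   # assumption: treat everything else like a noun

tagged_example = [('dog', 'NN'), ('barked', 'VBD'), ('loudly', 'RB'), ('little', 'JJ'), ('the', 'DT')]
mapped = [(word, penn_to_wordnet(tag)) for word, tag in tagged_example]
print(mapped)
```

The resulting letter can then be passed as the second argument of `lemmatize()`.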


We have:

  • tokenized our corpus
  • expanded compound words

What's still necessary:

  • Part-of-Speech (POS) tagging
  • lemmatization (requires POS tags)
  • convert to lower case
  • remove special characters
  • optionally filter tokens: remove stopwords, filter by POS tag

Text parsing

→ to understand text syntax and structure

  • Part-of-Speech (POS) tagging → annotate words with lexical categories
  • Shallow parsing / chunking → split sentences into phrases (NLTK book ch. 7)

NLTK chunking

![Difference betw. Dependency-based and Constituency-based parsing](img/dep_const_difference.jpg)
  • Dependency-based parsing
  • Constituency-based parsing

POS tagging

  • Goal: assign a lexical category such as noun, verb, adjective, etc. to each word
  • needed for lemmatization
  • optionally needed for filtering (e.g. nouns only)
  • NLTK implements several trained taggers → trained with a large text corpus that is annotated with a certain tagset
  • by default: nltk.pos_tag() for English with Penn Treebank tagset
In [20]:
example = ['The', 'little', 'yellow', 'dog', 'barked', 'loudly', 'at', 'the', 'cat', '.']
nltk.pos_tag(example)    # with default tagset (Penn Treebank)
[('The', 'DT'),
 ('little', 'JJ'),
 ('yellow', 'JJ'),
 ('dog', 'NN'),
 ('barked', 'VBD'),
 ('loudly', 'RB'),
 ('at', 'IN'),
 ('the', 'DT'),
 ('cat', 'NN'),
 ('.', '.')]
In [21]:
nltk.pos_tag(example, tagset='universal')   # with universal tagset
[('The', 'DET'),
 ('little', 'ADJ'),
 ('yellow', 'ADJ'),
 ('dog', 'NOUN'),
 ('barked', 'VERB'),
 ('loudly', 'ADV'),
 ('at', 'ADP'),
 ('the', 'DET'),
 ('cat', 'NOUN'),
 ('.', '.')]

For German?

In [22]:
# load a pre-trained german tagger based on ClassifierBasedGermanTagger by Philipp Nolte
import pickle
with open('pos_tagger_german.pickle', 'rb') as f:
    ger_tagger = pickle.load(f)

ger_tagger.tag(['Der', 'kleine', 'gelbe', 'Hund', '.'])
[('Der', 'ART'),
 ('kleine', 'ADJA'),
 ('gelbe', 'ADJA'),
 ('Hund', 'NN'),
 ('.', '$.')]
In [23]:
# let's tag our corpus!
tagged_tokens = {}
for doc_label, doc_tok in tokens.items():
    tagged_tokens[doc_label] = ger_tagger.tag(doc_tok)

tagged_tokens['spon2']
[('Nach', 'APPR'),
 ('Tod', 'NN'),
 ('von', 'APPR'),
 ('US', 'NE'),
 ('Student', 'NN'),
 ('Warmbier', 'NE'),
 (':', '$.'),
 ('Trump', 'NE'),
 ('beschuldigt', 'VVFIN'),
 ('Nordkorea', 'NE')]

Ready for lemmatization

  • recap:
    • almost no freely available lemmatizers for German
    • partly implemented in pattern → ~74% accuracy with TIGER corpus
    • improved lemmatizer germalemma achieves ~84% accuracy
In [24]:
from germalemma import GermaLemma

lemmatizer = GermaLemma()
lemmatizer.find_lemma('beschuldigt', 'VVFIN')
In [25]:
# let's lemmatize our corpus
tmp_tokens = {}
for doc_label, tok_pos in tagged_tokens.items():
    lemmata_pos = []
    for t, pos in tok_pos:
        try:
            l = lemmatizer.find_lemma(t, pos)
        except ValueError:   # no lemma could be determined -> keep the token as it is
            l = t
        lemmata_pos.append((l, pos))
    tmp_tokens[doc_label] = lemmata_pos

tmp_tokens['spon2']
[('Nach', 'APPR'),
 ('Tod', 'NN'),
 ('von', 'APPR'),
 ('US', 'NE'),
 ('Student', 'NN'),
 ('Warmbier', 'NE'),
 (':', '$.'),
 ('Trump', 'NE'),
 ('beschuldigen', 'VVFIN'),
 ('Nordkorea', 'NE')]
In [26]:
tagged_tokens = tmp_tokens

Final normalization steps

  • transform to lowercase
  • remove special characters
  • remove stopwords
In [27]:
stopwords = nltk.corpus.stopwords.words('german') + ['mehr', 'wegen']  # add more words
del_chars = str.maketrans('', '', string.punctuation + '–')   # add another character "–"

tmp_tokens = {}
for doc_label, tok_pos in tagged_tokens.items():
    tok_pos = [(t.lower(), pos) for t, pos in tok_pos]   # to lowercase
    tok_pos = [(t.translate(del_chars), pos) for t, pos in tok_pos]   # remove special char.
    tok_pos = [(t, pos) for t, pos in tok_pos   # remove empty tokens and stopwords
               if t and t not in stopwords]  
    tmp_tokens[doc_label] = tok_pos

print('Old:', [x[0] for x in tagged_tokens['spon1']])
print('New:', [x[0] for x in tmp_tokens['spon1']])
Old: ['Mehr', 'Zustimmung', 'zur', 'EU', 'auch', 'wegen', 'Trump', '–', 'Danke', 'May', ',', 'danke', 'Trump']
New: ['zustimmung', 'eu', 'trump', 'danke', 'may', 'danke', 'trump']
In [28]:
tagged_tokens = tmp_tokens

Filtering by POS tag

→ filter words by lexical categories, e.g. only nouns:

In [29]:
[(t, pos) for t, pos in tagged_tokens['spon1']
          if pos.startswith('N')]
[('zustimmung', 'NN'),
 ('eu', 'NE'),
 ('trump', 'NN'),
 ('danke', 'NE'),
 ('may', 'NE'),
 ('trump', 'NE')]

Text normalization summary

  • many steps from raw input text to normalized tokens
    1. tokenization
    2. expand compound words
    3. POS tagging
    4. lemmatization
    5. lower-case transformation
    6. removing special characters
    7. removing stopwords
  • each step involves decisions that strongly affect further analyses

Reproducibility is important!

  • document each step
  • provide scripts (with code comments) and data with your publication
In [30]:
from pprint import pprint
pprint(tagged_tokens)
{'focus': [('tod', 'NN'),
           ('us', 'NE'),
           ('student', 'NN'),
           ('warmbier', 'NE'),
           ('trump', 'FM'),
           ('beschuldigen', 'VVPP'),
           ('nordkorea', 'NE'),
           ('regime', 'NN')],
 'spon1': [('zustimmung', 'NN'),
           ('eu', 'NE'),
           ('trump', 'NN'),
           ('danke', 'NE'),
           ('may', 'NE'),
           ('danke', 'PRELS'),
           ('trump', 'NE')],
 'spon2': [('tod', 'NN'),
           ('us', 'NE'),
           ('student', 'NN'),
           ('warmbier', 'NE'),
           ('trump', 'NE'),
           ('beschuldigen', 'VVFIN'),
           ('nordkorea', 'NE')],
 'xyz': [('eu', 'NE'),
         ('bleiben', 'VVFIN'),
         ('eu', 'NE'),
         ('eu', 'NE'),
         ('us', 'NE'),
         ('beziehung', 'NN'),
         ('trump', 'NE'),
         ('angespannt', 'VVPP')]}

NLP packages and tools:

  • NLTK – stable but slow
  • pattern – many language models but some of them only with low accuracy, Python 2.7 only
  • spacy – language models for English and partly for German and French
  • SyntaxNet – many language models but difficult to install, Python 2.7 only
  • Stanford CoreNLP – many language models but requires Java

Feature Extraction

Features are derived values from our complex data. They should measure certain distinctive properties of our data in order to achieve dimensionality reduction. For each observation a feature vector is created (usually with numerical or categorical values) → Vector Space Model.

Example: A feature vector consisting of three features:

  1. Token length
  2. Number of vowels
  3. Number of consonants
In [31]:
observations = [
    'welcome', 'bienvenue', 'willkommen', 'privetstvie',
]
vowels = list('AEIOUaeiou')
features = []
for obs in observations:
    n_tok = len(obs)
    n_vow = sum([c in vowels for c in obs])
    features.append((n_tok, n_vow, n_tok - n_vow))
list(zip(observations, features))
[('welcome', (7, 3, 4)),
 ('bienvenue', (9, 5, 4)),
 ('willkommen', (10, 3, 7)),
 ('privetstvie', (11, 4, 7))]

For linguists, these features might already be interesting. Using machine learning, it might be possible to detect which language a word comes from.

Choosing the right properties for your features greatly depends on what you want to analyse and which methods you want to use → a discipline of its own: "Feature engineering"

We will concentrate on Term vector models:

  • in a corpus $C$ we have a set of $n$ documents[1] $D_1, D_2, \dots, D_n$ containing terms[2] $t$
  • all unique terms in $C$ make up the vocabulary
  • each document $D_j$ is represented by a feature vector $d_j$ of length $m = N_{vocabulary}$
  • a feature vector contains the weights $w_i$ of the $i$th term of the vocabulary in that document

$M = (d_1, d_2, \dots, d_n)$ with $d = (w_1, w_2, \dots, w_m)$

[1]: Documents are the things you compare. They can be paragraphs, sentences, tweets, articles, etc.
[2]: A.k.a. tokens or words in this context.


  • completely based on term weights → weights might denote "importance" of terms in a given document
  • term weights usually derived from term frequency
  • generally does not take into account: word order, grammar/syntactic structure → information about how words relate to each other in a document is lost
  • useful for:
    • Text classification (Spam/Not Spam, categories, language)
    • Summarization / topic discovery
    • Text similarity / clustering
  • not useful for:
    • Semantic and Sentiment Analysis

Bag-of-Words (BoW) model

  • simple but powerful model
  • features are absolute term counts
  • basis for:
    • Topic Modeling with Latent Dirichlet Allocation (LDA) via Gibbs sampling
    • Text classification with Naive Bayes, Support Vector Machines
    • Document similarity
    • Document clustering
    • ...


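\begin{equation*} \begin{aligned} C &= \{D_1, D_2, D_3\} \\ D_1 &= \{\text{simple}, \text{yet}, \text{beautiful}, \text{example}\} \\ D_2 &= \{\text{beautiful}, \text{beautiful}, \text{flowers}\} \\ D_3 &= \{\text{example}, \text{after}, \text{example}\} \end{aligned} \end{equation*}
\begin{equation*} vocab = \{\text{simple}, \text{yet}, \text{beautiful}, \text{example}, \text{flowers}, \text{after}\} \end{equation*}

| document | simple | yet | beautiful | example | flowers | after |
|----------|--------|-----|-----------|---------|---------|-------|
| $D_1$    | 1      | 1   | 1         | 1       | 0       | 0     |
| $D_2$    | 0      | 0   | 2         | 0       | 1       | 0     |
| $D_3$    | 0      | 0   | 0         | 2       | 0       | 1     |

\begin{equation*} M=\begin{pmatrix} 1 & 1 & 1 & 1 & 0 & 0 \\ 0 & 0 & 2 & 0 & 1 & 0 \\ 0 & 0 & 0 & 2 & 0 & 1 \\ \end{pmatrix} \end{equation*}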
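
The small example above can be reproduced in a few lines of plain Python. This is only a sketch; the names `toy_docs`, `toy_vocab` and `toy_bow` are chosen for this illustration, and the vocabulary is kept in order of first appearance so that the columns match the example.

```python
from collections import Counter

toy_docs = {
    'D1': ['simple', 'yet', 'beautiful', 'example'],
    'D2': ['beautiful', 'beautiful', 'flowers'],
    'D3': ['example', 'after', 'example'],
}

# vocabulary in order of first appearance
toy_vocab = []
for doc_tokens in toy_docs.values():
    for term in doc_tokens:
        if term not in toy_vocab:
            toy_vocab.append(term)

# one row of absolute term counts per document
toy_bow = []
for doc_tokens in toy_docs.values():
    term_counts = Counter(doc_tokens)   # a missing term yields a count of 0
    toy_bow.append([term_counts[term] for term in toy_vocab])

print(toy_vocab)
for row in toy_bow:
    print(row)
```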

Example implementation

  • we use a Counter to count the term frequencies in our corpus
In [32]:
from collections import Counter

example_data = ['a', 'b', 'c', 'b', 'b', 'a']
example_counter = Counter(example_data)
example_counter
Counter({'a': 2, 'b': 3, 'c': 1})
In [33]:
example_counter.update(['c', 'a', 'a', 'a'])   # add more observations to the existing counts
example_counter
Counter({'a': 5, 'b': 3, 'c': 2})

Our normalized tokens with their POS tags are still in the variable tagged_tokens:

In [34]:
pprint(tagged_tokens)
{'focus': [('tod', 'NN'),
           ('us', 'NE'),
           ('student', 'NN'),
           ('warmbier', 'NE'),
           ('trump', 'FM'),
           ('beschuldigen', 'VVPP'),
           ('nordkorea', 'NE'),
           ('regime', 'NN')],
 'spon1': [('zustimmung', 'NN'),
           ('eu', 'NE'),
           ('trump', 'NN'),
           ('danke', 'NE'),
           ('may', 'NE'),
           ('danke', 'PRELS'),
           ('trump', 'NE')],
 'spon2': [('tod', 'NN'),
           ('us', 'NE'),
           ('student', 'NN'),
           ('warmbier', 'NE'),
           ('trump', 'NE'),
           ('beschuldigen', 'VVFIN'),
           ('nordkorea', 'NE')],
 'xyz': [('eu', 'NE'),
         ('bleiben', 'VVFIN'),
         ('eu', 'NE'),
         ('eu', 'NE'),
         ('us', 'NE'),
         ('beziehung', 'NN'),
         ('trump', 'NE'),
         ('angespannt', 'VVPP')]}
In [35]:
documents = {doc_label: [t for t, _ in tok_pos]   # dismiss the POS tag
             for doc_label, tok_pos in tagged_tokens.items()}

1. Count the tokens for each document:

In [36]:
counts = {doc_label: Counter(tok) for doc_label, tok in documents.items()}
print('tokens:', documents['spon1'])
print('counts:', list(counts['spon1'].items()))
tokens: ['zustimmung', 'eu', 'trump', 'danke', 'may', 'danke', 'trump']
counts: [('may', 1), ('trump', 2), ('zustimmung', 1), ('eu', 1), ('danke', 2)]

2. extract the vocabulary (set of unique terms in all documents):

In [37]:
vocab = set()
for counter in counts.values():
    vocab |= set(counter.keys())   # set union of unique tokens per document

vocab = sorted(list(vocab))  # sorting here only for better display later
vocab   # => becomes columns of BoW matrix

3. Create the BoW matrix:

In [38]:
# create Bag of Words matrix: rows are documents, columns are vocabulary words (unique tokens)
bow = []
for counter in counts.values():  # iterate through each document counter instance
    # make a list that contains the term count of each term in this document
    # if a term of the vocab. does not exist in this document, set it to 0 (default value of .get())
    bow_row = [counter.get(term, 0) for term in vocab]
    bow.append(bow_row)
bow
[[0, 0, 0, 0, 2, 1, 1, 0, 0, 0, 0, 2, 0, 0, 1],
 [1, 0, 1, 1, 0, 3, 0, 0, 0, 0, 0, 1, 1, 0, 0],
 [0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0],
 [0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0]]
In [39]:
doc_labels = list(counts.keys())   # => becomes rows of BoW matrix
doc_labels
['spon1', 'xyz', 'focus', 'spon2']
In [40]:
from utils import plot_heatmap

# show a heatmap of the BoW model
print('spon1:', documents['spon1'])
plot_heatmap(bow, xticklabels=vocab, yticklabels=doc_labels, title='BoW', save_to='img/bow.png')
spon1: ['zustimmung', 'eu', 'trump', 'danke', 'may', 'danke', 'trump']
[Figure: heatmap "BoW" (rows: documents, columns: vocabulary terms)]


BoW can be used in conjunction with n-grams.

An n-gram is a contiguous sequence of n items from a given sequence of text or speech.


In [41]:
# 1-grams (unigrams):
documents['spon1']
['zustimmung', 'eu', 'trump', 'danke', 'may', 'danke', 'trump']
In [42]:
from utils import create_ngrams

# 2-grams (bigrams):
print(create_ngrams(documents['spon1'], n=2))
['zustimmung eu', 'eu trump', 'trump danke', 'danke may', 'may danke', 'danke trump']
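
The `create_ngrams` helper is imported from the workshop's `utils` module and is not shown here. A possible stand-in, written only from the definition of an n-gram and the output above, could look like this sketch (`make_ngrams` is a made-up name, not the author's implementation):

```python
def make_ngrams(token_list, n=2, sep=' '):
    """Return all contiguous windows of n tokens, each joined into one string."""
    return [sep.join(token_list[i:i + n]) for i in range(len(token_list) - n + 1)]

unigram_tokens = ['zustimmung', 'eu', 'trump', 'danke', 'may', 'danke', 'trump']
print(make_ngrams(unigram_tokens, n=2))
print(make_ngrams(unigram_tokens, n=3)[:2])
```

A document with fewer than n tokens yields an empty list, so very short documents contribute nothing to an n-gram BoW.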

Bigrams of our tokens:

In [43]:
documents_bigrams = {doc_label: create_ngrams(doc_tok, n=2)
                     for doc_label, doc_tok in documents.items()}
documents_bigrams['spon1']
['zustimmung eu',
 'eu trump',
 'trump danke',
 'danke may',
 'may danke',
 'danke trump']
In [44]:
from utils import create_bow

bow_bi, doc_labels_bi, vocab_bi = create_bow(documents_bigrams)
plot_heatmap(bow_bi, xticklabels=vocab_bi, yticklabels=doc_labels_bi, title='Bigram BoW');
[Figure: heatmap "Bigram BoW" (rows: documents, columns: bigrams)]


tf-idf model

Problem with BoW: Words that occur often in many documents overshadow more specific (potentially more interesting) words. Stopword lists can reduce this effect, but maintaining them is manual effort.

tf-idf tries to decrease the weight of words that occur across many documents → lower the weight of common words.

\begin{equation*} tfidf_C(t, D) = tf(t, D) \cdot idf_C(t) \end{equation*}
  • $tf$ .. term frequency – related to BoW (raw count or proportion for $t$ in $D$)
  • $idf$ .. inverse document frequency – measures how common a word $t$ is across all documents in corpus $C$

term frequency

Better to use relative frequencies than absolute counts: We calculate the term count proportions $tf(t, D) = \frac{N_{t,D}}{|D|}$ for a term $t$ in a document $D$, where $N_{t,D}$ is the number of occurrences of $t$ in $D$ and $|D|$ is the number of tokens in $D$. This prevents documents with many words from getting higher weights than those with few words.

We need to convert our BoW to a NumPy matrix type for easier calculation.

In [45]:
import numpy as np
raw_counts = np.asmatrix(bow, dtype=float)    # raw counts converted to NumPy matrix (np.mat is no longer available in NumPy 2.0)
tf = raw_counts / np.sum(raw_counts, axis=1)  # divide by row-wise sums (document lengths) -> proportions
plot_heatmap(tf, xticklabels=vocab, yticklabels=doc_labels, title='tf / BoW', legend=True, save_to='img/tf.png');
[Figure: heatmap "tf / BoW" (rows: documents, columns: vocabulary terms)]

idf – inverse document frequency

Different weighting schemes available. We use this one:

\begin{equation} idf_C(t) = \log \left( 1+\frac{n}{1+|\{D \in C : t \in D\}|} \right) \end{equation}
  • $t$ .. a term (a.k.a. token or word)
  • $n$ .. number of documents in corpus $C$
  • $|\{D \in C : t \in D\}|$ .. number of documents in which $t$ appears
  • plus 1 in the denominator to avoid division by zero for unknown words, plus 1 inside the logarithm to avoid negative numbers

In [46]:
def num_term_in_docs(t, docs):
    return sum(t in d for d in docs.values())

num_term_in_docs('eu', documents)
In [47]:
from math import log

# define a function that calculates the inverse document frequency
def idf(t, docs):
    return log(1 + len(docs) / (1+num_term_in_docs(t, docs)))

idf('eu', documents)
In [48]:
idf_row = [idf(t, documents) for t in vocab]
In [49]:
plot_heatmap(np.asmatrix(idf_row), vocab, title="idf(t)", ylabel=None, legend=True);
[Figure: heatmap "idf(t)" (one value per vocabulary term)]

Create tfidf matrix by converting idf_row to a diagonal matrix and multiplying tf with it:

In [50]:
idf_mat = np.asmatrix(np.diag(idf_row))   # diagonal matrix with the idf values on the diagonal
tfidf = tf * idf_mat
In [51]:
plot_heatmap(bow, xticklabels=vocab, yticklabels=doc_labels, title='BoW', legend=True, save_to='img/bow.png')
plot_heatmap(tfidf, xticklabels=vocab, yticklabels=doc_labels, title='tf-idf', legend=True, save_to='img/tfidf.png');
[Figures: heatmaps "BoW" and "tf-idf" (rows: documents, columns: vocabulary terms)]



Values in tf-idf matrix are dependent on term frequency (tf) and the inverse document frequency (idf_mat). Tokens that occur in many documents (low idf value) get lower individual tf-idf values.



\begin{align} tf(\text{danke}, \text{spon1}) & = 2/7 = 0.2857 \\ idf(\text{danke}) & = \log(1+4/2) = 1.0986 \\ tfidf(\text{danke}, \text{spon1}) & = 0.2857 \cdot 1.0986 = 0.3139 \end{align}



\begin{align} tf(\text{trump}, \text{spon1}) & = 2/7 = 0.2857 \\ idf(\text{trump}) & = \log(1+4/5) = 0.5878 \\ tfidf(\text{trump}, \text{spon1}) & = 0.2857 \cdot 0.5878 = 0.1679 \end{align}
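
The two worked calculations above can be checked with plain Python, without NumPy. This is a sketch: the token lists are copied from the normalized corpus shown earlier, and the helper names `tf_value`, `idf_value` and `tfidf_value` are chosen for this illustration.

```python
from math import log

check_docs = {
    'spon1': ['zustimmung', 'eu', 'trump', 'danke', 'may', 'danke', 'trump'],
    'spon2': ['tod', 'us', 'student', 'warmbier', 'trump', 'beschuldigen', 'nordkorea'],
    'focus': ['tod', 'us', 'student', 'warmbier', 'trump', 'beschuldigen', 'nordkorea', 'regime'],
    'xyz': ['eu', 'bleiben', 'eu', 'eu', 'us', 'beziehung', 'trump', 'angespannt'],
}

def tf_value(term, doc_tokens):
    """Proportion of the document's tokens that are `term`."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf_value(term, docs):
    """log(1 + n / (1 + number of documents containing `term`))"""
    n_docs_with_term = sum(term in doc_tokens for doc_tokens in docs.values())
    return log(1 + len(docs) / (1 + n_docs_with_term))

def tfidf_value(term, doc_label, docs):
    return tf_value(term, docs[doc_label]) * idf_value(term, docs)

for term in ('danke', 'trump'):
    print(term, round(tfidf_value(term, 'spon1', check_docs), 4))
```

Both terms have the same term frequency in spon1, but "trump" occurs in all four documents and therefore ends up with roughly half the weight of "danke".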

tf-idf in practice

tf-idf can be used as feature matrix for:

  • Topic Modeling (Latent Semantic Indexing (LSI), Non-negative Matrix Factorization (NNMF), Latent Dirichlet Allocation [1])
  • Document similarity
  • Document clustering

[1]: Depends on implementation – Gibbs sampling based LDA does not work with continuous values, but other implementations (like in [gensim](http://radimrehurek.com/gensim/) based on the [online Variational Bayes](http://papers.nips.cc/paper/3902-online-learning-for-latent-dirichlet-allocation.pdf) approach) seem to [work](https://groups.google.com/forum/#!topic/gensim/OESG1jcaXaQ)
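
As an illustration of the document similarity use case, the following sketch compares feature vectors with cosine similarity. Cosine similarity is a common choice but an assumption here, since no specific measure is named above. For round numbers the example uses the BoW rows of three documents from the matrix shown earlier; the same function works on tf-idf rows.

```python
from math import sqrt

def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors (0.0 if one of them is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# BoW rows copied from the matrix above (columns: sorted vocabulary)
bow_rows = {
    'spon1': [0, 0, 0, 0, 2, 1, 1, 0, 0, 0, 0, 2, 0, 0, 1],
    'focus': [0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0],
    'spon2': [0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0],
}

print('focus vs. spon2:', round(cosine_similarity(bow_rows['focus'], bow_rows['spon2']), 3))
print('focus vs. spon1:', round(cosine_similarity(bow_rows['focus'], bow_rows['spon1']), 3))
```

The two headlines about the same event (focus and spon2) come out as far more similar to each other than either is to spon1, which only shares the term "trump" with them.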