Topbox - Wrapping Stanford's Topic Modelling Toolbox for Classification Purposes
For the line of research I started with my master's thesis, I made very frequent use of the Stanford Topic Modeling Toolbox (STMT). It is a very nifty module that provides an interface to the Topic Models by David Blei: Latent Dirichlet Allocation and its supervised variant. The toolbox was intended to make working with these models more accessible to researchers in the humanities and social sciences. It is therefore a standalone program that takes `.csv` files structured in the correct format as input, and `.scala` files for setting up the train-test routine. These can then be piped to the `.java` bundle, which in turn outputs a directory with a trained model and several `.csv` files containing its results. Pretty good stuff, provided that you do not want to use it for classification. The funny thing is that, to date, this toolbox is the only standalone implementation of Labelled-LDA, yet it does not offer an intuitive interface for extracting results from a tested model. I figured that if one requires this, one might as well skip the file hassle altogether and wrap it all in a nice script.
Uh… what?
Topic Modelling (or Modeling in US spelling) is a Machine Learning and Natural Language Processing task. Given a big set of texts (for example Wikipedia articles), a Topic Model tries to automatically find the topics in these articles. Say that we give it articles on mathematics, music, and sports - we would hope that it at least captures these three topics. It might even be able to get a more fine-grained set of topics such as algebra, geometry, concerts, instruments, football, and rugby. So in an unsupervised setting, we might tell the model that we think we have about 20 topics in our entire set. It will then figure out what words characterize these 20 individual topics, how probable it is that these words occur under this topic, and how probable the topic occurrence is in the given set. A Topic Model thus sees a topic as a distribution over words, like so:
topic 12, p = 0.418502
| Word   | p     |
|--------|-------|
| soccer | 0.300 |
| rugby  | 0.300 |
| goal   | 0.200 |
| foul   | 0.100 |
| score  | 0.100 |
Why is there ‘topic 12’ above a word distribution that we clearly can relate to sports? Well, we trained the model unsupervised; we didn’t tell it anything, just to extract 20 topics. It doesn’t know squat about what we see as topics and what topic names belong to which texts - it just allocates an arbitrary number. So what does it do then?
The intuition is that the model tries to find frequently co-occurring words, as these are argued to be likely to belong to the same topic. It also tries to find words that are unique to a certain topic; words such as ‘it’ and ‘will’ are therefore likely to be deemed uninteresting for any of the topics. Topic Models are seen as generative models; by capturing how documents on certain topics could be written, one might combine topic distributions on TV and Sports to generate a text talking about a sports commentary show. Well, in theory, that is - they don’t really write documents like a human would.
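To make that generative view a bit more concrete, here is a toy sketch (Python 3, not part of STMT or topbox, with made-up distributions and weights) that ‘writes’ a pseudo-document by repeatedly picking a topic and then sampling a word from that topic’s word distribution:
import random

# purely illustrative topic-word distributions
topics = {'sports': {'soccer': 0.3, 'rugby': 0.3, 'goal': 0.2, 'foul': 0.1, 'score': 0.1},
          'tv':     {'show': 0.4, 'channel': 0.3, 'host': 0.2, 'camera': 0.1}}
# document-level topic proportions for our imaginary sports commentary show
mixture = {'sports': 0.5, 'tv': 0.5}

def generate(n_words):
    words = []
    for _ in range(n_words):
        # pick a topic according to the document's topic proportions,
        # then pick a word according to that topic's word distribution
        topic = random.choices(list(topics), weights=[mixture[t] for t in topics])[0]
        dist = topics[topic]
        words.append(random.choices(list(dist), weights=list(dist.values()))[0])
    return ' '.join(words)

print(generate(10))  # e.g. 'goal show rugby channel soccer ...'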
Topic Identification
When a Topic Model is trained supervised, the model has knowledge of what the documents you feed it are about. Typically, the input would have a (label, text) pair, like so:
sports, science sEMG signal is a one dimensional time series signal of
neuromuscular system recorded from skin surface...
Now the model knows to fit the words given in our document under these two topic labels. What this implies is that when it is presented with new data, it can actually give a list with probabilities per topic. So say we already have this trained model lying around, we would see a result such as:
1: sports, 0.381
2: science, 0.212
3: health, 0.196
4: ...
The great thing about this is that one can evaluate the performance of a model with common ranking measures such as Mean Average Precision: provide the ranked set of labels, compare it with the gold standard label(s) you already have, and it returns a score. In our case we did pretty well and would score $100\%$ on this particular instance (we even got the order right). So we know how well our model is doing - and given that it performs nicely, we can use it to classify new documents with one of the labels from our set.
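To illustrate, here is a minimal sketch of scoring the ranked example above with scikit-learn’s average_precision_score (the same function we will use further down); the label names and scores are simply copied from the toy ranking, and ‘health’ stands in for a training label that is not in the gold standard:
from sklearn.metrics import average_precision_score

labels = ['sports', 'science', 'health']        # labels as ranked by the model
y_score = [0.381, 0.212, 0.196]                 # their predicted probabilities
gold = {'sports', 'science'}                    # gold standard label(s) for this document
y_true = [1 if label in gold else 0 for label in labels]

# both gold labels are ranked on top, so average precision is 1.0
print(average_precision_score(y_true, y_score))
Averaged over all test documents, this per-document score gives the Mean Average Precision used later in this post.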
Note: If you want a more in-depth introduction on these models, Chapter 3 in my master thesis (p. 17-33) explains it in more detail (bit of shameless self-promotion there).
topbox
The process described above can all be done, or is at least made easier, with `topbox`: a Python (2 & 3) wrapper around STMT. So let’s dig into its functionality; we already know what needs doing. We need a nice small train and test set and we should be good to go. As such:
train = [['sports football', 'about football, soccer, with a goal and a ball'],
         ['sports rugby', 'some text where we do a scrum and kick the ball'],
         ['music concerts', 'a venue with loud music and a stage'],
         ['music instruments', 'things that have strings or keys, or whatever']]

test = [['music', 'the stage was full of string things'],
        ['sports', 'we kick a ball around'],
        ['rugby', 'some confusing sentence with novel words what is happening']]
With the example above things get a bit convoluted, because we actually need one list with documents and a separate list with the labels. We can do that in a minute; let’s first import and call the `topbox` environment:
import topbox
stmt = topbox.STMT('test_model')
The module is going to store whatever model we train in /directory/to/topbox/box/, under the name ‘test_model’. If you want to train it now and call it later, you can, but please be mindful that keeping the model stored in the box consumes disk space. For now, however, we called `STMT` completely parameter-less. If we look at the documentation, we get an idea of what the options are:
STMT.Parameters
name : string
The name that will be appended to all the saved files. If you want to
keep the trained model, this name can be used to load it back in.
epochs : integer, optional, default 20
The number of iterations you want L-LDA to train and sample; if you
run into some errors, it's a good idea to set this to 1 to save time
whilst debugging.
mem : integer, optional, default 7000
The amount of memory (in MB) that the model will use. By default it
assumes that you have 8G of memory, leaving roughly 1G for the OS.
Should be comfortable; adjust if you run into OutOfMemory errors.
keep : boolean, optional, default True
If set to False, will remove the data and scala files after training,
and will remove EVERYTHING after the results are obtained. This can
be handy to save disk space when running a quick topic model. If
you're running a big model and want to keep it after your session is
done, it might be better to just leave it set to True.
So if we want to allocate more memory, and do for example 400 iterations, we call:
stmt = topbox.STMT('test_model', epochs=400, mem=14000)
Now we need to do a quick unzip for both of our sets; let’s split them up into labels and text (space) respectively:
train_labels, train_space = zip(*train)
test_labels, test_space = zip(*test)
Of course, if you already have these separately, there’s no need to go through that hassle. From here it’s pretty straightforward; we just train and test:
stmt.train(train_space, train_labels)
stmt.test(test_space, test_labels)
If all is well, you should see your terminal call `java` and run the training iterations, as well as read in the items for testing. Afterwards, we can have `topbox` retrieve the results. Please note that we can either return the whole thing as a list (which might yield empty rows if, for example, a topic label to classify is not in your training set), which keeps the dependencies to the Python standard library. Alternatively, you can have `topbox` convert the whole thing to an array, which requires both `numpy` and `scipy`. Call the results and dump these into `y_true` and `y_score`, providing the correct reference labels. As such:
# list
y_true, y_score = stmt.results(test_labels)
# array (sklearn ready)
y_true, y_score = stmt.results(test_labels, array=True)
Given that we go for the latter option, we can immediately evaluate the results with some evaluation metric of choice from `sklearn`:
In [1]: from sklearn.metrics import average_precision_score
   ...: average_precision_score(y_true, y_score)
Out[1]: 0.86888190259464704
Afterwards, if we do not want to use the model anymore, we can simply get rid of it by calling:
stmt.cleanup()
Forgot the name, or want to get rid of all the models you trained?
stmt.cleanup(all=True)
Applying a trained topbox model
To recap, here’s an example that quickly trains a model and tests if it works on the same data (please never do this other than for debugging purposes, see below).
import csv
%cd ~/Documents/data
csv_reader = csv.reader(open('train_data.csv'))
# the relevant label and text cells are at indices 5 and 7
dat = [(x[5].lower(), x[7].lower()) for x in csv_reader]
%cd ~/Documents/
import topbox
stmt = topbox.STMT('bit_of_testing', epochs=10, mem=15000)
train_labels, train_space = zip(*dat)
stmt.train(train_space, train_labels)
stmt.test(train_space, train_labels)
import numpy as np
from sklearn.metrics import average_precision_score
y_true, y_score = stmt.results(train_labels, array=True)
print(average_precision_score(y_true, y_score))
Now the model we trained is actually very much overfitted; we can’t assess how it generalizes to new data because we give it the exact data that it already knows. Let’s try to train it in a tenfold cross-validation setting, like so:
import numpy as np
from sklearn.metrics import average_precision_score

k = 10
n = int(len(dat) / k)
ap_tot = []

for i in range(0, len(dat), n):
    stmt = topbox.STMT('testing_cf_' + str(i), epochs=400, mem=15000)
    # split lists: hold out the i-th fold for testing, train on the rest
    train = dat[:]
    test = train[i:i+n]
    train[i:i+n] = []
    # train / test
    train_labels, train_space = zip(*train)
    stmt.train(train_space, train_labels)
    test_labels, test_space = zip(*test)
    stmt.test(test_space, test_labels)
    # get scores
    y_true_k, y_score_k = stmt.results(test_labels, array=True)
    ap_tot.append(average_precision_score(y_true_k, y_score_k))

print(np.mean(ap_tot))
Now we actually get the average performance on new data. And we’re done! I will try to upload the package to GitHub as soon as possible, improve the documentation, and provide a toy dataset to test the stuff with. An update will follow.