Code, notebooks, and data that I have made openly available. A quick overview:

A small framework for reproducible Text Mining research that largely builds on top of scikit-learn. Its goal is to make common research procedures fully automated, optimized, and well recorded. To this end, it features:

End-to-end classification in two minutes:

from omesa.experiment import Experiment
from omesa.featurizer import Ngrams

conf = {
    "gram_experiment": {
        "name": "gram_experiment",
        "train_data": ["./n_gram.csv"],
        "has_header": True,
        "features": [Ngrams(level='char', n_list=[3])],
        "text_column": 1,
        "label_column": 0,
        "folds": 10,
        "save": ("log")
    }
}

for experiment, configuration in conf.items():
    Experiment(configuration)

---- Omesa ----


        feature:   char_ngram
        n_list:    [3]

    name: gram_experiment
    seed: 111

 Sparse train shape: (20, 1287)

 Tf-CV Result: 0.8


This piece of code can be used to convert NumPy-style Python docstrings (example), such as those used in scikit-learn, to Markdown with minimal dependencies. This way, only the documentation contained in the code needs to be edited, and your documentation on, for example, readthedocs can be updated automatically afterwards with some configuration.

Simply type:

python3 /dir/to/ /dir/to/
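The core trick can be sketched in a few lines of plain Python. Note that the function below is an illustrative assumption about how such a converter might work, not the tool's actual implementation: it only rewrites the dashed section underlines that NumPy-style docstrings use into Markdown headers.

```python
def numpy_section_to_markdown(docstring):
    """Rewrite NumPy-style section headers (a word underlined with
    dashes) as Markdown headers. Illustrative sketch only."""
    lines = docstring.splitlines()
    out = []
    i = 0
    while i < len(lines):
        line = lines[i]
        nxt = lines[i + 1] if i + 1 < len(lines) else ""
        # A section header looks like "Parameters" followed by "----------".
        if nxt.strip() and set(nxt.strip()) == {"-"}:
            out.append("### " + line.strip())
            i += 2  # skip the dashed underline
        else:
            out.append(line)
            i += 1
    return "\n".join(out)

doc = """Parameters
----------
n_list : list of int
    N-gram orders to extract.
"""
print(numpy_section_to_markdown(doc))
```

A real converter also needs to handle parameter entries, examples, and cross-references, but the section-header rewrite above is the structural core.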


A small Python 3 wrapper around the Stanford Topic Modeling Toolbox (STMT) that makes working with L-LDA a bit easier; no need to leave the Python environment. More information on its workings can be found on my blog. Code sample:

import topbox
from sklearn.metrics import average_precision_score

stmt = topbox.STMT('bit_of_testing', epochs=10, mem=15000)

X = ['text text more text', 'things to do with text']
y = ['label1 label2', 'label1 label3']

stmt.train(X, y)

infer = ['this is a text', 'things with more text']
gs = ['label1 label2', 'label1 label3']

stmt.test(infer, gs)

y_true, y_score = stmt.results(gs, array=True)
print(average_precision_score(y_true, y_score))


  in development

Minimal working version of a front-end to ec2latex. Currently, it demonstrates how conference attendees can submit their abstracts (which may include LaTeX code) via a form, after which the submission is added to the database. From the front page, the book of abstracts can be compiled as a demonstration. The idea is that this functionality can be embedded in your conference website. Requires bottle, cork, beaker, and blitzdb.
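The actual app builds on bottle, cork, beaker, and blitzdb, but the submission flow can be sketched with nothing beyond the standard library. Everything below is a hypothetical stand-in: a bare WSGI handler and an in-memory list in place of the real routes and database.

```python
import io
from urllib.parse import parse_qs

# In-memory stand-in for the database used by the real app.
SUBMISSIONS = []

def submit_abstract(environ, start_response):
    """Minimal WSGI handler: accept a form POST with an author and an
    abstract (which may contain LaTeX), store it, and acknowledge."""
    size = int(environ.get("CONTENT_LENGTH") or 0)
    form = parse_qs(environ["wsgi.input"].read(size).decode("utf-8"))
    entry = {
        "author": form.get("author", [""])[0],
        "abstract": form.get("abstract", [""])[0],  # raw LaTeX kept as-is
    }
    SUBMISSIONS.append(entry)
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"submission stored"]

# Simulate a form POST without running a server.
body = b"author=A.+Author&abstract=We+prove+$P%3DNP$."
env = {"CONTENT_LENGTH": str(len(body)), "wsgi.input": io.BytesIO(body)}
resp = submit_abstract(env, lambda status, headers: None)
```

Storing the LaTeX source verbatim, as above, is what makes the later "compile the book of abstracts" step possible: the stored strings can be dropped straight into a .tex template.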



Python tool for converting XML-based conference submissions (such as EasyChair exports) to a full LaTeX book of abstracts. A sample of the end result can be found here. After some manual work on the .tex files (described in the GitHub README), it can simply be called with:


This code has been integrated into ebacs.
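The conversion step can be sketched as follows. The XML layout and the LaTeX template here are illustrative assumptions rather than the actual EasyChair schema or the tool's real templates:

```python
import xml.etree.ElementTree as ET

def submissions_to_latex(xml_string):
    """Turn a hypothetical EasyChair-style XML export into LaTeX
    entries for a book of abstracts."""
    root = ET.fromstring(xml_string)
    entries = []
    for sub in root.iter("submission"):
        title = sub.findtext("title", default="")
        authors = ", ".join(a.text for a in sub.iter("author"))
        abstract = sub.findtext("abstract", default="")
        # One unnumbered section per submission; abstracts may contain LaTeX.
        entries.append(
            "\\section*{%s}\n\\textit{%s}\n\n%s\n" % (title, authors, abstract)
        )
    return "\n".join(entries)

xml = """<submissions>
  <submission>
    <title>On N-grams</title>
    <authors><author>A. Author</author></authors>
    <abstract>We study $n$-gram features.</abstract>
  </submission>
</submissions>"""
print(submissions_to_latex(xml))
```

Since the abstracts are stored as raw LaTeX, they can be concatenated into a single .tex body and compiled, which is exactly the hand-off point where this tool plugs into ebacs.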