# Resources

Code, notebooks, and data I made openly available. Quick overview:

### Code

• Omesa - small framework for reproducible Text Mining research

• topbox - wrapper for Labelled Latent Dirichlet Allocation (L-LDA)

• markdoc - convert NumPy-styled Python docstrings to Markdown

• ebacs - minimalistic conference manager

• ec2latex - XML to LaTeX book of abstracts

## Omesa

A small framework for reproducible Text Mining research that largely builds on top of scikit-learn. Its goal is to make common research procedures fully automated, optimized, and well recorded. To this end it features:

• Exhaustive search over feature combinations and pipeline options, through to classifier optimization.

• Flexible wrappers to plug in your tools and features of choice.

• Completely sparse pipeline through hashing - from data to feature space.

• Record of all settings and fitted parts of the entire experiment, promoting reproducibility.

• Dump an easily deployable version of the final model for plug-and-play demos.
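The "completely sparse through hashing" point can be illustrated with a minimal character n-gram hashing vectorizer in plain Python. This is a sketch of the hashing trick itself, not Omesa's actual implementation; the function name and bucket size are illustrative:

```python
from collections import Counter

def hash_ngrams(text, n=3, n_buckets=2**10):
    """Map character n-grams to a fixed-size sparse vector via hashing.

    With the hashing trick no vocabulary is ever stored, so the
    pipeline stays sparse all the way from raw text to feature space.
    """
    grams = (text[i:i + n] for i in range(len(text) - n + 1))
    counts = Counter(hash(g) % n_buckets for g in grams)
    return dict(counts)  # {bucket_index: count}, sparse by construction

vec = hash_ngrams("text mining", n=3)
print(sum(vec.values()))  # one count per 3-gram, i.e. len(text) - 2
```

Because the bucket index is computed on the fly, two documents can be vectorized independently and still land in the same feature space.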

End-to-End classification in 2 minutes:

from omesa.experiment import Experiment
from omesa.featurizer import Ngrams

conf = {
    "gram_experiment": {
        "name": "gram_experiment",
        "train_data": ["./n_gram.csv"],
        "features": [Ngrams(level='char', n_list=[3])],
        "text_column": 1,
        "label_column": 0,
        "folds": 10,
        "save": ("log")
    }
}

for experiment, configuration in conf.items():
    Experiment(configuration)


Output:

---- Omesa ----

Config:

feature:   char_ngram
n_list:    [3]

name: gram_experiment
seed: 111

Sparse train shape: (20, 1287)

Tf-CV Result: 0.8


## markdoc

This piece of code converts NumPy-styled Python docstrings (example), such as those used in scikit-learn, to Markdown with minimal dependencies. This way, only the documentation contained in the code needs to be edited, and your documentation on, for example, readthedocs can be updated automatically afterwards with some configuration.

Simply type:

python3 markdoc.py /dir/to/somefile.py /dir/to/somedoc.md
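The core idea can be shown with a toy converter for the Parameters section of a NumPy-style docstring. This is a minimal illustration of the technique, not markdoc's actual code; the function and sample docstring are made up for the example:

```python
NUMPY_DOC = """Do something.

Parameters
----------
x : int
    The input value.
name : str
    A label for the result.
"""

def params_to_markdown(doc):
    """Turn the Parameters section of a NumPy-style docstring into a
    Markdown bullet list (a toy sketch, not markdoc itself)."""
    lines = doc.splitlines()
    out, in_params = [], False
    for i, line in enumerate(lines):
        # A dashed underline right after "Parameters" opens the section.
        if set(line.strip()) == {"-"} and i and lines[i - 1].strip() == "Parameters":
            in_params = True
        elif in_params and " : " in line:      # "name : type" header line
            name, typ = line.strip().split(" : ", 1)
            out.append("- `%s` (%s)" % (name, typ))
        elif in_params and line.startswith("    ") and out:
            out[-1] += ": " + line.strip()     # indented description line
    return "\n".join(out)

print(params_to_markdown(NUMPY_DOC))
```

The real tool handles the other NumPy sections (Returns, Examples, and so on) in the same spirit.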


## topbox

A small Python 3 wrapper around the Stanford Topic Modeling Toolbox (STMT) that makes working with L-LDA a bit easier; no need to leave the Python environment. More information on its workings can be found on my blog. Code sample:
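As a sketch, the snippet below prepares L-LDA input the way the STMT side expects it (plain texts paired with space-separated label strings; this format, the `topbox.STMT` name, and the call shapes are assumptions from memory, so check the blog post and README for the exact API). The training call itself needs Java and the STMT jar, so it is shown in comments only:

```python
# L-LDA training data: each document paired with one or more labels.
docs = ["the match ended in a draw",
        "new phone released this week",
        "team wins the cup final"]
labels = [["sports"], ["tech"], ["sports"]]

# STMT takes parallel lists of texts and space-separated label strings
# (assumption about the expected input format).
label_strings = [" ".join(ls) for ls in labels]

# With topbox installed and Java available, training would look like:
# import topbox
# stmt = topbox.STMT('demo_model')   # model name is illustrative
# stmt.train(docs, label_strings)

print(label_strings)
```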

## ebacs

in development

Minimal working version of a bottle.py front-end to ec2latex. Currently, it demonstrates how conference attendees can submit their abstracts (which may include LaTeX code) via a form, after which the submission is added to the database. From the front page, the book of abstracts can be compiled as a demonstration. The idea is that this functionality can be embedded in your conference website. Requires bottle, cork, beaker, and blitzdb.

## ec2latex

Python tool for converting XML-based conference submissions (such as those exported by EasyChair) to a full LaTeX book of abstracts. A sample of the end result can be found here. After some manual work on the .tex files (described in the GitHub README), it can simply be called with:

python ec2latex.py


This code has been integrated into ebacs.
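The XML-to-LaTeX step can be sketched with the standard library alone. The XML layout and field names below are invented for illustration (real EasyChair exports differ), and the LaTeX template is a toy, not ec2latex's actual one:

```python
import xml.etree.ElementTree as ET

# Toy submission export; field names are assumptions for this sketch.
XML = """<submissions>
  <submission>
    <title>On Text Mining</title>
    <authors>A. Writer and B. Coauthor</authors>
    <abstract>We study things.</abstract>
  </submission>
</submissions>"""

def to_latex(xml_string):
    """Render each submission as a LaTeX abstract entry."""
    root = ET.fromstring(xml_string)
    entries = []
    for sub in root.iter("submission"):
        entries.append(
            "\\section*{%s}\n\\emph{%s}\n\n%s\n" % (
                sub.findtext("title"),
                sub.findtext("authors"),
                sub.findtext("abstract"),
            ))
    return "\n".join(entries)

print(to_latex(XML))
```

In practice the entries would be concatenated into a preamble-wrapped .tex file and compiled into the book of abstracts.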