Resources

Code, notebooks, and data that I have made openly available.

Code


Omesa

 

A small framework for reproducible Text Mining research that largely builds on top of scikit-learn. Its goal is to make common research procedures fully automated, optimized, and well recorded. To this end it features:

End-to-End classification in 2 minutes:

from omesa.experiment import Experiment
from omesa.featurizer import Ngrams

conf = {
    "gram_experiment": {
        "name": "gram_experiment",
        "train_data": ["./n_gram.csv"],
        "has_header": True,
        "features": [Ngrams(level='char', n_list=[3])],
        "text_column": 1,
        "label_column": 0,
        "folds": 10,
        "save": ("log")
    }
}

for experiment, configuration in conf.items():
    Experiment(configuration)

Output:

---- Omesa ----

 Config:

        feature:   char_ngram
        n_list:    [3]

    name: gram_experiment
    seed: 111

 Sparse train shape: (20, 1287)

 Tf-CV Result: 0.8
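For readers unfamiliar with the character-level n-gram features configured above, here is a minimal plain-Python sketch of the idea. It is not Omesa's implementation: the helper names `char_ngrams` and `vectorize` are made up for illustration, and it only shows how texts become rows of n-gram counts, which is presumably what the (documents, features) shape in the output refers to.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return a Counter of all overlapping character n-grams in `text`."""
    # Slide a window of width n over the string, one character at a time.
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def vectorize(texts, n=3):
    """Map texts to count vectors over a shared, sorted n-gram vocabulary."""
    counts = [char_ngrams(t, n) for t in texts]
    # The vocabulary is every distinct n-gram seen in any of the texts.
    vocab = sorted(set().union(*counts))
    # One row per text, one column per n-gram in the vocabulary.
    return vocab, [[c[g] for g in vocab] for c in counts]


vocab, rows = vectorize(["text mining", "mining text"])
print(len(rows), len(vocab))  # number of documents, number of features
```

A real pipeline would store these rows in a sparse matrix, since most n-grams never occur in most documents.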

markdoc

This piece of code converts NumPy-style Python docstrings, such as those used in scikit-learn, to Markdown with minimal dependencies. That way, only the documentation contained in the code needs to be edited, and your documentation on, for example, Read the Docs can be updated automatically afterwards with some configuration.

Simply type:

python3 markdoc.py /dir/to/somefile.py /dir/to/somedoc.md
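To give a rough idea of what such a conversion involves, here is a small standard-library sketch. It is not how markdoc itself works: the function name `docstrings_to_markdown` is hypothetical, and it only handles top-level functand classes and the dash-underlined section titles of the NumPy style.

```python
import ast
import re


def docstrings_to_markdown(source):
    """Render the docstrings of top-level functions and classes as Markdown."""
    out = []
    for node in ast.parse(source).body:
        if not isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            continue
        doc = ast.get_docstring(node)  # dedented docstring, or None
        if doc is None:
            continue
        out.append("## " + node.name)
        # NumPy-style sections are a title line underlined with dashes,
        # e.g. "Parameters" followed by "----------"; turn them into headings.
        doc = re.sub(r"^(\w[\w ]*)\n-{3,}$", r"### \1", doc, flags=re.M)
        out.append(doc)
    return "\n\n".join(out)


example = '''
def add(a, b):
    """Add two numbers.

    Parameters
    ----------
    a : int
        First term.
    b : int
        Second term.

    Returns
    -------
    int
        The sum.
    """
    return a + b
'''

print(docstrings_to_markdown(example))
```

Because the source is parsed with `ast` rather than imported, the module being documented never has to be executed.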

topbox

A small Python 3 wrapper around the Stanford Topic Modeling Toolbox (STMT) that makes working with L-LDA a bit easier; no need to leave the Python environment. More information on its workings can be found on my blog. Code sample:

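from sklearn.metrics import average_precision_score

import topbox

stmt = topbox.STMT('bit_of_testing', epochs=10, mem=15000)

# Training documents and their labels (space-separated, one string per document).
X = ['text text more text', 'things to do with text']
y = ['label1 label2', 'label1 label3']

stmt.train(X, y)

# Held-out documents and their gold-standard labels.
infer = ['this is a text', 'things with more text']
gs = ['label1 label2', 'label1 label3']

stmt.test(infer, gs)

# Collect gold labels and predicted scores, then evaluate.
y_true, y_score = stmt.results(gs, array=True)
print(average_precision_score(y_true, y_score))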
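The labels in the sample above are passed as space-separated strings, one string per document. As an illustration of that format only (this is not topbox's internal code, and `binarize_labels` is a made-up helper), the sketch below turns such strings into the 0/1 indicator rows that multi-label metrics typically expect.

```python
def binarize_labels(label_strings):
    """Turn space-separated label strings into rows of 0/1 indicators."""
    # Each document gets the set of labels found in its string.
    label_sets = [set(s.split()) for s in label_strings]
    # Columns are all distinct labels, in sorted order.
    classes = sorted(set().union(*label_sets))
    return classes, [[int(c in ls) for c in classes] for ls in label_sets]


classes, y_true = binarize_labels(['label1 label2', 'label1 label3'])
print(classes)  # ['label1', 'label2', 'label3']
print(y_true)   # [[1, 1, 0], [1, 0, 1]]
```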

ebacs

  in development

Minimal working version of a bottle.py front-end to ec2latex. Currently, it demonstrates how conference attendees can submit their abstracts (which may include LaTeX code) via a form, after which the submission is added to the database. From the front page, the book of abstracts can then be compiled by way of demonstration. The idea is that this functionality can be embedded in your conference website. Requires bottle, cork, beaker and blitzdb.

twitter

ec2latex

Python tool for converting XML-based conference submissions (such as those from EasyChair) to a full LaTeX book of abstracts. A sample of the end result can be found here. After some manual work on the .tex files (described in the GitHub README), it can simply be called with:

python ec2latex.py

This code has been integrated into ebacs.
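As a rough sketch of the kind of XML-to-LaTeX conversion involved, using only the standard library: the element names (`submission`, `title`, `author`, `abstract`) and the function `submissions_to_latex` are hypothetical and do not reflect the actual EasyChair export or the ec2latex code.

```python
import xml.etree.ElementTree as ET

# Hypothetical input; a real export will use a different schema.
SAMPLE = """
<submissions>
  <submission>
    <title>Character n-grams for author profiling</title>
    <author>A. Author</author>
    <author>B. Author</author>
    <abstract>We compare features.</abstract>
  </submission>
</submissions>
"""


def submissions_to_latex(xml_text):
    """Render each submission as an unnumbered LaTeX section."""
    root = ET.fromstring(xml_text.strip())
    chunks = []
    for sub in root.iter("submission"):
        title = sub.findtext("title", default="").strip()
        authors = ", ".join((a.text or "").strip() for a in sub.findall("author"))
        # Abstracts may already contain LaTeX, so the text is not escaped here.
        abstract = sub.findtext("abstract", default="").strip()
        chunks.append(
            "\\section*{%s}\n\\textit{%s}\n\n%s\n" % (title, authors, abstract)
        )
    return "\n".join(chunks)


print(submissions_to_latex(SAMPLE))
```

Titles and author names are inserted unescaped as well, which is one reason some manual work on the resulting .tex files may still be needed.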