- Mar 25, 2017
This post was written as a reply to a question asked in the Social Data Mining course.
- Apr 14, 2016
This is a follow-up to this post.
- Apr 11, 2016
First off, I would like to thank Sebastian Raschka and Chris Wagner for providing the text and code that proved essential for writing this blog post. Read the follow-up to this post here.
- Dec 02, 2015
Recently I have been experimenting with (pretty) fast $n$-gram extraction for feature space construction. As I have experienced first-hand, there are quite a few caveats when building your own functions. This blog post gives a very short introduction to $n$-grams, explains their extraction with examples as well as code (section 2), and finally presents a very effective method of extracting them with a low memory footprint (section 3).
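To give a rough idea of what such an extraction can look like, here is a minimal sketch in Python. The function name `ngrams`, the whitespace tokenisation, and the lazy `zip`-based approach are assumptions made for illustration; they are not necessarily the method described in the post itself.

```python
from collections import Counter
from itertools import islice


def ngrams(tokens, n):
    """Lazily yield n-grams (as tuples) from a sequence of tokens.

    `tokens` must be a re-iterable sequence such as a list: each of the
    n shifted views below needs its own independent iterator.
    """
    # zip n views of the sequence, each shifted one position further;
    # zip stops at the shortest view, so no partial n-grams are produced
    return zip(*(islice(tokens, i, None) for i in range(n)))


text = "the cat sat on the mat"
tokens = text.split()  # naive whitespace tokenisation (assumption)

bigrams = list(ngrams(tokens, 2))
unigram_counts = Counter(ngrams(tokens, 1))
```

Because `ngrams` returns a lazy iterator, it can be fed straight into a `Counter` without first materialising the full list of $n$-grams, which keeps memory use modest for long token sequences.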
- Sep 23, 2015
At some point in a webdev’s life, one might consider moving away from classic web development in, for example, PHP, and moving on to more convenient frameworks. If you already know a language, learning another one just for web development can also seem like a waste of time. Alternatives might be more elegant, flexible, and better integrated with the things you’re currently writing. I found myself in the position of having to demo my work on author profiling, which was written entirely in Python 3 (`sklearn`). The problem with framework switching, I found, is learning how to do frustratingly arbitrary operations all over again. Moreover, however sluggish LAMP set-ups might seem, they are very robust, well-understood and have plenty of documentation. Changing to more obscure environments, like I did with `bottle.py`, can shovel you in the face at any point. It’s been a bumpy road to say the least, so to help anyone pursuing this path in the future, I hereby present you with my findings thus far.
- Sep 07, 2015
Python 3 has been out for quite some time, and it’s still notoriously ignored by the majority of the community. Luckily, all major libraries that I make use of have already made the small leap and, as behaviour has only changed slightly, I never ran into any Py3-specific issues. Until today, that is. After writing quite the collection of classes, and storing their initialized states along with several module-specific objects all in one container, it was ready to be pickled and transferred. Crossed fingers and naively high hopes notwithstanding, after a few lines of log output rolled over the screen the following presented itself:
- Jun 17, 2015
For the line of research I started with my master thesis I made very frequent use of the Stanford Topic Modeling Toolbox (STMT). It is a very nifty module that aims to provide an interface to the Topic Models by David Blei: Latent Dirichlet Allocation and its Supervised variant. This toolbox was intended to make working with these models more accessible to researchers in the humanities and social sciences. It is therefore a standalone model that uses `.csv` files structured in the correct format as input, and `.scala` files for setting up the train-test routine. These can then be piped to the `.java` bundle, which will in turn output a directory with a trained model and several `.csv`s with its output. Pretty good stuff given that you do not want to use it for classification. The funny thing is, to date this toolbox is the only standalone implementation of Labelled-LDA, and it does not offer an intuitive interface for extracting results from a tested model. I figured that if one requires this, one might as well omit all the file hassle and make a nice script out of it.
- May 08, 2015
Lately I have been collecting a large number of tweets for building a good representation of Twitter users’ expected social discourse and its meta-data. Basically, a fancy way of saying that I want to see who publicly shares what, and with whom. After some digging around, I settled on Tweepy to interface with the Twitter API. There were several scenarios which I was looking to implement: grabbing the available associates (followers, friends) and public timeline given a user’s name, and resolving a large number of tweets given a set of tweet IDs. Don’t get me wrong, Tweepy offers a very nice interface. It was a bit too general-purpose for my liking though, so I started building a wrapper class around Tweepy. In this post, I will talk a bit about its functionality, considerations and future improvements while discussing the task of utilizing the Twitter API for Natural Language Processing-related research.
- Jan 14, 2015
Over the last few weeks, I have been working on a Python script that can convert an XML file generated by EasyChair to a Book of Abstracts used at conferences. One of the problems I faced was that I had to create a custom Table of Contents with the titles of the submitted abstracts and their page numbers, and their authors below.
- Jan 14, 2015
Welcome to my blog, which I intend to use to spend some time elucidating my programming-related activities. As a Ph.D. in computational linguistics and machine learning, and a ‘casual’ web developer, I regularly work with languages such as Python, LaTeX, R, and PHP. Having had much aid during my many coding sessions from various blogs and Q&As, I decided that it might be good to give back to the community, so to speak. I will try to focus mainly on interesting snippets rather than more introductory stuff, which you will hopefully find informative.