(Pretty) Fast $n$-Gram Features with Python

Recently I have been experimenting with (pretty) fast $n$-gram extraction for feature space construction. As I have clearly experienced, there are a bunch of caveats while building your own functions. This blog post gives a very short introduction to $n$-grams, explains their extraction in examples as well as in code (section 2). And finally gives a very effective and low memory footprint method of extracting them (section 3).

img

Extracting $n$-Grams

The extraction of $n$-grams is a commonly used as a so-called bag-of-words method used in Natural Language Processing (NLP) to represent a collection of documents. In a very basic example we want to either distinguish or relate sentence $A$, $B$, and $C$ which are the following:

A = 'text about stuff'
B = 'stuff about text'
C = 'text about n-grams'

The $n$ in $n$-grams typically refers to a scope. One can look at it as going over the sentence with a window of some size. So with a 1-gram (referred to as a unigram) we partition the text in chunks of size one. So sentence $A$ would give us text, about, and stuff. If we would put this into a matrix of gram * sentence, we would get the following for the three sentences:

 $A$$B$$C$
text111
about111
stuff110
n-grams001

Read as: the gram text occurs in text $A$ one time, while n-grams does not occur in $A$.

Looking at the matrix, one can quickly see that despite the order of words, sentence $A$ and $B$ look the same. If we take their vector they can be observed to be identical:

$\vec{A} = [1, 1, 1, 0], \vec{B} = [1, 1, 1, 0]$

This is is what it means to have a bag-of-words. Words are thrown into a bag, scrambled in any order and counted. Order does not matter to the model. So how can we incorporate some structure into it? Taking word bi-grams, where $n = 2$ would probably already help our case. Consider:

 $A$$B$$C$
text about101
about stuff100
stuff about010
about text010
about n-grams001

Now the vectors of $A$, $B$ and $C$ are all unique. Moreover, it can actually be inferred that $A$ and $C$ have some relation, but they both don’t relate to $B$. Done right? Well, not quite. What if we introduce a fourth sentence:

D = 'n-grams are handy'

We construct the matrix again:

 $A$$B$$C$$D$
text about1010
about stuff1000
stuff about0100
about text0100
about n-grams0010
n-grams are0001
are handy0001

We could say that $D$ relates to $C$, but it doesn’t show with this method. With our uni-gram method it might, but we run into trouble with $A$ and $B$ being identical again. Instead of words, we might draw more information from the character combinations instead. Let’s take tri-grams (3-grams), and see:

 $A$$B$$C$$D$
tex1110
ext1110
xta1010
tab1010
abo1110
bou1110
out1110
tst1000
uts1000
stu1100
tuf1100
uff1100
ffa0100
fab0100
utt0100
tte0100
utn0010
tn-0010
n-g0011
-gr0011
gra0011
ram0011
ams0011
msa0001
sar0001
are0001
reh0001
eha0001
han0001
and0001
ndy0001

Note that I removed spaces to avoid the matrix blowing up even more.

Now, the overlap between the different sentences is more evident, but with an explosion of the matrix as a result. As can be observed, different grams vary in usefulness; for sentence relatedness, tri-grams might more useful - however, they obfuscate the actual word usage. The latter can be seen as problematic when the task is to distinguish certain topics. For example, if we want to measure the relatedness of sentences $A$ through $D$ to NLP, it’s more effective to represent n-grams as a full chunk rather than ['n-g', '-gr', 'gra', etc.]. The explosion of the amount of grams per sentence makes it harder for algorithms to uncover relations; as it would have to rely on a combination of the grams for information, rather than their single occurrences. Another caveat is the sparseness of the vectors: the rarer the occurrences, the more zeroes are present in sentence vectors, the less information a vector provides per bit. In the above example this is already pretty evident from a very small example, imagine a corpus of a million documents. How one would go abouts programming this in an effective manner is the topic of this particular post.

Method 1 - a Little Naive, But Dependency Free

First we will consider implementing a simple sliding window of $n$ to extract the grams from a sentence. Such that, given sentence $A$ again, we get the grams as we saw in the previous example. So something like:

In [1]: find_ngrams('text about stuff', n=2)
Out[1]: ['text about', 'about stuff']

Or even:

In [1]: find_ngrams('text about stuff', n_list=[1, 2])
Out[1]: ['text', 'about', 'stuff', 'text about', 'about stuff']

An effective algorithm as proven by this post is to abuse zip and slicing in python. So:

def find_ngrams(sentence, n):
    """Magic n-gram function."""
    inp = sentence.split()
    return zip(*[inp[i:] for i in range(n)])

The above method will give us a list of tuples of size $n$, extracted in a sliding window. If we alter the function slightly, we will be able to achieve the desired result:

def find_ngrams(sentence, n_list):
    """Magic n-gram function."""
    inp, grams = sentence.split(), []
    for n in n_list:
      grams += [' '.join(x) for x in zip(*[inp[i:] for i in range(n)])]
    return grams

Now we should be able to turn sentences into vectors representing the gram occurrences in a sentence. So that the the a a thing would at least yield [2, 2, 1]. This entails incorporating the search function into a neat class that can fit the known grams and make sure their index in the vector is the same for all sentences. A very compact class doing exactly this would look something like:

class Ngrams:

    def __init__(self, n_list):
        self.n_list = n_list
        self.indices = {}

    def fit(self, sentence):
        """Magic n-gram function fits to vector indices."""
        i, inp = len(self.indices)-1, sentence.split()
        for n in self.n_list:
            for x in zip(*[inp[i:] for i in range(n)]):
                if self.indices.get(x) == None:
                    i += 1
                    self.indices.update({x: i})

    def transform(self, sentence):
        """Given a sentence, convert to a gram vector."""
        v, inp = [0] * len(self.indices), sentence.split()
        for n in self.n_list:
            for x in zip(*[inp[i:] for i in range(n)]):
                if self.indices.get(x) != None:
                    v[self.indices[x]] += 1
        return v

First we call the fit method to extract the n-grams and index them to self.indices. Next time the function sees a sentences, it will know where in a vector to place the frequencies, as well as which words are not part of that vector. This can be seen in the transform part where it begins with an empty vector of size self.indices and starts filling in the frequencies. A major drawback of this approach is that the corpus needs to be iterated over twice; once for extracting all possible $n$-grams, and once this is known, another passover to convert all sentences to vectors. Still, this works, proven by the example below. Please pay close attention to the dict.get == None or != None parts. Given that we have a {gram: index} dictionary, a simple if self.indices.get would not pass a 0 index, since Python sees that as False.

In [1]: ng = Ngrams(n_list=[1])

In [2]: ng.fit('text about stuff')

In [3]: ng.transform('text about stuff')
Out[3]: [1, 1, 1]

In [4]: ng.fit('n-grams are handy')

In [5]: ng.transform('text about n-grams')
Out[5]: [1, 1, 0, 1, 0, 0]

In [6]: ng.indices
Out[6]:
{('about',): 1,
 ('are',): 4,
 ('handy',): 5,
 ('n-grams',): 3,
 ('stuff',): 2,
 ('text',): 0}

To counter the complexity issue from passing over whatever copora twice, it would be a lot better if a vector or even a matrix could be constructed from the frequencies that we can already extract at the fit step. For this, a feature hasher such as the one implemented in sklearn, or the dict vectorizer would be of great use, as will become clear after the next section.

Scikit-learn is a very extensive toolkit for machine learning in Python. In addition to classifiers, it also provides tools for feature extraction, evaluation, optimization, etc. The FeatureHasher uses what is known as the hashing trick. If you’re interested in its workings, it is very well explained in this blog.

Method 2 - Sentences to Sparse Vectors

If some refactoring is done on the previous class, we could come up with the following, very minimal function:

from collections import Counter

def extract_grams(sentence, n_list):
    inp = sentence.split()
    return Counter([' '.join(gram) for x in zip(*[inp[i:]
                    for i in range(n)]) for n in n_list])

This does exactly what we want:

In [1]: extract_grams('this is some text about text this is', n_list=[1, 2])
Out[1]: Counter({('about',): 1,
         ('about', 'text'): 1,
         ('is',): 2,
         ('is', 'some'): 1,
         ('some',): 1,
         ('some', 'text'): 1,
         ('text',): 2,
         ('text', 'about'): 1,
         ('text', 'this'): 1,
         ('this',): 2,
         ('this', 'is'): 2})

We can feed this to the FeatureHasher so that it is transformed to a sparse space. This approach is clever in a number of ways: (1) $n$-grams can be extracted iteratively and therefore your corpus does not need to be in memory. One could for example read from a .csv and fit the hasher like so:

import csv
from sklearn.feature_extraction import FeatureHasher
from collections import Counter

def extract_grams(sentence, n_list):
    inp = sentence.split()
    return Counter([' '.join(gram) for x in zip(*[inp[i:]
                    for i in range(n)]) for n in n_list])

reader, hasher = csv.reader(open('some.csv', 'r')), FeatureHasher()
X = hasher.fit_transform([extract_grams(sentence, n_list=[1, 2])
                          for sentence in reader])

Note that the ' '.join is necessary because FeatureHasher only handles strings as features (DictVectorizer can deal with tuples though). (2) the iteration only has to be done once; therefore, you corpus does not have to be stored beforehand. In this sense, it can (3) be applied for fast online learning purposes seen in for example Vowpal Wabbit and Torch. So there you have it, a self-extensible, pretty minimal way to quickly and memory efficiently get a bag-of-words space.