(Pretty) Fast $n$-Gram Features with Python
Recently I have been experimenting with (pretty) fast $n$-gram extraction for constructing feature spaces, and I have run into a number of caveats while building my own functions. This blog post gives a very short introduction to $n$-grams, explains their extraction by example as well as in code (section 2), and finally gives a very effective, low-memory-footprint method of extracting them (section 3).
Extracting $n$-Grams
The extraction of $n$-grams is a commonly used as a so-called bag-of-words method used in Natural Language Processing (NLP) to represent a collection of documents. In a very basic example we want to either distinguish or relate sentence $A$, $B$, and $C$ which are the following:
A = 'text about stuff'
B = 'stuff about text'
C = 'text about n-grams'
The $n$ in $n$-grams typically refers to a scope: one can think of it as sliding a window of some size over the sentence. With a 1-gram (referred to as a unigram) we partition the text into chunks of size one, so sentence $A$ would give us `text`, `about`, and `stuff`. If we put this into a gram × sentence matrix, we get the following for the three sentences:
gram | $A$ | $B$ | $C$ |
---|---|---|---|
text | 1 | 1 | 1 |
about | 1 | 1 | 1 |
stuff | 1 | 1 | 0 |
n-grams | 0 | 0 | 1 |
Read as: the gram `text` occurs in sentence $A$ one time, while `n-grams` does not occur in $A$.
Looking at the matrix, one can quickly see that, despite the different word order, sentences $A$ and $B$ look the same. Taking their vectors, they can be observed to be identical:
$\vec{A} = [1, 1, 1, 0], \vec{B} = [1, 1, 1, 0]$
This is what it means to have a bag-of-words: words are thrown into a bag, scrambled in any order, and counted. Order does not matter to the model. So how can we incorporate some structure into it? Taking word bi-grams, where $n = 2$, would probably already help our case. Consider:
gram | $A$ | $B$ | $C$ |
---|---|---|---|
text about | 1 | 0 | 1 |
about stuff | 1 | 0 | 0 |
stuff about | 0 | 1 | 0 |
about text | 0 | 1 | 0 |
about n-grams | 0 | 0 | 1 |
Now the vectors of $A$, $B$ and $C$ are all unique: $\vec{A} = [1, 1, 0, 0, 0]$, $\vec{B} = [0, 0, 1, 1, 0]$, $\vec{C} = [1, 0, 0, 0, 1]$. Moreover, it can actually be inferred that $A$ and $C$ have some relation, while neither relates to $B$. Done, right? Well, not quite. What if we introduce a fourth sentence:
D = 'n-grams are handy'
We construct the matrix again:
gram | $A$ | $B$ | $C$ | $D$ |
---|---|---|---|---|
text about | 1 | 0 | 1 | 0 |
about stuff | 1 | 0 | 0 | 0 |
stuff about | 0 | 1 | 0 | 0 |
about text | 0 | 1 | 0 | 0 |
about n-grams | 0 | 0 | 1 | 0 |
n-grams are | 0 | 0 | 0 | 1 |
are handy | 0 | 0 | 0 | 1 |
We could say that $D$ relates to $C$, but this method does not show it: their bi-gram vectors, $\vec{C} = [1, 0, 0, 0, 1, 0, 0]$ and $\vec{D} = [0, 0, 0, 0, 0, 1, 1]$, share no non-zero dimension. With our uni-gram method it might show, but then we run into trouble with $A$ and $B$ being identical again. Rather than words, we might draw more information from character combinations. Let's take character tri-grams (3-grams) and see:
gram | $A$ | $B$ | $C$ | $D$ |
---|---|---|---|---|
tex | 1 | 1 | 1 | 0 |
ext | 1 | 1 | 1 | 0 |
xta | 1 | 0 | 1 | 0 |
tab | 1 | 0 | 1 | 0 |
abo | 1 | 1 | 1 | 0 |
bou | 1 | 1 | 1 | 0 |
out | 1 | 1 | 1 | 0 |
tst | 1 | 0 | 0 | 0 |
uts | 1 | 0 | 0 | 0 |
stu | 1 | 1 | 0 | 0 |
tuf | 1 | 1 | 0 | 0 |
uff | 1 | 1 | 0 | 0 |
ffa | 0 | 1 | 0 | 0 |
fab | 0 | 1 | 0 | 0 |
utt | 0 | 1 | 0 | 0 |
tte | 0 | 1 | 0 | 0 |
utn | 0 | 0 | 1 | 0 |
tn- | 0 | 0 | 1 | 0 |
n-g | 0 | 0 | 1 | 1 |
-gr | 0 | 0 | 1 | 1 |
gra | 0 | 0 | 1 | 1 |
ram | 0 | 0 | 1 | 1 |
ams | 0 | 0 | 1 | 1 |
msa | 0 | 0 | 0 | 1 |
sar | 0 | 0 | 0 | 1 |
are | 0 | 0 | 0 | 1 |
reh | 0 | 0 | 0 | 1 |
eha | 0 | 0 | 0 | 1 |
han | 0 | 0 | 0 | 1 |
and | 0 | 0 | 0 | 1 |
ndy | 0 | 0 | 0 | 1 |
Note that I removed spaces to avoid the matrix blowing up even more.
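To make the character-level variant concrete, here is a minimal sketch of how such tri-grams could be extracted; `find_char_ngrams` is a throwaway name of mine, and stripping the spaces is an assumption that mirrors the table above rather than part of the methods discussed below:

def find_char_ngrams(sentence, n=3):
    """Slide a window of n characters over the sentence, spaces removed."""
    chars = sentence.replace(' ', '')
    return [chars[i:i + n] for i in range(len(chars) - n + 1)]

In [1]: find_char_ngrams('text about stuff')
Out[1]: ['tex', 'ext', 'xta', 'tab', 'abo', 'bou', 'out', 'uts', 'tst', 'stu', 'tuf', 'uff']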
Now the overlap between the different sentences is more evident, but at the cost of an explosion of the matrix. As can be observed, different grams vary in usefulness; for sentence relatedness, character tri-grams might be more useful, but they obfuscate the actual word usage. The latter is problematic when the task is to distinguish certain topics. For example, if we want to measure the relatedness of sentences $A$ through $D$ to NLP, it is more effective to represent `n-grams` as a full chunk rather than as `['n-g', '-gr', 'gra', ...]`. The explosion of the number of grams per sentence also makes it harder for algorithms to uncover relations, as they have to rely on combinations of grams for information rather than on single occurrences. Another caveat is the sparseness of the vectors: the rarer a gram, the more zeroes appear in the sentence vectors, and the less information a vector provides per bit. This is already evident in our tiny example; now imagine a corpus of a million documents. How to go about programming this in an effective manner is the topic of this particular post.
Method 1 - a Little Naive, But Dependency Free
First we will consider implementing a simple sliding window of size $n$ to extract the grams from a sentence, such that, given sentence $A$ again, we get the grams we saw in the previous example. So something like:
In [1]: find_ngrams('text about stuff', n=2)
Out[1]: ['text about', 'about stuff']
Or even:
In [1]: find_ngrams('text about stuff', n_list=[1, 2])
Out[1]: ['text', 'about', 'stuff', 'text about', 'about stuff']
An effective trick, as demonstrated in this post, is to abuse `zip` and slicing in Python:
def find_ngrams(sentence, n):
    """Magic n-gram function."""
    inp = sentence.split()
    # zip the word list against n shifted copies of itself; wrap in
    # list() because zip returns a lazy iterator in Python 3
    return list(zip(*[inp[i:] for i in range(n)]))
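A quick check of what that returns (output assuming Python 3, where the `zip` result is wrapped in `list()` as above):

In [1]: find_ngrams('text about stuff', n=2)
Out[1]: [('text', 'about'), ('about', 'stuff')]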
The above method will give us a list of tuples of size $n$, extracted in a sliding window. If we alter the function slightly, we will be able to achieve the desired result:
def find_ngrams(sentence, n_list):
"""Magic n-gram function."""
inp, grams = sentence.split(), []
for n in n_list:
grams += [' '.join(x) for x in zip(*[inp[i:] for i in range(n)])]
return grams
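Which indeed yields both the uni- and bi-grams in one flat list:

In [1]: find_ngrams('text about stuff', n_list=[1, 2])
Out[1]: ['text', 'about', 'stuff', 'text about', 'about stuff']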
Now we should be able to turn sentences into vectors representing the gram occurrences in a sentence, so that `the the a a thing` would at least yield `[2, 2, 1]`. This entails incorporating the search function into a neat class that fits the known grams and makes sure their index in the vector is the same for all sentences. A very compact class doing exactly this would look something like:
class Ngrams:

    def __init__(self, n_list):
        self.n_list = n_list
        self.indices = {}

    def fit(self, sentence):
        """Extract n-grams and record a vector index for each new gram."""
        i, inp = len(self.indices) - 1, sentence.split()
        for n in self.n_list:
            for gram in zip(*[inp[j:] for j in range(n)]):
                # compare against None explicitly: a stored index of 0
                # would evaluate to False in a plain truthiness check
                if self.indices.get(gram) is None:
                    i += 1
                    self.indices[gram] = i

    def transform(self, sentence):
        """Given a sentence, convert it to a gram frequency vector."""
        v, inp = [0] * len(self.indices), sentence.split()
        for n in self.n_list:
            for gram in zip(*[inp[j:] for j in range(n)]):
                if self.indices.get(gram) is not None:
                    v[self.indices[gram]] += 1
        return v
First we call the `fit` method to extract the $n$-grams and index them in `self.indices`. The next time the class sees a sentence, it knows where in a vector to place each frequency, as well as which grams are not part of that vector. This can be seen in the `transform` method, which starts with an empty vector of length `len(self.indices)` and fills in the frequencies. A major drawback of this approach is that the corpus needs to be iterated over twice: once to extract all possible $n$-grams, and, once these are known, another pass to convert all sentences to vectors. Still, it works, as the example below shows. Please pay close attention to the `is None` and `is not None` checks: given that we have a `{gram: index}` dictionary, a simple `if self.indices.get(gram)` would not pass a `0` index, since Python evaluates that as `False`.
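To make that pitfall concrete, a tiny illustration (the gram and the index here are just a hypothetical entry):

In [1]: indices = {('text',): 0}
In [2]: bool(indices.get(('text',)))
Out[2]: False
In [3]: indices.get(('text',)) is not None
Out[3]: True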
In [1]: ng = Ngrams(n_list=[1])
In [2]: ng.fit('text about stuff')
In [3]: ng.transform('text about stuff')
Out[3]: [1, 1, 1]
In [4]: ng.fit('n-grams are handy')
In [5]: ng.transform('text about n-grams')
Out[5]: [1, 1, 0, 1, 0, 0]
In [6]: ng.indices
Out[6]:
{('about',): 1,
('are',): 4,
('handy',): 5,
('n-grams',): 3,
('stuff',): 2,
('text',): 0}
To counter the complexity issue of passing over the corpus twice, it would be a lot better if a vector, or even a matrix, could be constructed from the frequencies that we can already extract at the `fit` step. For this, a feature hasher such as the one implemented in sklearn, or the dict vectorizer, is of great use, as will become clear in the next section.
Scikit-learn is a very extensive toolkit for machine learning in Python. In addition to classifiers, it also provides tools for feature extraction, evaluation, optimization, etc. The FeatureHasher uses what is known as the hashing trick. If you’re interested in its workings, it is very well explained in this blog.
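The gist of the hashing trick: instead of storing a vocabulary that maps every gram to a column, each gram is hashed, and the hash modulo a fixed number of columns becomes its index. A minimal sketch of the idea follows; `hash_vectorize` is a made-up helper for illustration only, not scikit-learn's implementation, which uses a signed MurmurHash3 under the hood:

def hash_vectorize(counts, n_features=16):
    """Map a {gram: count} dict into a fixed-size vector via hashing."""
    # Python's built-in hash() is salted per process for strings, so the
    # column positions differ between runs; that is fine for illustration
    v = [0] * n_features
    for gram, count in counts.items():
        v[hash(gram) % n_features] += count
    return v

hash_vectorize({'text': 1, 'about': 1, 'stuff': 1})  # -> 16 numbers, mostly zeros

No matter how many distinct grams come in, the vector length stays at `n_features`; the price is that different grams may collide on the same column.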
Method 2 - Sentences to Sparse Vectors
If some refactoring is done on the previous class, we could come up with the following, very minimal function:
from collections import Counter
def extract_grams(sentence, n_list):
inp = sentence.split()
return Counter([' '.join(gram) for n in n_list
for gram in zip(*[inp[i:] for i in range(n)])])
This does exactly what we want:
In [1]: extract_grams('this is some text about text this is', n_list=[1, 2])
Out[1]: Counter({'this': 2,
         'is': 2,
         'text': 2,
         'this is': 2,
         'some': 1,
         'about': 1,
         'is some': 1,
         'some text': 1,
         'text about': 1,
         'about text': 1,
         'text this': 1})
We can feed this to the `FeatureHasher` so that it is transformed into a sparse space. This approach is clever in a number of ways: (1) $n$-grams can be extracted iteratively, and therefore your corpus does not need to fit in memory. One could for example read from a `.csv` file and fit the hasher like so:
import csv
from collections import Counter
from sklearn.feature_extraction import FeatureHasher

def extract_grams(sentence, n_list):
    inp = sentence.split()
    return Counter([' '.join(gram) for n in n_list
                    for gram in zip(*[inp[i:] for i in range(n)])])

# csv.reader yields one list of fields per row; here we assume the
# sentence lives in the first column of some.csv
reader, hasher = csv.reader(open('some.csv', 'r')), FeatureHasher()
X = hasher.fit_transform([extract_grams(row[0], n_list=[1, 2])
                          for row in reader])
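The resulting `X` is a SciPy sparse matrix with one row per sentence and, by default, $2^{20}$ hashed columns. As a quick sanity check on a single sentence (the smaller `n_features` here is an arbitrary choice for illustration):

In [1]: hasher = FeatureHasher(n_features=2 ** 10)
In [2]: X = hasher.transform([extract_grams('text about stuff', n_list=[1, 2])])
In [3]: X.shape
Out[3]: (1, 1024)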
Note that the `' '.join` is necessary because `FeatureHasher` only handles strings as features (`DictVectorizer` can deal with tuples, though). (2) The iteration only has to be done once; therefore, your corpus does not have to be stored beforehand. In this sense, it can (3) be applied for fast online learning purposes, as seen in for example Vowpal Wabbit and Torch. So there you have it: a self-extensible, pretty minimal way to quickly and memory-efficiently get a bag-of-words space.