Scikit-learn Pipeline Persistence and JSON Serialization
First off, I would like to thank Sebastian Raschka and Chris Wagner for providing the text and code that proved essential for writing this blog post. Read the follow-up to this post here.
For some time now, I have been wanting to replace simply pickling my sklearn pipelines. Pickle is incredibly convenient, but it is easy to corrupt, not very transparent, and has compatibility issues across versions. The latter has been quite a thorn in my side for several projects, and I stumbled upon it again while working on my own small text mining framework. Persistence is imperative when deploying a pipeline to a practical application such as a demo: each piece of new data needs to be transformed into exactly the same vector size as the data offered during development. Therefore, feature extraction, hashing, normalization, etc. have to be exactly the same, feeding data to the same model as after training. After reading Sebastian Raschka's notebook on model persistence for scikit-learn, I figured I might give it a go myself.
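For reference, the baseline I want to move away from is a plain pickle dump. A minimal sketch (the filename is purely illustrative):

import pickle
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
clf = LogisticRegression().fit(iris.data, iris.target)

# convenient, but binary, opaque, and tied to the pickling environment
with open('pipeline.pkl', 'wb') as f:
    pickle.dump(clf, f)
with open('pipeline.pkl', 'rb') as f:
    clf = pickle.load(f)  # may break across sklearn versions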
Please note that all code is in Python 3.x, sklearn 0.17, and numpy 1.9.
Recap: Classifier to JSON
I also tried to use JSON as a storage format. In addition, however, I aimed to store other parts of a pipeline as well. The biggest hurdles are definitely due to numpy: its special Python objects cannot be serialized in JSON, which is limited to bool, int, float, and str for data types, and list and dict for structures. Following Sebastian's notes, I first tried to reproduce his approach for storing classifiers. For a trained model, we can access the parameters via get_params, and the fit information via the class attributes (e.g. classes_ and intercept_ for LogisticRegression). Alternatively, we can just store all class information as follows:
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
iris = load_iris()
X, y = iris.data, iris.target
lr = LogisticRegression(multi_class='multinomial', solver='newton-cg')
lr.fit(X, y)
attr = lr.__dict__
attr
# output ---------
{'C': 1.0,
'class_weight': None,
'classes_': array([0, 1, 2]),
'coef_': array([[-0.42363867, 0.96158336, -2.5193416 , -1.08640712],
[ 0.5342659 , -0.31758963, -0.2054791 , -0.9392839 ],
[-0.11062723, -0.64399373, 2.7248207 , 2.02569101]]),
'dual': False,
'fit_intercept': True,
'intercept_': array([ 9.88272104, 2.21749429, -12.10021533]),
'intercept_scaling': 1,
'max_iter': 100,
'multi_class': 'multinomial',
'n_iter_': array([20], dtype=int32),
'n_jobs': 1,
'penalty': 'l2',
'random_state': None,
'solver': 'newton-cg',
'tol': 0.0001,
'verbose': 0,
'warm_start': False}
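For comparison, get_params only returns the constructor parameters, without any of the _-suffixed fit attributes (output abbreviated here):

print(lr.get_params())
# output ---------
{'C': 1.0, 'class_weight': None, 'dual': False, ..., 'warm_start': False}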
Great, so the _-suffixed keys are fit parameters, whereas the rest are model parameters. The first issue arises here: some of our values are numpy arrays, which are incompatible with JSON. These are pretty straightforward to serialize; we can simply convert them to lists:
import json
import numpy as np

for k, v in attr.items():
    # fit attributes (keys ending in '_') hold numpy arrays; make them lists
    if isinstance(v, np.ndarray) and k[-1:] == '_':
        attr[k] = v.tolist()

json.dump(attr, open('./attributes.json', 'w'))
And sure enough, if we port these back to a new instance of the LogisticRegression class, we are good to go:
attr = json.load(open('./attributes.json'))
lr2 = LogisticRegression()
for k, v in attr.items():
    if isinstance(v, list):
        setattr(lr2, k, np.array(v))  # restore arrays from lists
    else:
        setattr(lr2, k, v)
lr2.predict(X)  # just for testing :)
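To double-check the round trip, we can compare the restored model's predictions against the original's (a quick sanity check of my own, not strictly necessary):

assert (lr.predict(X) == lr2.predict(X)).all()  # identical predictions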
Sadly, life isn’t always this easy.
Problem: Pipeline to JSON
In a broader scenario, one might use other sklearn classes to create a fancy data-to-prediction pipeline. Say that we want to accept some text input and generate $n$-gram features. I wrote about using the DictVectorizer for efficient gram extraction in my previous post, so I'll use it here:
from collections import Counter

def extract_grams(sentence, n_list):
    # count n-grams for every n in n_list by zipping shifted token lists
    tokens = sentence.split()
    return Counter([gram for gram in zip(*[tokens[i:]
                    for n in n_list for i in range(n)])])
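To make the behaviour concrete, this is what the extractor returns for a short sentence (key order may vary):

print(extract_grams("this is an example", [2]))
# output ---------
Counter({('this', 'is'): 1, ('is', 'an'): 1, ('an', 'example'): 1})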
Assume we have some form that accepts user input, represented by text_input, and our training data corpus. First we extract features and fit the vectorizer:
from sklearn.feature_extraction import DictVectorizer
corpus = ["this is an example", "hey more examples", "can we get more examples"]
text_input = "hey can I get more examples"
vec = DictVectorizer().fit([extract_grams(s, [2]) for s in corpus])
print(vec.transform(extract_grams(text_input, [2])))
# output ---------
(0, 2) 1.0
(0, 5) 1.0
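As a quick side check, we can inspect which bigrams those column indices refer to via get_feature_names. Only ('get', 'more') and ('more', 'examples') occur in both the corpus and the new input:

print(vec.get_feature_names())
# output ---------
[('an', 'example'), ('can', 'we'), ('get', 'more'), ('hey', 'more'),
 ('is', 'an'), ('more', 'examples'), ('this', 'is'), ('we', 'get')]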
Sweet, the vectorizer works. Now it can be serialized as before, right?
vec_attr = vec.__dict__
for k, v in vec_attr.items():
    if isinstance(v, np.ndarray) and k[-1:] == '_':
        vec_attr[k] = v.tolist()
json.dump(vec_attr, open('./vec_attributes.json', 'w'))
# output ---------
TypeError: key ('more', 'examples') is not a string
Nope. The tuples used as keys when fitting the vectorizer are not among the data types JSON accepts. Ok, no problem, we just alter the extract_grams function to concatenate them into strings and run it again:
def extract_grams(sentence, n_list):
    tokens = sentence.split()
    return Counter(['_'.join(list(gram)) for gram in zip(*[tokens[i:]
                    for n in n_list for i in range(n)])])

vec = DictVectorizer().fit([extract_grams(s, [2]) for s in corpus])
vec_attr = vec.__dict__
for k, v in vec_attr.items():
    if isinstance(v, np.ndarray) and k[-1:] == '_':
        vec_attr[k] = v.tolist()
json.dump(vec_attr, open('./vec_attributes.json', 'w'))
# output ---------
TypeError: <class 'numpy.float64'> is not JSON serializable
Uh oh.
Serializing Most of Numpy
Life is not simple, and neither is scikit-learn. Actually, among the range of pipeline pieces I have tested, there are many different sources that throw JSON serialization errors. These can be variables that store types, or any other numpy data format (np.int32 and np.float64 are both used in LinearSVC, for example). While some objects have a (limited) Python object representation, one of the harder cases was the error thrown by the DictVectorizer above, whose dtype attribute holds the np.float64 type itself. To convert such a numpy type object, the following is required:
target = np.float64                              # a numpy type object
serialisation = target.__name__                  # 'float64', a JSON-safe string
deserialisation = np.dtype(serialisation).type   # back to the type object
print(target, serialisation, deserialisation)
# output ---------
<class 'numpy.float64'> float64 <class 'numpy.float64'>
So, we actually need a couple of functions that can serialize an entire dictionary of Python and numpy objects, and then deserialize it when we need it again. I was very much helped by Chris Wagner's blog, which already provides quite a big code snippet that does exactly this. I inserted the following lines myself:
def serialize(data):
    ...
    if isinstance(data, type):
        return {"py/numpy.type": data.__name__}
    if isinstance(data, np.integer):
        return {"py/numpy.int": int(data)}
    if isinstance(data, np.float):
        return {"py/numpy.float": data.hex()}
    ...

def deserialize(dct):
    ...
    if "py/numpy.type" in dct:
        return np.dtype(dct["py/numpy.type"]).type
    if "py/numpy.int" in dct:
        return np.int32(dct["py/numpy.int"])
    if "py/numpy.float" in dct:
        return np.float64.fromhex(dct["py/numpy.float"])
    ...
This even retains the floating-point precision by hexing the values for serialization.
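To see why hexing helps, here is a quick roundtrip of a single value (my own illustration): float.hex writes out the exact bits, so nothing is lost to decimal rounding.

x = np.float64(0.1)
tagged = {"py/numpy.float": x.hex()}   # '0x1.999999999999ap-4', exact
restored = np.float64.fromhex(tagged["py/numpy.float"])
assert restored == x                   # bit-for-bit identical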
So using these scripts, we can run the full pipeline by importing Chris' script with my alterations as serialize_sk. First we fit our amazing corpus again, and train the model:
import json
import numpy as np
import serialize_sk as sr
from collections import Counter
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

corpus = ["this is an example", "hey more examples", "can we get more examples"]

def extract_grams(sentence, n_list):
    tokens = sentence.split()
    return Counter(['_'.join(list(gram)) for gram in zip(*[tokens[i:]
                    for n in n_list for i in range(n)])])

vec = DictVectorizer()
D = vec.fit_transform([extract_grams(s, [2]) for s in corpus])
svm = LinearSVC()
svm.fit(D, [1, 0, 1])

atb_vec = vec.__dict__
atb_clf = svm.__dict__
Serialize the vectorizer and model:
def serialize(d, name):
    # convert every attribute to a JSON-safe representation, then dump
    for k, v in d.items():
        d[k] = sr.data_to_json(v)
    json.dump(d, open(name + '.json', 'w'))

serialize(atb_clf, 'clf')
serialize(atb_vec, 'vec')
Now we assume that this is a new application. First, we load the .jsons and deserialize:
new_vec = json.load(open('vec.json'))
new_clf = json.load(open('clf.json'))

def deserialize(class_init, attr):
    # restore each stored attribute onto a fresh class instance
    for k, v in attr.items():
        setattr(class_init, k, sr.json_to_data(v))
    return class_init

vec2 = deserialize(DictVectorizer(), new_vec)
svm2 = deserialize(LinearSVC(), new_clf)
And finally we accept user input, and give back a classification label:
user_input = "hey can I get more examples"
grams = vec2.transform(extract_grams(user_input, [2]))
print(grams, "\n")
print(svm2.predict(grams))
# output ---------
(0, 2) 1.0
(0, 5) 1.0
[1]
And it works!
Conclusion
Chances are that when using different classes in sklearn, other issues might present themselves. However, for now I've got my most-used pieces covered; keeping up will probably mostly entail refining serialize_sk. Of course, even when using JSON there is no protection from the fact that parameters might change between versions of scikit-learn. At least the JSONs stored with old versions are now transparent enough to be easily modifiable. Any suggestions and/or improvements are obviously more than welcome. I hereby also provide my version of Chris Wagner's script, as well as a Jupyter notebook.