Scikit-learn Pipeline Persistence and JSON Serialization Part II
This is a follow-up to this post.
In my last entry, I wrote about several hurdles on the way to replacing pickle with JSON for storing scikit-learn pipelines. While my previous solution was satisfactory for handling a class per file, storing an entire pipeline introduces more complexity than I previously assumed. In this follow-up, I will quickly illustrate one of these issues, and provide an effective solution.
Please note that all code is in Python 3.x, sklearn 0.17, and numpy 1.9.
Quick Recap
We left off using __dict__
representations for each of the scikit-learn
classes, converting their data structures (including those from numpy
) with a small script and storing them per pipeline item. This would make a final application look as follows:
vec = deserialize(DictVectorizer(), json.load(open('vec.json')))
svm = deserialize(LinearSVC(), json.load(open('clf.json')))
user_input = "hey can I get more examples"
grams = vec.transform(extract_grams(user_input, [2]))
print(svm.predict(grams))
# output ---------
[1]
The assumptions are that 1) your pipeline is quite small, so it’s not too convoluted to store their items separately, and 2) it has a static components, e.g. it will always use an SVM, and not do any preprocessing. If you’re interested in reproducibility only, this is good enough. For demos, however, flexibility can be important.
Drop-in Models
Let’s say we want to just allow selection of a trained model. The easiest way would be to store the pipeline in a dictionary, for example:
pipeline = {
"clf": NaiveBayes(),
"vec": DictVectorizer(),
}
It shouldn’t really matter what clf
is, as long as it has the same methods as all other sklearn
classes. Subsequently, our application can be reduced to the following:
pl = deserialize(Pipeline(), json.load(open('pipeline.json')))
user_input = "hey can I get more examples"
grams = pl['vec'].transform(extract_grams(user_input, [2]))
print(pl['clf'].predict(grams))
However, to achieve this, we would need to serialize the classes in a way that we can deserialize them to their initialized form. Hence, just storing them as their __dict__
representation is not enough.
Problem: Serializing Python Objects
How does one store a python object in a form that JSON can handle, and we can deserialize in our application? Remember that before, we set classes like so:
def deserialize(class_init, attr):
for k, v in attr.items():
setattr(class_init, k, sr.json_to_data(v))
return class_init
We already know how to set the attributes (with __dict__
), but we need a way to get a representation from a class object which we can use to initialize it. Python allows you to get a string name with __class__
:
vec = DictVectorizer()
print(str(vec.__class__))
print(vec.__class__.__name__)
print(vec.__module__)
# output ---------
"<class 'sklearn.feature_extraction.dict_vectorizer.DictVectorizer'>"
'DictVectorizer'
'sklearn.feature_extraction.dict_vectorizer'
As we can see from the output, the first returns a class object, and the second its name. However, we would need the full path in order to import it, which leaves us with having combine the latter two. From there, we could easily import and initialize it by string:
import sys
class_ = getattr(sys.modules[vec.__module__], vec.__class__.__name__)
new_vec = class_()
new_vec
# output ---------
DictVectorizer(dtype=<class 'numpy.float64'>, separator='=', sort=True,
sparse=True)
After, we can use setattr
again (like in the deserialize
function above) to return our settings. Just need to store them both in a format along with the __dict__
to pass to the deserializer. Something like:
import json
def serialize_class(cls_):
return sr.data_to_json({'mod': cls_.__module__,
'name': cls_.__class__.__name__,
'attr': cls_.__dict__})
def deserialize_class(cls_repr):
cls_repr = sr.json_to_data(cls_repr)
cls_ = getattr(sys.modules[cls_repr['mod']], cls_repr['name'])
cls_init = cls_()
for k, v in cls_repr['attr'].items():
setattr(cls_init, k, v)
return cls_init
cls_str = serialize_class(vec)
json.dump(cls_str, open('./vec_class.json', 'w'))
cls_js = json.load(open('./vec_class.json'))
deserialize_class(cls_js)
# output ---------
DictVectorizer(dtype=<class 'numpy.float64'>, separator='=', sort=True,
sparse=True)
Great! Now the classes can be used in a pipeline dictionary. As the script I provided in the previous post is recursive, these methods can be built in without much effort. However, while reading into these object serialization techniques I found an even better alternative (given that you don’t mind dependencies).
Conclusion and Package
So far I managed to manually convert most numpy
cases in scikit-learn’s modules, and store them in dictionaries for flexibility. However, I decided to sweep all of this off the table for jsonpickle. This package covers a lot more edge-cases with a way more extensive implementation. Quick demonstration:
import jsonpickle
vec_repr = jsonpickle.encode(vec)
vec_repr
# output ---------
'{"py/object": "sklearn.feature_extraction.dict_vectorizer.DictVectorizer",
"sparse": true, "sort": true, "separator": "=", "dtype":
{"py/type": "numpy.float64"}}'
And with a quick decode
we’re back to our old python storage format!
That’s it for now, if I encounter any more challenges there will be another follow-up. As before, I’ve written this up in a Jupyter notebook.