MsgPack vs JSON in Python: Preliminary Investigation

Earlier this year in the Falcon documentation I chanced upon MsgPack. It has a pretty snazzy website you can find here. Wow, look at that! It's both smaller and faster than JSON!

I've compiled here a complete picture of the uses and disuses for the MsgPack encoding scheme. This document was originally intended for personal reference, as the use of MsgPack is both intriguing and nuanced. Below are my notes.

A nominal amount of research brought up an article on Indiegamr entitled "MsgPack vs. JSON: Cut your client-server exchange traffic by 50% with one line of code". Wow, that sounds pretty good. The article gave the following benefits:

* It's smaller and uses fewer bytes
* It's faster to parse. The example alluded to the fact that a smaller representation means less data to read, write, and process per unit of actual information in our message. I thought that was a nifty idea.
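To put a number on the size claim, here is a minimal sketch that hand-encodes a tiny message following the MessagePack format spec and compares it with compact JSON. The `pack` helper and the sample message are illustrative assumptions, not the real `msgpack` library, and the helper covers only a small subset of the format.

```python
import json


def pack(obj):
    """Encode a small subset of Python values as MessagePack bytes.

    Handles only None, bool, ints in 0..127, strings under 32 UTF-8 bytes,
    and lists/dicts with fewer than 16 items. Enough for a size comparison.
    """
    if obj is None:
        return b"\xc0"                                   # nil
    if obj is True:
        return b"\xc3"                                   # true
    if obj is False:
        return b"\xc2"                                   # false
    if isinstance(obj, int) and 0 <= obj <= 0x7F:
        return bytes([obj])                              # positive fixint
    if isinstance(obj, str):
        raw = obj.encode("utf-8")
        if len(raw) < 32:
            return bytes([0xA0 | len(raw)]) + raw        # fixstr
    if isinstance(obj, list) and len(obj) < 16:
        return bytes([0x90 | len(obj)]) + b"".join(pack(x) for x in obj)
    if isinstance(obj, dict) and len(obj) < 16:
        body = b"".join(pack(k) + pack(v) for k, v in obj.items())
        return bytes([0x80 | len(obj)]) + body           # fixmap
    raise TypeError(f"unsupported value in this sketch: {obj!r}")


message = {"compact": True, "schema": 0}  # made-up sample message
as_json = json.dumps(message, separators=(",", ":")).encode("utf-8")
as_msgpack = pack(message)

print(len(as_json), len(as_msgpack))  # 27 18
```

For this particular message the saving is a third rather than a half, so the headline figure clearly depends on what the data looks like.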

The article also argues that MsgPack's slightly less readable encoding is a benefit. Considering readability is usually counted as a bonus of JSON, I found this argument amusing. If network eavesdropping is actually a problem then there must be better solutions. The strings found in MsgPack encodings, just like JSON encodings, are plainly represented. If data security is an issue I'd look into some cryptographic system. Hell, even gzipping it would be better than MsgPacking it. But I guess we are to consider this an offhand bonus of the project. I digress.
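A quick way to see that strings are not hidden: per the MessagePack spec, a short string is stored as a one-byte header followed by its raw UTF-8 bytes. The `pack_str` helper below is a hypothetical stand-in for a real encoder and handles only the short "fixstr" case.

```python
def pack_str(text):
    """Encode a short string as a MessagePack fixstr (under 32 UTF-8 bytes)."""
    raw = text.encode("utf-8")
    if len(raw) >= 32:
        raise ValueError("this sketch only handles fixstr")
    return bytes([0xA0 | len(raw)]) + raw  # header byte, then the text as-is


secret = "hunter2"  # made-up value
packed = pack_str(secret)

print(packed)                              # b'\xa7hunter2'
print(secret.encode("utf-8") in packed)    # True
```

Anyone sniffing the wire sees the text just as they would in JSON.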

A couple of links later I found myself on the HackerNews thread which featured that same article. Among the hubbub there were a few shining comments:

This one by cheald:

JSON's appeal is that it is both compact and human readable/editable. If you're willing to sacrifice all semblance of readability/editability, then sure, let's all do the binary marshaling format dance like it's 1983.

Additionally, if you're sending data to a browser, then you're cutting the knees out of native JSON.parse implementations (Internet Explorer 8+, Firefox 3.1+, Safari 4+, Chrome 3+, and Opera 10.5+). The copy claims "half the parsing time" (just because of smaller data size), but I'm exceptionally skeptical of those claims since this is just going to move the parser back into Javascript.

And this one by catch23:

Having actually used MsgPack, it's nice that's compact, but bad that it doesn't handle utf-8 properly. The reason data is smaller is because they're basically using less bits depending the value, eg if the numerical value is "5" then you can use 3 bits to represent the value whereas JSON will always use floats to represent integer values. If you know exactly what your data in the JSON might be, MsgPack is nice, otherwise it can be a pain in the butt if you're sending arbitrary data from users.
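The exact sizes in that comment are worth checking. Going by the MessagePack format spec, a small integer such as 5 takes one whole byte, and larger values step up to 2, 3, 5, or 9 bytes. JSON, being text, simply spends one byte per digit. The sketch below hand-encodes unsigned integers to show the steps; `pack_uint` is an illustrative helper, not the library's API.

```python
import json
import struct


def pack_uint(n):
    """Encode a non-negative integer using the smallest MessagePack form."""
    if n < 0:
        raise ValueError("this sketch only handles non-negative integers")
    if n <= 0x7F:
        return bytes([n])                    # positive fixint: 1 byte
    if n <= 0xFF:
        return struct.pack(">BB", 0xCC, n)   # uint 8: 2 bytes
    if n <= 0xFFFF:
        return struct.pack(">BH", 0xCD, n)   # uint 16: 3 bytes
    if n <= 0xFFFFFFFF:
        return struct.pack(">BI", 0xCE, n)   # uint 32: 5 bytes
    if n <= 0xFFFFFFFFFFFFFFFF:
        return struct.pack(">BQ", 0xCF, n)   # uint 64: 9 bytes
    raise OverflowError("too large for MessagePack uint 64")


# value, MessagePack bytes, JSON bytes
for value in (5, 200, 70_000, 5_000_000_000):
    print(value, len(pack_uint(value)), len(json.dumps(value)))
# 5 1 1
# 200 2 3
# 70000 5 5
# 5000000000 9 10
```

The width changes with the value, which is also why a number cannot simply be overwritten in place: replacing 5 with 200 makes the message one byte longer.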

Another commenter points out that HTTP headers are a fixed overhead, so shrinking the payload will not necessarily halve your traffic.
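The arithmetic behind that point is easy to sketch. The byte counts below are made-up numbers chosen for illustration, not measurements.

```python
def traffic_saved(header_bytes, json_bytes, msgpack_bytes):
    """Fraction of total bytes saved when only the body shrinks."""
    before = header_bytes + json_bytes
    after = header_bytes + msgpack_bytes
    return 1 - after / before


# Hypothetical request: 400 bytes of headers, a 200-byte JSON body,
# and a MsgPack body half that size.
print(round(traffic_saved(400, 200, 100), 3))  # 0.167
```

Halving the body here saves about 17% of the bytes on the wire. The saving only approaches 50% when the body dwarfs the headers.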

Another comment thread argues that the comparison is cheating, because the decoder in one particular library uses a technique that invalidates it.

Next we find a [letter](https://gist.github.com/frsyuki/2908191) from the creator of the project entitled "My Thoughts on Msgpack". Here are the interesting points:

It may not be the best choice for client-side serialization
The implementation of msgpack is "zero-copy"

Another oft-referenced article is that by Pinterest, who apparently encode information in MsgPack before storing it in Redis. All links I found referencing the article produced 404s; apparently Pinterest has removed that post. The Pinterest banner on the MsgPack website links only to Pinterest.com. But thank goodness for the Wayback Machine!

You can find the article here. It's an interesting read for those uninitiated in Redis and Memcached. In short, Pinterest uses MsgPack to store data objects in Memcached instead of using the possibly more expensive Redis. Apparently Redis objects require more memory than MsgPacked objects.

The Pinterest blog uses the streaming feature of MsgPack. They have no need for a separator and can simply append more MsgPack objects to a key in Memcached. They don't explain why they use this over JSON, but I'll take a stab at the reasons. MsgPack is a smaller representation than JSON, and size was their initial motivation for moving away from Redis objects. Pinterest also makes a reference to encoding/decoding speed. I need to compare JSON and MsgPack encoding using a zero-copy JSON library in order to understand where exactly the speed increase lies.

Supposing JSON and MsgPack encoding take about the same time at their fastest, JSON would require an extra step to compress the data. Unfortunately, compressing the entire list prohibits appending to it, so compression would need to be run on each list item individually. The list entries need to be above some threshold size before compression actually saves space. Furthermore, it's likely that compressing each list entry breaks the streaming capabilities of your library. Therefore you would need to ensure the length of the encoded data is stored at the beginning of each entry. This may be the default (I can't quite interpret the function description).
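The JSON-plus-compression alternative described above could look roughly like this. It is only a sketch under my own assumptions: the framing (a one-byte flag plus a four-byte length in front of every entry) is invented for illustration and is not what Pinterest or any particular library does. Each entry is compressed on its own, and only when compression actually makes it smaller.

```python
import json
import struct
import zlib

# Hypothetical frame header: 1-byte "compressed?" flag + 4-byte payload length.
HEADER = struct.Struct(">BI")


def append_entry(buffer, obj):
    """Append one JSON entry, compressing it only when that saves space."""
    raw = json.dumps(obj, separators=(",", ":")).encode("utf-8")
    packed = zlib.compress(raw)
    if len(packed) < len(raw):
        flag, payload = 1, packed
    else:
        flag, payload = 0, raw   # small entries grow under zlib, so keep them raw
    buffer.extend(HEADER.pack(flag, len(payload)) + payload)


def read_entries(buffer):
    """Yield every entry back, using each length prefix to find the next one."""
    offset = 0
    while offset < len(buffer):
        flag, size = HEADER.unpack_from(buffer, offset)
        offset += HEADER.size
        payload = bytes(buffer[offset:offset + size])
        offset += size
        yield json.loads(zlib.decompress(payload) if flag else payload)


store = bytearray()
append_entry(store, {"id": 1, "ok": True})      # tiny entry: stored uncompressed
append_entry(store, {"tags": ["pin"] * 200})    # repetitive entry: compressed
print(list(read_entries(store)))
```

Appending never touches earlier entries, which is the property the streaming approach relies on. The price is five extra bytes per entry plus the compression step.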

Pro/Cons

Con:

* Gzip + JSON is a better deal
* MsgPack uses a zero-copy method, and I don't know what the limitations of this are yet
* MsgPack encodes numbers with variable sizes, so data cannot be edited in place
* MsgPack loses JSON's readability
* Browser JSON parsers are already optimized in C, so a JS MsgPack parser cannot compete

Pro:

* MsgPack is actually pretty small
* MsgPack library support is pretty great. It was easy to use.
* Fast encoding/decoding with the 'zero-copy' technique, though this is also possible in JSON
* Streaming/append technique used by Pinterest

Testing with Python

I've written some basic tests, but I haven't written up the results just yet. Thanks.
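In the meantime, here is a rough idea of what a round-trip timing harness could look like. It is a generic sketch, not the tests mentioned above: the sample payload is made up, `time_round_trip` is a hypothetical helper, and the MsgPack half only runs if the third-party `msgpack` package is installed.

```python
import json
import timeit

try:
    import msgpack  # third-party package, assumed installed via `pip install msgpack`
except ImportError:
    msgpack = None


def time_round_trip(dumps, loads, obj, number=1000):
    """Return the seconds taken for `number` encode/decode round trips."""
    return timeit.timeit(lambda: loads(dumps(obj)), number=number)


# Made-up payload; real results depend heavily on the shape of the data.
sample = {"id": 42, "name": "example", "scores": list(range(50))}

print("json   ", time_round_trip(json.dumps, json.loads, sample))
if msgpack is not None:
    print("msgpack", time_round_trip(msgpack.packb, msgpack.unpackb, sample))
```

Timings from a harness like this vary by machine, library version, and payload, so they should be read as relative rather than absolute.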