Ask HN: Experience with Protocol Buffers

15 points · bcater · 16y ago · 10 comments

Does anyone here have experience with protocol buffers?

http://code.google.com/p/protobuf/

I need to move some data faster and with less parsing on either side of the transmission, and these seem like a good choice.

10 comments · 7 top-level

jokull · 16y ago · 2 in thread

This was just posted on HN:

http://msgpack.sourceforge.net/

Make sure you benchmark simple JSON. It might be enough.
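A rough way to run that benchmark, as a minimal sketch using only Python's standard library. The `record` payload and the helper names `roundtrip` and `time_roundtrip` are made up for illustration; swap in data shaped like what you actually send.

```python
import json
import timeit

# Hypothetical sample payload; substitute a record shaped like your real data.
record = {
    "id": 12345,
    "name": "example",
    "values": list(range(100)),
    "tags": ["a", "b", "c"],
}

def roundtrip(obj):
    """Serialize to JSON text and parse it back."""
    return json.loads(json.dumps(obj))

def time_roundtrip(obj, runs=1000):
    """Return the average number of seconds per encode/decode round trip."""
    total = timeit.timeit(lambda: roundtrip(obj), number=runs)
    return total / runs

if __name__ == "__main__":
    per_call = time_roundtrip(record)
    print(f"{per_call * 1e6:.1f} microseconds per round trip")
```

If the number is already small next to your network latency, a binary format may not buy much.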

bravura · 16y ago

Here are some benchmarks of PB vs JSON vs Thrift: http://bouncybouncy.net/ramblings/posts/json_vs_thrift_and_p...

I also have some benchmarks of Python deserialization of JSON, and talk about alternatives here: http://blog.metaoptimize.com/2009/03/22/fast-deserialization...

I encourage you to listen to jokull, and find out if JSON really is too slow for your needs. My advisor taught me that if you keep your data in an easy-to-read format, you're more likely to catch bugs in the output, merely because you have first-class tools for inspecting the file and are more likely to do so.

joeld42 · 16y ago

I find it easier to catch bugs with protobuf, because it can tell you if you're missing data, putting the wrong type in, or leaving out a required field. There is an ASCII version that is more readable than JSON, too.

That said, I agree with your overall sentiment; certainly do look at JSON as well. Protobuf is overkill for a lot of things, and JSON keeps things simpler.

jbooth · 16y ago · 1 in thread

Check out Avro, too. With both Protocol Buffers and Thrift, it's really hard to evolve schemas because you won't be able to read data written with an earlier version of the schema. Avro has the speed of binary while being flexible enough to read older data with later versions of the schema.

kleinsch · 16y ago

That's a misconception. If you design your schema properly, evolving it isn't a problem. We use protocol buffers (forced to because we're integrating with Google) and in the schema, everything is marked as optional so they can add or remove fields in future versions of the schema. This puts the onus on your code to properly handle missing fields, but that's the same problem you'll face with any schema that can be changed. Google has changed the schema multiple times and we've had periods where our code hasn't been updated yet. It works just fine.
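To make the "onus on your code" point concrete, here is a minimal sketch in plain Python. It does not use the protobuf API; decoded messages are modeled as dicts, and the field names and the `read_message` helper are hypothetical. The reader fills in defaults for fields an older writer never sent and skips fields it does not know about.

```python
# Defaults for fields that a writer on another schema version may omit.
# All field names here are hypothetical.
DEFAULTS = {"name": "", "retries": 3, "tags": ()}

def read_message(raw):
    """Merge a decoded message (a dict) over the defaults.

    Fields missing from `raw` get their default value; fields this
    reader does not know about are ignored.
    """
    return {key: raw.get(key, default) for key, default in DEFAULTS.items()}

# Written by an older schema version, before "retries" and "tags" existed.
old = read_message({"name": "job-1"})

# Written by a newer schema version; "priority" is unknown to this reader.
new = read_message({"name": "job-2", "retries": 5, "priority": 9})
```

Either message can be read by the same code, which is the property the all-optional schema is meant to give you.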

joeld42 · 16y ago

I've used this for some projects, with much success. In one case I used it as a data interchange format between a C++ app and some Python utilities. It worked very well for that, and the generated APIs were easy to work with, albeit a bit bulky.

In another case, more of an experiment, I used them to serialize game data and sent it using enet. This proved very flexible and easy to change/add things, and the packets were extremely compact.

Pros:

* Read/Write access to data from C++ or Python

* Generated APIs were easy to work with

* Very compact representation

* ASCII-dump version very useful for debugging

* More error checking than something like JSON (e.g. it tells you if you leave out a required field)

Cons:

* Adds some build steps, can be more of a headache to maintain (compared to JSON or something)

* API can't parse ASCII version, bad for config data or other stuff that might want to be human readable (vs. XML or JSON)

* Generally requires copying your data into the protobuf struct, and then packing, rather than going straight from your "native" format into a packed buffer.

* Adds a bit more complexity

* Not as lightweight as JSON

For what you're doing, I would recommend them.

They're great for "structure"-style data, a little weird for array-style. For example, one of the things I was storing was a 4x4 matrix, and I resorted to making a struct with 16 members such as m_00, m_01, etc., which worked fine and stored it compactly but was a little weird. I don't think there's a way to have a float[16] or something like that. I could be wrong; maybe there's a better way to do this.
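One possible alternative: the .proto language has `repeated` fields, so a single `repeated float` field holding 16 values could stand in for the 16 named members (there is no fixed-length float[16], so the length would be a convention your code checks). The sketch below is plain Python with no protobuf dependency; it only shows how a flat list of 16 values maps to a row-major 4x4 matrix, and the helper names are made up.

```python
def flatten(matrix):
    """Flatten a 4x4 nested list into 16 values, row by row."""
    flat = [value for row in matrix for value in row]
    if len(flat) != 16:
        raise ValueError("expected a 4x4 matrix")
    return flat

def at(flat, row, col):
    """Read element (row, col) from a row-major flat list of 16 values."""
    return flat[row * 4 + col]

def unflatten(flat):
    """Rebuild the 4x4 nested list from 16 row-major values."""
    return [flat[r * 4:(r + 1) * 4] for r in range(4)]

# Identity matrix as an example.
identity = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
flat = flatten(identity)
```

The flat list is what would be copied into the repeated field before packing.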

Generally, these days I use one of three formats. I am very happy to have outgrown XML.

protobuf -- for hierarchical, nested data, if it needs to be compact and accessed from different languages

JSON -- for quick and dirty stuff, when the format needs to be flexible (or when I need to use JavaScript)

GTO -- for large sets of structured data. (www.opengto.org)

MichaelGG · 16y ago

We were using .NET's WCF messaging system, but wanted a faster/smaller format. http://code.google.com/p/protobuf-net/ let us keep our code the same, while using protobuf for the wire format. Worked quite well.

Another approach to consider is using a text format (XML, JSON), then running it through fast compression like QuickLZ. This has the benefit of not having to change the program much more than a call to compress/decompress.
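A minimal sketch of that text-plus-compression approach in Python. QuickLZ is not in the standard library, so this swaps in `zlib` at its fastest level as a stand-in; the `pack` and `unpack` helper names and the sample records are made up.

```python
import json
import zlib

def pack(obj):
    """Encode obj as JSON text, then compress the UTF-8 bytes."""
    text = json.dumps(obj)
    return zlib.compress(text.encode("utf-8"), 1)  # level 1 favors speed over ratio

def unpack(blob):
    """Decompress and parse what pack() produced."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))

# Hypothetical repetitive payload, the kind of data that compresses well.
records = [{"id": i, "status": "ok"} for i in range(200)]
blob = pack(records)
print(len(json.dumps(records)), "bytes as JSON,", len(blob), "bytes compressed")
```

The rest of the program keeps working with ordinary dicts and lists; only the send and receive paths change.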

pkc · 16y ago

Protocol Buffers will work fine and their documentation is very clear. Here are my findings from experience:

* Serializing data is OK, but parsing takes quite a bit of time, especially for large requests (I am talking milliseconds).

* PBs always require a copy from your internal app data into their structures. I couldn't find a way to avoid that.

* They use variable-length encoding, which might be a good option if a large percentage of your data is integers. From our experience, don't use it if you are sending within your corporate network, as packing and parsing take more time than you save in data transferred. They might be a good option if you are sending data across slow networks.
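For reference, the variable-length integer encoding works roughly like this: each byte carries 7 bits of the value, lowest bits first, and the high bit says whether more bytes follow. Below is a minimal pure-Python sketch for non-negative integers. It is not the protobuf library's code, and the function names are made up.

```python
def encode_varint(n):
    """Encode a non-negative int as a base-128 varint (low 7 bits first)."""
    if n < 0:
        raise ValueError("this sketch handles non-negative integers only")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit set: more bytes follow
        else:
            out.append(byte)         # last byte: continuation bit clear
            return bytes(out)

def decode_varint(data):
    """Decode one varint from the start of data; return (value, bytes_consumed)."""
    value = 0
    shift = 0
    for i, byte in enumerate(data):
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, i + 1
        shift += 7
    raise ValueError("truncated varint")
```

Small values fit in one byte (`encode_varint(1)` is `b"\x01"`) and 300 takes two (`b"\xac\x02"`), but a full 32-bit value needs five bytes. That is why the size savings depend on the data being mostly small integers.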

Some metrics show that Thrift performs better than PBs. Thrift also offers a choice of protocols. If performance is the prime criterion, JSON plus zipping should be a good option; it also avoids the intermediate step of generating marshaling code.

JoelPM · 16y ago

Protos will work fine. Thrift will also work fine. ________ (insert other binary format) will also work fine.

As long as there are libraries for the languages you're using it's not a big deal. I'd recommend solving the problem and moving on - in the serialization format wars the real victim is productivity.

pwpwp · 16y ago

If your data layout is fairly static, PBs are good.

What I did for an app was encode a kind of JSON in PBs:

http://pwpwp.blogspot.com/2009/08/storing-json-as-protocol-b...
