Protocol Buffers v3.0.0 released

(github.com)

278 points · Rican7 9y ago · 123 comments

84 comments · 14 top-level

amluto9y ago· 18 in thread

They added a feature that impressively fails to interoperate with the rest of the world.

> Added well-known type protos (any.proto, empty.proto, timestamp.proto, duration.proto, etc.). Users can import and use these protos just like regular proto files. Additional runtime support are available for each language.

From timestamp.proto:

  // A Timestamp represents a point in time independent of any time zone
  // or calendar, represented as seconds and fractions of seconds at
  // nanosecond resolution in UTC Epoch time. It is encoded using the
  // Proleptic Gregorian Calendar which extends the Gregorian calendar
  // backwards to year one. It is encoded assuming all minutes are 60
  // seconds long, i.e. leap seconds are "smeared" so that no leap second
  // table is needed for interpretation.

Nice, sort of -- all UTC times are representable. But you can't display the time in normal human-readable form without a leap-second table, and even their sample code is wrong in almost all cases:

  //     struct timeval tv;
  //     gettimeofday(&tv, NULL);
  //
  //     Timestamp timestamp;
  //     timestamp.set_seconds(tv.tv_sec);
  //     timestamp.set_nanos(tv.tv_usec * 1000);

That's only right if you run your computer in Google time. And, damn it, Google time leaked out into public NTP the last time there was a leap second, breaking all kinds of things.

Sticking one's head in the sand and pretending there are no leap seconds is one thing, but designing a protocol that breaks interoperability with people who don't bury their heads in the sand is another thing entirely.

Edit: fixed formatting

justinsaccount9y ago

It's interesting that you refer to a huge amount of planning and engineering as "sticking your head in the sand".

https://googleblog.blogspot.com/2011/09/time-technology-and-...

I think that the approach everything else uses is the "sticking your head in the sand approach". You basically pretend that there is no problem and that time is perfectly accurate, up until you have a minute with 59 or 61 seconds.

Just because suddenly trying to handle "Oh shit, everything is off by an entire second!" is the approach everything else uses doesn't mean it is the right approach.

amluto9y ago

No, I agree they did a bunch of good engineering for internal use.

But they didn't keep it internal properly -- the real world has leap seconds for better or for worse, and this library really does stick its head in the sand and pretend they don't exist. Google specifically says that this library is designed to be "the foundation of Google's new API platform". Yet they give a data type (as a headline feature) and a sample usage that is simply incorrect if you don't set your system to work using Google's "leap smear". It also seems quite likely that it'll result in blatantly wrong human-readable strings. I'll even quote a string from timestamp.proto [1]:

9999-12-31T23:59:59Z

That looks like an RFC 3339 string, and it even has the 'Z' suffix, which means it's UTC, which has an agreed-upon international definition. But this is not a valid UTC time. It's a time in a different time zone that Google made up.

Google easily could have done better: publish a spec for a different kind of time like:

9999-12-31T23:59:59s

where the little 's' means 'smeared'. Supply a serializer and deserializer for that. Now there's no ambiguity.

[1] https://github.com/google/protobuf/blob/master/src/google/pr...

wongarsu9y ago

>You basically pretend that there is no problem and that time is perfectly accurate, up until you have a minute with 59 or 61 seconds.

Time is perfectly accurate, including all the minutes with 59 or 61 seconds. UTC is perfectly defined as atomic time (TAI) with an offset to keep it within 0.9 seconds of UT1 (time as measured by the rotation of the earth). Every time we increment or decrement this offset, this leads to leap seconds. But since 23:59:60 is a valid time (and distinct from 00:00:00 on days with leap seconds), there is no ambiguity here.

The problem here is how most computers handle this: introducing ambiguity by setting the clock backwards or forwards one second, instead of accounting for the fact that not all minutes have 60 seconds. Google did a pragmatic fix for their use case by squeezing leap seconds into the surrounding seconds, stretching them. It works for them, but now their "seconds" are not actual seconds anymore.
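To illustrate the point about how most software handles this, here is a minimal sketch using Python's standard `datetime` type. The helper name `can_represent` is made up for this example. 2015-06-30T23:59:60Z was a real UTC leap second, yet the type refuses a `second` value of 60, so that instant has no direct representation.

```python
from datetime import datetime


def can_represent(year, month, day, hour, minute, second):
    """Return True if datetime accepts these wall-clock fields."""
    try:
        datetime(year, month, day, hour, minute, second)
        return True
    except ValueError:
        # datetime only allows seconds in the range 0..59
        return False


# The leap second at the end of June 2015 versus the second before it.
leap_second_ok = can_represent(2015, 6, 30, 23, 59, 60)
ordinary_ok = can_represent(2015, 6, 30, 23, 59, 59)
print(leap_second_ok, ordinary_ok)
```

Other libraries and languages make different choices here; this only shows one common behaviour.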

wongarsu9y ago

It's fine as a timestamp implementation, and great for many uses. But I think there's a big problem with the documentation. They start off by saying it's "at nanosecond resolution in UTC Epoch time", and then they go on to explain how it uses a completely different encoding that is neither compatible with UTC nor with TAI (atomic time which ignores leap seconds). And then they jump ahead to sample code which again pretends that the timestamp is UTC.

No matter whether you like "google time" or not, this is horrible documentation. They are glossing over an issue which should be marked with big red letters.

haberman9y ago

The question of how to reconcile leap-second-smearing systems with other systems is an interesting and important one. I'm not sure that timestamp.proto changes this issue: prior to timestamp.proto systems would still communicate using UNIX time (smeared or non-smeared) using plain integer or double seconds. timestamp.proto just provides a structure for storing UNIX time with greater range and precision than a single integer or floating point number can provide.

What I'm trying to say is that I think this is a smearing systems vs. non-smearing systems issue, and not so much a timestamp.proto issue. timestamp.proto mentions smearing but really it's just a vehicle for storing the seconds/nanos from the system clock, with whatever semantics that system clock uses. Because in practice systems don't give you access to both the smeared and non-smeared values; you get whatever the system gives you. The remarks about being leap-second-ignorant apply whether the leap second is being smeared or repeated.
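A rough sketch of the "vehicle for seconds/nanos" idea, in Python rather than the C++ of the quoted sample. The `Timestamp` dataclass and `from_unix_nanos` are hypothetical stand-ins written for this example, not the generated protobuf class. The code just splits whatever the system clock reports; whether that clock is smeared or not is invisible at this level.

```python
import time
from dataclasses import dataclass


@dataclass
class Timestamp:
    # Hypothetical stand-in for the two fields discussed above.
    seconds: int
    nanos: int


def from_unix_nanos(total_nanos: int) -> Timestamp:
    """Split a nanosecond count since the Unix epoch into seconds and nanos."""
    seconds, nanos = divmod(total_nanos, 1_000_000_000)
    return Timestamp(seconds, nanos)


# Whatever the local clock says, smeared or not, goes in unchanged.
now = from_unix_nanos(time.time_ns())

# A fixed value so the split is easy to check by eye.
example = from_unix_nanos(1_469_750_400_123_456_789)
print(example)
```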

Google implemented leap-second smearing in 2011, before the big push towards cloud. So the need to communicate sub-second timestamps between internal Google systems and external systems was probably not so much on people's minds. But these days we're releasing a bunch of APIs, and sub-second timestamps might become a more important issue for some of them.

So I think this issue is worth discussing further, and I opened an issue on GitHub to track it: https://github.com/google/protobuf/issues/1890

Thanks for the feedback.

jschwartzi9y ago

This is only an issue if you use the Timestamp to represent a human-readable time. There are more uses for timestamping than for display to a human operator. For example, one might use a timestamp in a software system to detect the passage of time, as in the use of a monotonic clock. In a real-time system you would ignore the presence of leap seconds because you will never examine the timing of your system relative to a Gregorian calendar. Rather, you just want to make sure that the station-keeping engine on your satellite burns for exactly 250 milliseconds, and leap seconds are of no use in that application.

amluto9y ago

If you use Google's timestamp type to burn for 250ms, you might end up with 250*86401/86400 ms. That's not a fantastic outcome.
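A quick worked version of that arithmetic, assuming (as the figure above implies) that one leap second is spread evenly across a whole day, so that 86400 smeared seconds cover 86401 SI seconds. Actual smear windows may be shaped differently, and the function name is made up for this sketch.

```python
SI_SECONDS_IN_LEAP_DAY = 86_401   # a day that contains one leap second
SMEARED_SECONDS_IN_DAY = 86_400   # what the smeared clock counts that day


def si_duration_ms(smeared_ms: float) -> float:
    """SI milliseconds that elapse during `smeared_ms` of smeared time."""
    return smeared_ms * SI_SECONDS_IN_LEAP_DAY / SMEARED_SECONDS_IN_DAY


burn_ms = si_duration_ms(250)
error_us = (burn_ms - 250) * 1000  # overshoot in microseconds
print(burn_ms, error_us)
```

Under that assumption the overshoot is a little under 3 microseconds on a 250 ms interval.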

wahern9y ago

I think you have it exactly backwards, if I understand things correctly.

It _seems_ like their "UTC Epoch time" is the same thing as POSIX time, but the Google engineer's terminology is all fubar. The reliance on the Proleptic Gregorian Calendar is further proof as that's a reference to a specific algorithm for calculating calendar dates.

POSIX time says that there are precisely 86400 "seconds" per day, which I think implies the same thing as saying there being precisely 60 "seconds" per minute. The logical consequence is, of course, that in neither case is "second" referring to the SI second.
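The 86400-per-day rule is easy to see with Python's `calendar.timegm`, which converts a UTC calendar tuple to POSIX time. A small sketch: 2015-06-30 ended with a leap second, so 86401 SI seconds elapsed that day, but the POSIX timestamps of the two midnights still differ by exactly 86400.

```python
import calendar

# Midnight UTC at the start and end of a day that contained a leap second.
start = calendar.timegm((2015, 6, 30, 0, 0, 0))
end = calendar.timegm((2015, 7, 1, 0, 0, 0))

# POSIX time counts this day as 86400 "seconds" regardless.
posix_day_length = end - start
print(posix_day_length)
```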

Once you get over the fact that we're discussing different units of time, then you can see that POSIX time is _perfect_ for recording and manipulating civil calendar time. For the purposes of calendar manipulation, you rarely if ever need to know elapsed time in SI-unit seconds. All you care about is easily calculating past and _future_ calendar information. Your power company and credit card companies don't bill you by SI seconds, they bill you by the hour, day, week, or month.

Conversely, in those situations where you want accurate and precise SI-second measurements, you rarely if ever want to convert or display that data in terms of calendar time. When SpaceX sends a rocket into space, the view screen shows elapsed seconds since launch, not elapsed seconds since lunch. That's a big difference.

Interestingly, in neither case do leap seconds matter! They're irrelevant. Leap seconds play no part in either TAI or POSIX time.

There are some cases where you want both pieces of information, but I think it's usually a mistake to conflate them and try to shoehorn them into the same units. That misguided practice is behind all the anxiety about leap seconds in UTC time.

It's also worth noting that as clocks become increasingly precise and accurate that the whole leap second thing will fade away. UTC time is based on the fiction that there's an abstract, universal clock in the world that is measurable in SI seconds. There isn't. At some point the needs of routine industrial measurements will enter the realm where relativity governs, at which point the fiction will be laid bare. Calendar time, of course, doesn't rely on that fiction.

The move to uncouple civil time from solar time is totally misguided, IMO, and only exacerbates the improper way that software engineers conflate the purpose and function of various time measurements.

cbsmith9y ago

I've always used uint64s for that. Why would you need a distinct type?

icedchai9y ago

It's a serialization format containing seconds and nanoseconds. You can put whatever you want in there, including true (non-Google) UTC time, right? This seems more like a documentation problem than an actual problem with Protobuf.

jhspaybar9y ago

It saddens me that this is the top comment. It's complete and total FUD unrelated in any way to what Proto is, and to boot, it's an optional type, provided if you want it, but otherwise not forced to be used in any way! Scroll down the page for much more worthwhile discussions of Proto.

Dylan168079y ago

A tangent, perhaps. But it's not FUD.

cbsmith9y ago

There's really no reason you can't provide your own timestamp structure, or your own timestamp transformation logic...

lmm9y ago

I'm glad they're willing to break compatibility to push their approach, because I think it's a better one. UTC with leap seconds is the worst of all possible worlds - not suitable for human time, not suitable for system time either - as perennial leap second bugs in such high-profile projects as the linux kernel demonstrate. Everyone seems to have agreed for years that basing system time on something without leap seconds would be better - whether that be leap smears or TAI - but no-one bothers to take action.

brazzledazzle9y ago

Regarding the leaking of NTP, are you talking about Systemd's default pointing at Google's NTP servers or some other event?

madgar9y ago

> designing a protocol

It's not a full protocol. It's a data type for a serialization library. You can write your own data types and they serialize just as well as the built-in types.

> that breaks interoperability

Wait, what was "broken" here? What was working before that isn't with this new release? What does this inclusion of a utility data type in a serialization library break that previously was intact?

zxv9y ago

Does this depend on use of Google's time servers?

The dependence on "smeared" leap seconds sure sounds like a dependence on such a time server.

Ouch.

Nullabillity9y ago

I can see caring about leap seconds right now, but a few seconds back or forth in the past probably won't matter very much.

manish_gill9y ago· 13 in thread

If someone better informed than me can please explain - where and why would something like Protocol Buffers be useful?

wwalser9y ago

Imagine working on a team that wants to move quickly but whose output is both a product and an API that's consumed by multiple other teams. The product you are building uses said API, but so do other teams. Your code needs to be stable enough to support these other teams needs (an API which doesn't change under them) but you also want to be able to make changes to your own application quickly, thus needing to change the API regularly.

A reasonable move is to version said API and have an ops team that ensures that all in-use versions of the API stay running. Some consumers will be on the bleeding edge, your team's application for example while others will lag behind.

Using proto* in this case is a reasonable move because you gain multiple benefits, performance being perhaps the least important in this case. Having a defined schema for your API provides some level of natural documentation for the API. Code generation allows your team to publish trusted client libraries for multiple languages.

I'll specifically call out client libraries since I've seen it make a dramatic difference in organizational efficiency, mostly to do with team to team trust levels. Without a client library the testing situation becomes a significant burden, read up on contract testing. When the team that's publishing an API also creates the client that most directly calls that API, the client library is the testing surface instead of every consumer of the API needing to test the API itself for regressions.

zellyn9y ago

We use them internally at Square for our RPC mechanism ("Sake", similar to "Stubby", Google's internal RPC mechanism), for our Kafka-based logging/metrics/queue infrastructure, and for defining external JSON APIs. We're in the process of switching from Sake to GRPC, which also uses Protobufs as its payload format (although you can sub in different transports).

zellyn9y ago

I should mention that we use Ruby, Java, and Go. So protobufs are also the "lingua franca" for cross-language communication.

dkopi9y ago

> Protocol buffers are Google's language-neutral, platform-neutral, extensible mechanism for serializing structured data – think XML, but smaller, faster, and simpler. You define how you want your data to be structured once, then you can use special generated source code to easily write and read your structured data to and from a variety of data streams and using a variety of languages.

From https://developers.google.com/protocol-buffers/

manish_gill9y ago

Yes, I read this. It tells me what Protocol Buffers are: faster, smaller, XML-like data structures for serialisation. What are the most common use cases though? And do people only use them for performance reasons?

sigil9y ago

I'll give you an example.

I used protobuf as the output format for a web crawler. Workers read urls and sequentially write entire HTTP responses to disk. [0] Sure, you could serialize the responses to JSON, but the overhead of representing things like binary image data as escaped unicode strings was prohibitive in my case.
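To put a rough number on the JSON overhead for binary payloads, here is a small sketch with a synthetic 3072-byte payload (the payload and variable names are invented for the example). JSON has no raw byte type, so the bytes must be either escaped as a string or base64-encoded; both are larger than the raw bytes, which a binary format can carry as-is.

```python
import base64
import json

# Synthetic binary payload: every byte value, repeated (3072 bytes).
payload = bytes(range(256)) * 12

# Option 1: treat each byte as a character and let JSON escape it.
escaped = json.dumps(payload.decode("latin-1"))

# Option 2: base64-encode first, then store as a JSON string.
b64 = json.dumps(base64.b64encode(payload).decode("ascii"))

print(len(payload), len(b64), len(escaped))
```

How bad the escaped form gets depends on the byte distribution; real image data will differ from this synthetic payload.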

"Why not BSON?" Well, schemas can be nice when performance matters. Instead of solving a parsing problem at runtime, a C/C++ reader can contain a compiler-optimized deserializer for a given protobuf schema. It's almost like directly reading and writing an array of C structs, except protobuf is architecture-independent, and you can add new fields without breaking old readers.

There are plenty of reasons to not use protobuf. I particularly disliked the code generation step for C/C++. That makes even less sense in a language like Python, and yet that's exactly what the official python protobuf implementation from Google does (did?). I wrote a python protobuf library on top of a C protobuf library that avoids codegen: https://github.com/acg/lwpb

[0] See the ARC format used by the Internet Archive for a similar (and imo clunkier) solution. http://crawler.archive.org/articles/developer_manual/arcs.ht...

phamilton9y ago

For me there are three main advantages: schema, performance and code generation.

Having a strict schema makes it a lot easier to maintain applications in a distributed system. Parsing protobuf is much faster than something like JSON. The multitude of code generators for protobuf make it really simple and easy to use multiple languages on the same data structures.

lordnacho9y ago

I used it in a trading system because it's a compact scheme for sending data across networks. It's also quite fast, and there's support for various languages. So you can have a feed handler blasting out prices using a C++ implementation, with a GUI drawing a chart written in C#.

arnarbi9y ago

Serializing data for RPC, network protocols or storage, description and serialization of configuration, serializable state, serializing complex types for cryptographic signing, etc.

Why is it useful? The schema both documents the data structure and allows mappings to natural APIs in many different languages. Parsers and encoders are generated for you, and are fast.

NikhilVerma9y ago

At Badoo we use them to have a unified API for all of our platforms (Web, Mobile Web, Android, iOS, Windows Phone etc). This would not have been possible without something like ProtoBuf.

vtalwar9y ago

Nikhil, would love to know more about your architecture.

nawitus9y ago

When JSON is not fast/lightweight enough.

rainhacker9y ago

check this out: http://google-opensource.blogspot.com/2008/07/protocol-buffe...

zellyn9y ago· 8 in thread

- removing optional values is actually quite nice. In practice, I end up checking for "missing or empty string" anyway.

- the "well-known types" boxed primitive types essentially add optional values back in. And depending on your language bindings, may look the same.

- extensions are still allowed in proto3 syntax files, but only for options - since the descriptor is still proto2. It seems odd to build a proto3 that couldn't represent descriptors.

- I still don't understand the removal of unknown fields. Reserialization of unknown fields was always the first defining characteristic of protobufs I described to people. I actually read many of the design/discussion docs internally when I worked at Google, and I still couldn't figure this one out. Although it's certainly simpler…

- Protobufs are the "lifeblood" (Rob Pike's words) of Google: the protobuf team is working to get rid of significant Lovecraftian internal cruft, after which their ability to incorporate open source contributions should improve dramatically.

tantalor9y ago

> removing optional values

Slight correction: optional values are not removed. Quite the opposite; the "optional" keyword is removed because now all fields are optional. It is actually required values which were removed.

zellyn9y ago

True. But when using them, it feels like every field is "present" and you don't have to worry about the "optional, missing" case.

honkhonkpants9y ago

You're both right. What has been removed is the concept of presence altogether.

teacup509y ago

> - removing optional values is actually quite nice. In practice, I end up checking for "missing or empty string" anyway.

I feel the opposite; this greatly reduces the utility of protobuf.

Previously, I could trust that if parsing succeeded, then I had a guarantee of a populated data structure.

Now, I have to check each field individually, in manually written code, to verify that no required fields are missing.

That's really lame, and a huge step backwards.

smallnamespace9y ago

> I could trust that if parsing succeeded, then I had a guarantee of a populated data structure

Required fields have actually bitten Google more than once and were increasingly being considered harmful.

A canonical example is that you add a required field, and then update binaryA in production (which receives messages from binaryB), which immediately crashes or errors out because the new field is missing.

So practically speaking, you can never add required fields to any message where you can't guarantee binary version syncing amongst all instances of the message-dependent services. At scale, this is essentially operationally impossible.
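The failure mode above can be mimicked with a toy model. This is not the protobuf library; messages are plain dicts, and `parse`, `MissingRequiredField`, and the field names are all invented for the sketch. The reader (binaryA) has been updated to require a new field, while the sender (binaryB) is still the old version.

```python
class MissingRequiredField(Exception):
    """Raised when a message lacks a field the reader treats as required."""


def parse(message: dict, required: set) -> dict:
    """Toy parser: reject the whole message if any required field is absent."""
    missing = required - set(message)
    if missing:
        raise MissingRequiredField(sorted(missing))
    return message


# Old binaryB has never heard of "region" and does not send it.
message_from_old_sender = {"id": 1}

# New binaryA was deployed with "region" marked as required.
try:
    parse(message_from_old_sender, required={"id", "region"})
    rejected = False
except MissingRequiredField:
    rejected = True

# Had "region" been optional, the same message would still parse.
accepted = parse(message_from_old_sender, required={"id"})
print(rejected, accepted)
```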

And if you're not running an RPC-based service architecture, then why are you using protos anyway?

cbsmith9y ago

In practice, to have decent compatibility as revisions changed, you really had to minimize the use of "required" fields anyway. While I agree it was sometimes nice to be able to avoid having to worry about it, in practice protobuf parsing imposes a very minimal set of constraints on data types. A successful protobuf parse was not nearly enough to ensure you had data integrity. I've run into more than a few cases of developers using the wrong protobuf (v2) definition and not realizing their successful parse was still wrong.

peq9y ago

I agree. In particular in languages without null you will have a lot of Option types in the mapping. You no longer can generate useful type definitions from the proto spec.

madgar9y ago

> Now, I have to check each field individually, in manually written code, to verify that no required fields are missing.

You always had to check the individual fields for the zero value. A required field in a proto2 message can be set but also have the default value and pass initialization.

rdtsc9y ago· 6 in thread

How does this compare or in general why would you pick this vs newer formats like Cap'n'proto or FlatBuffers?

From FlatBuffers overview I see this comparison:

---

Protocol Buffers is indeed relatively similar to FlatBuffers, with the primary difference being that FlatBuffers does not need a parsing/ unpacking step to a secondary representation before you can access data, often coupled with per-object memory allocation. The code is an order of magnitude bigger, too. Protocol Buffers has neither optional text import/export nor schema language features like unions.

---

So are the newer ones useful mostly when serialization vs deserialization speed matters (https://google.github.io/flatbuffers/) ?

cbsmith9y ago

Also when you want to memory map a file/have live objects in shared memory, or in general have your in-memory & serialized structures be the same.

jokoon9y ago

I don't know, but I tried using protocol buffers once for Mapbox vector files, and the resulting C++ header was huge. It had templates and all sorts of things, something like more than 1000 lines.

jackmott9y ago

Cap'n'proto is more or less abandoned I believe. But it and the flatbuffer approach give very fast serialization and deserialization (essentially zero time), although you pay a cost when you later access data, because the values you need are extracted on demand from the raw bytes.

I'm not sure it would often make much sense overall.
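The "extract on demand from the raw bytes" trade-off can be sketched with Python's `struct` module. The fixed record layout here is an assumption made up for illustration; it is not the real FlatBuffers or Cap'n Proto wire format, both of which are more involved.

```python
import struct

# Toy layout (hypothetical): little-endian u32 id, f64 price, u32 quantity.
RECORD = struct.Struct("<IdI")
QUANTITY_OFFSET = 12  # 4 bytes of id + 8 bytes of price

buf = RECORD.pack(7, 101.25, 300)


def read_quantity(view: memoryview) -> int:
    """Read one field straight out of the buffer, with no parse step."""
    return struct.unpack_from("<I", view, QUANTITY_OFFSET)[0]


def parse_all(view: memoryview) -> tuple:
    """The unpack-everything-up-front alternative."""
    return RECORD.unpack_from(view)


view = memoryview(buf)
quantity = read_quantity(view)   # cost is paid here, per access
everything = parse_all(view)     # cost is paid once, for all fields
print(quantity, everything)
```

If only a few fields are ever read, the on-demand style avoids most of the work; if every field is read many times, the per-access cost adds up.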

dwrensha9y ago

> Cap'n proto is more or less abandoned I believe

As maintainer of capnproto-rust, I beg to differ. :)

Cap'n Proto is indeed actively maintained, and here at Sandstorm we depend on it every day as a core piece of our infrastructure.

ocdtrekkie9y ago

I would be very hesitant to call Cap'n Proto "abandoned". The Cap'n Proto developer is actively building a platform on top of it, and implements features in it as necessary, and as far as I've seen, actively works with pull requests for other features as well.

https://github.com/sandstorm-io/capnproto/commits/master https://github.com/sandstorm-io/capnproto/pulse/monthly

cbsmith9y ago

What makes you think it is more or less abandoned?

skybrian9y ago· 5 in thread

Sadly the JSON format they chose isn't actually suitable for high-performance web apps. Web developers who use protobufs will continue to get by with various nonstandard JSON encodings.

positr0n9y ago

Why isn't it suitable? (I've never used protobufs)

skybrian9y ago

The fields are indexed by field names (converted to lower camel case) instead of tag numbers. It's great for readability, but it's a lot more verbose, particularly for repeated fields.
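A small sketch of the verbosity difference for a repeated field. The schema, field names, and tag numbers are hypothetical, and the tag-keyed form is just one possible nonstandard encoding, not a format defined by protobuf.

```python
import json

# Hypothetical message schema: field name -> tag number.
FIELD_TAGS = {"customerAccountId": 1, "displayName": 2, "isActive": 3}

records = [
    {"customerAccountId": i, "displayName": f"user{i}", "isActive": True}
    for i in range(100)
]


def by_name(items: list) -> str:
    """Encode with lowerCamelCase field names as keys."""
    return json.dumps(items, separators=(",", ":"))


def by_tag(items: list) -> str:
    """Encode with tag numbers as keys instead of names."""
    renamed = [{str(FIELD_TAGS[k]): v for k, v in item.items()} for item in items]
    return json.dumps(renamed, separators=(",", ":"))


named_size = len(by_name(records))
tagged_size = len(by_tag(records))
print(named_size, tagged_size)
```

The field names are repeated once per element, so the gap grows with the length of the repeated field.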

detaro9y ago

What characteristics of a JSON format would be important?

the_duke9y ago

Why would you use JSON in a high performance context anyway?

skybrian9y ago

Because the code is running in a browser.

Browsers do support binary data using typed arrays but for some reason this isn't commonly used yet. Compatibility, maybe?

JoachimSchipper9y ago· 4 in thread

This looks like a nice evolution.

It's a pity that the "deterministic serialization" gives so few guarantees; I have worked on at least one project that really needed this.

(Basically, we wanted to parse a signed blob, do some work, and pass the original data on without breaking the signature; unfortunately, this requires keeping the serialized form around, since the serialized form cannot be re-generated from its parsed format.)

pherl9y ago

The main concern that the deterministic serialization isn't canonical is due to the unknown fields. As string and message type share the same wire type, when parsing an unknown string/message type, the parser has no idea whether to recursively canonicalize the unknown field.

The cross-language inconsistency is mainly due to string field comparison performance, i.e. Java/ObjC use UTF-16 encoding, which orders strings differently from UTF-8 because of surrogate pairs.
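The ordering difference is easy to reproduce. In this sketch, two one-character strings are sorted once by their UTF-8 bytes and once by their UTF-16 code units (big-endian bytes compare the same way as code units). U+1F600 lies outside the Basic Multilingual Plane, so UTF-16 stores it as a surrogate pair starting with 0xD83D, which sorts below 0xFF5E.

```python
bmp_char = "\uff5e"        # U+FF5E, a single UTF-16 code unit
astral_char = "\U0001f600"  # U+1F600, a surrogate pair in UTF-16

strings = [bmp_char, astral_char]

# UTF-8 byte order: EF BD 9E sorts before F0 9F 98 80.
utf8_order = sorted(strings, key=lambda s: s.encode("utf-8"))

# UTF-16 code unit order: D83D DE00 sorts before FF5E.
utf16_order = sorted(strings, key=lambda s: s.encode("utf-16-be"))

print(utf8_order == utf16_order)
```

So a runtime that sorts by UTF-16 code units and one that sorts by UTF-8 bytes can emit the same entries in different orders.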

Feel free to open an issue on the GitHub site asking for canonical serialization with your use case. We may change the deterministic serialization to give stronger guarantees (e.g. cross-language consistency) or add another API for canonical serialization.

JoachimSchipper9y ago

This was years ago; I'd feel bad asking you to do a lot of work to support one niche use case in a research project that never quite made it to market. And protobufs ended up saving us quite a bit of development work, even if keeping the blob around is Wrong in a moral sense.

(You can find the niche use case in a response to your sibling comment, BTW.)

cbsmith9y ago

In a trusted system, if you don't trust the structure you are working with, why would you trust the signature?

I'd want to always work from the signed blob.

That said, this is one reason to use FlatBuffers/Cap'n Proto I guess: you don't have to worry about this since you never unpack the blob.

JoachimSchipper9y ago

Think of a data flow A->B->C, with A being e.g. the incoming message server, B being a spam/virus filter, and C holding the user's mailbox. Spam/virus filters are useful, but are also rather vulnerable - so C is willing to trust B's spam/non-spam judgement, but wants to ensure that B can't alter or make up messages.

If protobufs had one canonical encoding, B could unpack the message and re-pack it when done; with the current protobuf implementation, B needs to keep the original blob around. In either case, C needs to check the signature on whatever blob it receives.
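A minimal sketch of why B has to keep the original blob. Two substitutions are assumed here for brevity: JSON stands in for a serializer without a canonical form, and an HMAC with a made-up key stands in for the signature (in the real flow an asymmetric signature would be needed, since B must not be able to sign).

```python
import hashlib
import hmac
import json

KEY = b"demo-key"  # hypothetical key, for illustration only


def sign(blob: bytes) -> bytes:
    """Signature stand-in: HMAC-SHA256 over the exact serialized bytes."""
    return hmac.new(KEY, blob, hashlib.sha256).digest()


# A serializes and signs the message.
original = b'{"subject":"hi","from":"a@example.com"}'
signature = sign(original)

# B parses, inspects, and re-serializes. Same content, different bytes.
parsed = json.loads(original)
repacked = json.dumps(parsed, sort_keys=True).encode()

same_content = json.loads(repacked) == parsed
repacked_valid = hmac.compare_digest(sign(repacked), signature)
original_valid = hmac.compare_digest(sign(original), signature)
print(same_content, repacked_valid, original_valid)
```

The signature covers bytes, not meaning, so only a byte-identical re-encoding would let B drop the original.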

(Some details have been changed.)

forrestthewoods9y ago· 4 in thread

Google also has flatbuffers. I wonder if flatbuffers is being used by enough developers to justify significant development?

https://github.com/google/flatbuffers

IshKebab9y ago

I think it's more that GRPC (Google's RPC-over-HTTP2 protocol) directly supports Protobuf, and not Flatbuffers. All of Google's Cloud APIs use Protobuf (for example the [Speech API](https://cloud.google.com/speech/reference/rpc/) ).

I have to say, GRPC is pretty great. It's statically typed, supports loads of languages, the interfaces are simple to define (basically Protobuf), and it supports streaming requests! Most RPC systems omit that, or only have message streams (e.g. MQTT). Good RPC systems need both.

The only downside I find is that it is rather complicated (in design; not use).

forrestthewoods9y ago

As an FYI, GRPC support was added to flatbuffers a month ago. https://github.com/google/flatbuffers/tree/master/grpc

alfalfasprout9y ago

Been using flatbuffers in production for a high speed market feed for a month now. Love it. Decode/encode time is absurdly fast (~1-2 microseconds for a small to medium schema). If you're pushing 50k+ events/second it can be a great choice. Takes up almost no space on the wire too.

amluto9y ago

Try Cap'n Proto instead. Better designed and faster.

mattiemass9y ago· 3 in thread

Wow, this seems to address a bunch of problems I've experienced with protobuf in the past. Looks awesome!

grosbisou9y ago

Could you expand on the problems you encountered?

colanderman9y ago

I've never looked at proto3, but proto2 has at least the following issues:

* No clue about namespacing. If you pick the wrong name for something, you can have name clashes within a protobuf, across uninterpreted option classes, with protobuf source code, with your own source code; and it's different if you're in Python or C. Nowhere are naming restrictions defined.

* The API is maddening and inconsistent, especially in Python. (It's totally different between Python and C.) Some things look like lists but really aren't (e.g. you can't assign a list to a repeated field in Python). Even basic reflection (e.g. to get at uninterpreted options) is a Lovecraftian nightmare, and the docs are wholly unhelpful.

* Good luck serializing a list. There's not really such a thing, despite that the API pretends like there is; there are only repeated fields. So you need a separate flag to distinguish "empty list" from "not present list".

* Abstruse implementation. There are so many layers of indirection in the generated source and the core library that I wouldn't know where to start debugging.

Not sure if they fixed any of these issues with proto3.

mattiemass9y ago

Dealing with forward- and backwards-compatibility with enum changes has bitten me many times in the past. So have required fields.

andrewmcwatters9y ago· 3 in thread

Could someone explain to me why you would use Protocol Buffers, Cap'n Proto, etc versus rolling your own type-length-value protocol besides API interop?

What if your team could write a smaller TLV protocol, and it was necessary to keep your codebase small? Would this not be wise? Are Protobufs and party not comparable to TLV protocols?

euyyn9y ago

In the vast majority of cases, you want your team to spend their time doing something other than reinventing protos, debugging the in-house implementation, maintaining the library, etc.

It's not clear to me anyway how doing it yourself would help keeping your codebase small vs using protos. In terms of code to maintain, doing it yourself is a net loss. In terms of binary size and method count, the proto libraries for Objective-C and Android are optimized like crazy.

andrewmcwatters9y ago

Those are all reasons why I wanted to use protobufs to begin with. It sounded like it solved many issues for us.

But I'm thinking about scripting environments, where the data types used in protobufs don't exist in the host language. Simple things like this. I think in the implementations I've seen, they're just coerced or ignored. That's fine, imo.

But in terms of small codebases: a simple TLV protocol, where only limited data types are implemented, can be 1/10th of the size of any protobufs implementation.
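For a sense of scale, here is roughly what a bare-bones type-length-value codec can look like. The layout (1-byte type, 4-byte little-endian length, then payload) and the type ids are assumptions invented for this sketch, not the format described above.

```python
import struct

HEADER = struct.Struct("<BI")     # 1-byte type id, 4-byte payload length
TYPE_INT, TYPE_STR = 1, 2         # hypothetical type ids


def encode(type_id: int, payload: bytes) -> bytes:
    """Prefix the payload with its type and length."""
    return HEADER.pack(type_id, len(payload)) + payload


def decode_all(data: bytes) -> list:
    """Walk the buffer and return (type_id, payload) pairs."""
    records = []
    offset = 0
    while offset < len(data):
        type_id, length = HEADER.unpack_from(data, offset)
        offset += HEADER.size
        payload = data[offset:offset + length]
        if len(payload) != length:
            raise ValueError("truncated record")
        records.append((type_id, payload))
        offset += length
    return records


stream = encode(TYPE_INT, struct.pack("<q", 42)) + encode(TYPE_STR, b"hi")
decoded = decode_all(stream)
print(decoded)
```

What this leaves out (schemas, nested messages, versioning rules, cross-language code generation) is most of what the protobuf implementations spend their size on.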

My team has built out a high performance type-length-value system that doesn't require compiled schemas for game development, and we have a very small serialization lib that's smaller than any protobufs implementation for our target language.

I'd like to use protobufs to decrease the amount of modules we have to personally maintain, but I don't see the value in doing so for our particular situation.

1 more reply

dyoo19799y ago

The efforts toward making the protocol robust might be helpful, depending on context. https://groups.google.com/d/topic/protobuf/DwyPEnvFJ-o/discu...

wehadfun9y ago· 3 in thread

In C#, why use Protocol Buffers over the XML or binary serializers?

klodolph9y ago

The C# binary serializer is not really comparable in terms of what it does. It's more like Python's pickle module.

http://stackoverflow.com/questions/703073/what-are-the-defic...

C# binary serialization is only useful in certain circumstances. It doesn't work outside the .NET world and it even has compatibility problems within the .NET world—you can break deserialization by making certain changes to your code. From the Microsoft documentation:

> The state of a UTF-8 or UTF-7 encoded object is not preserved if the object is serialized and deserialized using different .NET Framework versions.

(From https://msdn.microsoft.com/en-us/library/72hyey7b(v=vs.110)....)

Also see https://msdn.microsoft.com/en-us/library/ms229752(v=vs.110)....

bmm6o9y ago

Performance and data size are much better with protobufs: http://stackoverflow.com/questions/549128/fast-and-compact-o.... Built-in serializers are only workable when both ends are on the same platform (i.e. .Net), and even then class versioning can be a problem.

recursive9y ago

Your message will be about 5% the size of the xml one, and it will be backwards compatible, unlike the built-in binary serializer.

zbjornson9y ago· 2 in thread

> primitive fields set to default values (0 for numeric fields, empty for string/bytes fields) will be skipped during serialization.

I don't totally understand this. Presumably during deserialization they will be set to defaults and not missing? Otherwise, coupled with the removal of required fields, it seems impossible to actually send a 0-value number or empty string, or to send a proto without a field and not have it set to 0 or "" (have to explicitly null the field?).

prattmic9y ago

Within the API, proto3 does not have the concept of field presence. All fields are "present" and default to their type's zero value.

Since the client can handle this, there is no need to explicitly serialize default values.
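A rough sketch of what that means on the wire, assuming a hypothetical message with `user_id` as field 1 (varint) and `name` as field 2 (length-delimited string). The hand-rolled encoder below follows the documented protobuf wire format (key = field number << 3 | wire type), but it is only an illustration, not the generated code, and it handles non-negative integers only.

```python
def encode_varint(n):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data, pos):
    """Return (value, new_position) for the varint starting at data[pos]."""
    result = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_user(user_id=0, name=""):
    """Serialize the hypothetical message, skipping fields at their defaults."""
    out = bytearray()
    if user_id != 0:  # zero is the default, so it is not written
        out += encode_varint((1 << 3) | 0) + encode_varint(user_id)
    if name != "":    # empty string is the default, so it is not written
        raw = name.encode("utf-8")
        out += encode_varint((2 << 3) | 2) + encode_varint(len(raw)) + raw
    return bytes(out)


def decode_user(data):
    """Start from the defaults and overwrite whatever appears on the wire."""
    user = {"user_id": 0, "name": ""}
    pos = 0
    while pos < len(data):
        key, pos = decode_varint(data, pos)
        field, wire_type = key >> 3, key & 7
        if field == 1 and wire_type == 0:
            user["user_id"], pos = decode_varint(data, pos)
        elif field == 2 and wire_type == 2:
            length, pos = decode_varint(data, pos)
            user["name"] = data[pos:pos + length].decode("utf-8")
            pos += length
        else:
            raise ValueError("unexpected field in this sketch")
    return user
```

With this scheme `encode_user(0, "")` produces zero bytes, and decoding zero bytes yields `user_id == 0` and `name == ""`. The receiver cannot tell "sent as 0" from "never set", which is exactly the loss of field presence being discussed.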

merb9y ago

And how do you send an explicit zero, so that the client knows the field was really set by the server and is not just the default? Or an explicit empty string?
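One answer is to wrap the scalar in a sub-message, which is what the boxed "well-known types" do. Message-typed fields are written whenever they are set, even if the message inside is empty, so presence survives. The sketch below assumes a hypothetical outer message whose field 1 is a wrapper message with a single varint `value` in its own field 1; the function names are invented and the encoding is hand-rolled for illustration only.

```python
def encode_varint(n):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data, pos):
    """Return (value, new_position) for the varint starting at data[pos]."""
    result = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_wrapped_count(count):
    """count is None (field not set) or a non-negative int (set, maybe 0)."""
    if count is None:
        return b""  # wrapper absent: nothing on the wire
    # Inside the wrapper, a zero value is still skipped as a default...
    inner = b"" if count == 0 else encode_varint((1 << 3) | 0) + encode_varint(count)
    # ...but the wrapper itself is written, possibly with length 0.
    return encode_varint((1 << 3) | 2) + encode_varint(len(inner)) + inner


def decode_wrapped_count(data):
    """Return None if the wrapper is absent, else the wrapped integer."""
    if not data:
        return None
    key, pos = decode_varint(data, 0)
    if key != ((1 << 3) | 2):
        raise ValueError("unexpected field in this sketch")
    length, pos = decode_varint(data, pos)
    inner = data[pos:pos + length]
    if not inner:
        return 0  # wrapper present but empty: explicit default
    inner_key, ipos = decode_varint(inner, 0)
    if inner_key != ((1 << 3) | 0):
        raise ValueError("unexpected inner field in this sketch")
    value, _ = decode_varint(inner, ipos)
    return value
```

An explicit zero costs two bytes here (key plus a zero length) instead of none, and the reader gets three distinguishable states: absent, present and zero, present and non-zero.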

3 more replies

jalfresi9y ago· 1 in thread

"The main intent of introducing proto3 is to clean up protobuf before pushing the language as the foundation of Google's new API platform"

Does anyone know if this means Google's public APIs will be proto3 based? I quite like protobufs.

agency9y ago

They've been experimenting[1] with exposing Google Cloud Platform APIs over gRPC (which is powered by proto3), so it seems quite likely.

[1] https://cloud.google.com/blog/big-data/2016/03/announcing-gr...

gonyea9y ago

Shocking! Google's started supporting more languages than just the ones they care about. I really hope this signals the death of their disdain culture.

Being a worthwhile Cloud provider means hiring experts in all sorts of languages and supporting their efforts.

Imagine a world where Google didn't just "support node" (YEARS late), but actually turned their v8 expertise into a Cloud product.

But that'd involve convincing Java-devs-turned-VPs to care about JavaScript, <2004>and EVERYONE knows that JavaScript is a terrible language.</2004>

blt9y ago

I was hoping for packed serialization of non-primitive types. I once used Protobuf to serialize small point clouds, and ended up needing to serialize them as a packed double array and reconstruct the (x, y, z) structure at read time to avoid Protobuf malloc'ing each point individually. Not a huge deal, but it would be a real pain for more complex types.
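The workaround described above can be sketched roughly as follows, assuming the points travel as one flat run of little-endian doubles (the layout a packed repeated double field uses for its payload). The helper names `pack_points` and `unpack_points` are hypothetical, and this shows only the flatten and regroup step, not the protobuf framing around it.

```python
import struct


def pack_points(points):
    """Flatten [(x, y, z), ...] into one buffer of little-endian doubles."""
    flat = [coord for point in points for coord in point]
    return struct.pack(f"<{len(flat)}d", *flat)


def unpack_points(payload):
    """Rebuild the (x, y, z) tuples from a flat buffer of doubles."""
    if len(payload) % 24 != 0:  # 3 doubles of 8 bytes each per point
        raise ValueError("payload is not a whole number of (x, y, z) points")
    flat = struct.unpack(f"<{len(payload) // 8}d", payload)
    return [tuple(flat[i:i + 3]) for i in range(0, len(flat), 3)]
```

This works because every point has the same fixed shape. For a nested type with optional or variable-length members, the regrouping logic has to be written by hand for each type, which is the pain the comment is pointing at.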

