This CBOR format is being proposed by the VPN Consortium - presumably there's some specific VPN interoperability application they have in mind for this. In the meantime, everybody else will continue to use compressed JSON, or protocol buffers, or whatever other standards have good library support and interoperability and - crucially - adoption in their domain.
-a lot of the time, a dearth of implementations of a new Thing is not because the new Thing is bad, but simply because people are change-averse and lazy, even in the face of an objectively better Thing, and
-I still consider this a quality submission; even if CBOR doesn't get adopted it's still neat to read. It's like watching one's government draft new legislation, except more relevant.
The length field for compound types (arrays and maps) specifies the length in "the number of items", not in bytes. This means that while processing, if I need to skip a compound type, I actually need to process it in its entirety. Not very "small device" friendly.
In practice, I have found far more utility in knowing the byte-length of a compound field in advance than the number of items it contains. If I am interested in the field, I am going to find out the number of items anyway, because I am going to process it. If I am not interested in the field, the number of items is useless to me, but the byte-length would have come in handy.
That seems like something that's going to come back and byte us.
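To make the complaint concrete, here's a rough sketch of a skip routine (my own toy code, definite lengths only, no tags). Strings can be skipped in one step because their heads carry a byte count, while arrays and maps force a full recursive walk:

```python
def read_head(buf, pos):
    """Decode a CBOR initial byte plus its argument; return (major, value, new_pos)."""
    ib = buf[pos]
    major, ai = ib >> 5, ib & 0x1f
    pos += 1
    if ai < 24:                       # argument packed into the initial byte
        return major, ai, pos
    n = {24: 1, 25: 2, 26: 4, 27: 8}[ai]
    return major, int.from_bytes(buf[pos:pos + n], "big"), pos + n

def skip_item(buf, pos):
    """Skip one data item, returning the offset just past it.
    Because array/map heads give item *counts*, not byte lengths,
    skipping means walking every nested item."""
    major, value, pos = read_head(buf, pos)
    if major in (2, 3):               # byte/text string: length IS in bytes
        return pos + value
    if major == 4:                    # array: must walk 'value' nested items
        for _ in range(value):
            pos = skip_item(buf, pos)
        return pos
    if major == 5:                    # map: must walk value key/value pairs
        for _ in range(2 * value):
            pos = skip_item(buf, pos)
        return pos
    return pos                        # ints, simple values: head only
```

Skipping `[1, [2, 3], "hi"]` (8 bytes on the wire) touches every one of those bytes, whereas a byte-length head would have let you jump straight past it.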
I'm not sure I understand the problem you describe, really.
Even if there are strings, just encode their lengths; and if you store a compound type, write the byte size whenever the size can vary.
Which is strange for a thing calling itself 'concise'.
[BSON] is a data format that was developed for the storage of JSON-
like maps (JSON objects) in the MongoDB database. Its major
distinguishing feature is the capability for in-place update,
foregoing a compact representation. BSON uses a counted
representation except for map keys, which are null-byte terminated.
While BSON can be used for the representation of JSON-like objects on
the wire, its specification is dominated by the requirements of the
database application and has become somewhat baroque. The status of
how BSON extensions will be implemented remains unclear.

I'm a little bit tired (well, more than a little tired) of standards that aren't couched in terms that are directly executable. English descriptions and pseudo-code are fine, but in the end I want to have some working code that implements an API for the stuff. It doesn't have to be an official API, but something usable shows me that (a) it is indeed usable, and (b) it will go a long way towards heading off other people's mistakes.
We don't do crypto without test vectors. I don't know why we think we can do other complex standards without test vectors, either. (I worked on NBS / NIST in the 70s on some verification suites. Have we lost that practice?)
I think that much of what is busted on the modern web can be traced back to loose English and a lack of reference code (even stuff with placeholders). CSS, HTML, etc., I'm looking at you... :-/
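For what it's worth, RFC 7049 does ship examples in Appendix A, and they're exactly what makes a toy encoder checkable. A minimal sketch (my own code, unsigned ints, text strings, and arrays only), verified against a few of those vectors:

```python
def cbor_uint(n):
    """Preferred (shortest) encoding of an unsigned integer, major type 0."""
    if n < 24:
        return bytes([n])
    for ai, size in ((24, 1), (25, 2), (26, 4), (27, 8)):
        if n < 1 << (8 * size):
            return bytes([ai]) + n.to_bytes(size, "big")

def cbor_text(s):
    """Text string, major type 3: reuse the uint head and OR in the type bits."""
    b = s.encode("utf-8")
    head = cbor_uint(len(b))
    return bytes([0x60 | head[0]]) + head[1:] + b

def cbor_array(items):
    """Array of already-encoded items, major type 4."""
    head = cbor_uint(len(items))
    return bytes([0x80 | head[0]]) + head[1:] + b"".join(items)
```

Checking `cbor_uint(100) == 0x1864`, `cbor_text("IETF") == 0x6449455446`, and `cbor_array` of 1, 2, 3 against `0x83010203` (all from Appendix A) is a five-line test, which is roughly the point: test vectors make the "is my reading of the spec right?" question mechanical.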
Why? I can see the advantages of either one, but I don't see what having both gets you.
In my experience the implementation advantages of having length-prefixed lists disappear if you have to support indefinite lengths anyway.
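Concretely (a toy decoder sketch, my own naming): once you support the indefinite case, the definite case is just a second loop over the same per-item decoder, so length-prefixing buys the decoder very little:

```python
def decode_array(buf, pos, decode_item):
    """Decode an array head at pos. 'decode_item' is any callable with the
    shape (item, new_pos) = decode_item(buf, pos) for one nested item."""
    ib = buf[pos]
    assert ib >> 5 == 4, "not an array head"
    ai = ib & 0x1f
    pos += 1
    items = []
    if ai == 31:                              # indefinite: read until 0xff break
        while buf[pos] != 0xff:
            item, pos = decode_item(buf, pos)
            items.append(item)
        return items, pos + 1                 # consume the break byte
    if ai >= 24:                              # multi-byte count argument
        n = 1 << (ai - 24)
        count = int.from_bytes(buf[pos:pos + n], "big")
        pos += n
    else:
        count = ai
    for _ in range(count):                    # definite: counted loop
        item, pos = decode_item(buf, pos)
        items.append(item)
    return items, pos

def tiny_uint(buf, pos):
    """Toy item decoder: unsigned ints below 24 only."""
    return buf[pos], pos + 1
```

Both `83 01 02 03` (definite) and `9f 01 02 03 ff` (indefinite) come out as `[1, 2, 3]`; the only structural difference between the two code paths is "loop N times" versus "loop until break".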
- Passing small messages around
- Doing streaming of large content (occasionally)
I'm probably doing these over different pipes, but the data shares a lot of the same characteristics and I don't want to use two totally different APIs to get the job done.
"Large" can be "I need to transfer something on the order of megabytes using a 4K intermediate buffer."
The lack of the string "UUID" in the RFC is also cause for concern.
But, most importantly, using integers for datetime values hides type-level semantics. It's just integers, and you, the end user, not the deserializer, are responsible for handling the types.
I think it's quite inconvenient to do tons of `data["since"] = parse_datetime(data["since"])` all the time, for every model out there.
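For comparison, CBOR's tag 1 (epoch-based date/time) keeps that semantic with a one-byte prefix, so a tag-aware decoder can hand back a datetime instead of a bare integer. A toy sketch (my own code, happy path only), using the RFC 7049 example timestamp:

```python
from datetime import datetime, timezone

def encode_uint(n):
    """Major type 0, shortest form (sketch)."""
    if n < 24:
        return bytes([n])
    for ai, size in ((24, 1), (25, 2), (26, 4), (27, 8)):
        if n < 1 << (8 * size):
            return bytes([ai]) + n.to_bytes(size, "big")

def encode_epoch(dt):
    """Tag 1 (0xc1) wrapping an unsigned epoch-seconds integer."""
    return b"\xc1" + encode_uint(int(dt.timestamp()))

def decode(buf):
    """Toy decoder: unsigned ints, plus tag 1 restored as a datetime."""
    if buf[0] == 0xc1:                    # tag 1: payload means epoch seconds
        return datetime.fromtimestamp(decode(buf[1:]), tz=timezone.utc)
    ai = buf[0] & 0x1f
    if ai < 24:
        return ai
    n = 1 << (ai - 24)
    return int.from_bytes(buf[1:1 + n], "big")
```

The `parse_datetime` bookkeeping moves into the codec once and for all, which is exactly what a bare-integer convention can't give you.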
Of course, in real-world implementations, the encoder and the decoder will have a shared view of what should be in a CBOR data item. For example, an agreed-to format might be "the item is an array whose first value is a UTF-8 string, second value is an integer, and subsequent values are zero or more floating-point numbers" or "the item is a map that has byte strings for keys and contains at least one pair whose key is 0xab01".
7 is 7 whether it's uint_8 or uint_32, right?
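Right, the argument width is pure encoding, not type. A quick check with a toy decoder (my own sketch): the same 7 in four different widths decodes identically, though the RFC's preferred serialization would only ever emit the shortest form:

```python
def decode_uint(buf):
    """Decode a major-type-0 unsigned integer of any argument width."""
    ai = buf[0] & 0x1f
    if ai < 24:
        return ai                  # value packed into the initial byte
    n = 1 << (ai - 24)             # 24 -> 1, 25 -> 2, 26 -> 4, 27 -> 8 bytes
    return int.from_bytes(buf[1:1 + n], "big")

# The number 7 in four widths: packed, 1-byte, 2-byte, and 4-byte arguments.
encodings = ["07", "1807", "190007", "1a00000007"]
values = [decode_uint(bytes.fromhex(h)) for h in encodings]
```

All four come back as the Python int 7; the wire width never leaks into the decoded value.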
For constrained
applications, where there is a choice between representing a specific
number as an integer and as a decimal fraction or bigfloat (such as
when the exponent is small and non-negative), there is a quality-of-
implementation expectation that the integer representation is used
directly.

I would like to see how this compares to other formats with respect to serialised size...
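Here's one crude data point against compact JSON (my own toy encoder, unsigned ints, strings, lists, and dicts only); obviously a real comparison needs a real corpus:

```python
import json

def head(major, n):
    """CBOR head: major type plus shortest length/value argument."""
    if n < 24:
        return bytes([major << 5 | n])
    for ai, size in ((24, 1), (25, 2), (26, 4), (27, 8)):
        if n < 1 << (8 * size):
            return bytes([major << 5 | ai]) + n.to_bytes(size, "big")

def enc(x):
    """Very small CBOR encoder sketch."""
    if isinstance(x, int):
        return head(0, x)
    if isinstance(x, str):
        b = x.encode("utf-8")
        return head(3, len(b)) + b
    if isinstance(x, list):
        return head(4, len(x)) + b"".join(enc(v) for v in x)
    if isinstance(x, dict):
        return head(5, len(x)) + b"".join(enc(k) + enc(v) for k, v in x.items())

doc = {"name": "CBOR", "tags": ["binary", "ietf"], "count": 7}
cbor_len = len(enc(doc))
json_len = len(json.dumps(doc, separators=(",", ":")))
```

For this document CBOR lands at 36 bytes versus 50 for minified JSON, basically because every quote, colon, comma, and bracket collapses into the type heads. The gap grows for numeric-heavy data and shrinks for long strings, which dominate either way.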
In short, this is because JS does not treat NaN and Infinity as numerical constants but as pre-defined, mutable variables; that way, backward-compatible parsing of a hypothetical sane JSON with eval would be vulnerable to injection. Nevertheless, many JSON codecs have their own ideas about what to do with them, so this stuff can get really nasty.
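You can watch one codec's "own idea" directly with Python's stdlib json module (other codecs behave differently, which is exactly the problem):

```python
import json
import math

# The default encoder happily emits the non-standard JS tokens...
assert json.dumps(float("nan")) == "NaN"          # not legal JSON!
assert json.dumps(float("inf")) == "Infinity"

# ...strict mode refuses instead of emitting them:
try:
    json.dumps(float("nan"), allow_nan=False)
except ValueError:
    print("strict encoder rejects NaN")

# ...and the default parser accepts the tokens back, so round-trips
# quietly succeed even though no conforming JSON parser need accept them.
assert math.isnan(json.loads("NaN"))
```

So the same value round-trips through Python's codec but is a syntax error for a strictly conforming parser, which is the interop trap in a nutshell.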
I've only got the encoder so far (without major type 6, i.e. tagging), and the code is pretty messy and possibly not 100% correct, but it's true that the amount of code required is pretty minimal.