Msgpack can't differentiate between raw binary data and text strings

(github.com)

48 points · ericz · 13y ago · 35 comments

stock_toaster · 13y ago · 14 in thread

msgpack always seemed an odd thing to me. Compressed json (gzip, lz*) is small, and fast (see: http://news.ycombinator.com/item?id=4091051). If you need structure, use protocol buffers or thrift.

I actually like tnetstrings for backend messaging, but I don't see it used very often. json is pretty damn ubiquitous these days.

drewcrawford · 13y ago

For me the advantage of msgpack is simplicity / efficiency in parsing the response.

If you are decoding a JSON string, you run over the string twice: first to search for the terminating quote character so that you know how much memory to allocate, and second to copy the string.

Whereas with msgpack, it's more like Pascal-model strings where you know the length of the string up front; you can do it in one pass. You also get some structural hints that make more of the parsing a parallelizable problem, whereas with JSON it is very difficult to get any benefit out of more cores.

This may not make any difference if you are serving up an API request or three to x86 computers with broadband, but they make a lot of difference if the data you're working with is large relative to the bandwidth or latency of the pipe or the CPU power available on the other side of it.
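To make the one-pass versus two-pass point concrete, here is a minimal Python sketch. `read_raw` follows the raw family as laid out in the MessagePack format spec of the time (fix raw `0xa0`-`0xbf`, raw 16 `0xda`, raw 32 `0xdb`). `scan_json_string` is a simplified stand-in for the first pass of a JSON decoder. Both function names are made up for illustration and are not taken from any msgpack or JSON library.

```python
import struct


def read_raw(buf: bytes, pos: int = 0):
    """Decode one length-prefixed raw value; return (payload, next_pos)."""
    tag = buf[pos]
    if 0xA0 <= tag <= 0xBF:
        # fix raw: the length lives in the low 5 bits of the tag byte
        length, start = tag & 0x1F, pos + 1
    elif tag == 0xDA:
        # raw 16: 2-byte big-endian length follows the tag
        (length,) = struct.unpack_from(">H", buf, pos + 1)
        start = pos + 3
    elif tag == 0xDB:
        # raw 32: 4-byte big-endian length follows the tag
        (length,) = struct.unpack_from(">I", buf, pos + 1)
        start = pos + 5
    else:
        raise ValueError(f"not a raw value: tag 0x{tag:02x}")
    end = start + length
    if end > len(buf):
        raise ValueError("truncated raw value")
    # The size is known before touching the payload: one slice, one copy.
    return buf[start:end], end


def scan_json_string(text: str, pos: int = 0) -> int:
    """First pass over a JSON string: return the index just past the closing quote."""
    if text[pos] != '"':
        raise ValueError("not a JSON string")
    i = pos + 1
    while i < len(text):
        if text[i] == "\\":
            i += 2  # skip the escaped character
            continue
        if text[i] == '"':
            return i + 1
        i += 1
    raise ValueError("unterminated JSON string")
```

The JSON side still needs a second pass after `scan_json_string` to copy and unescape the characters, while `read_raw` is done after a single slice.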

stock_toaster · 13y ago

  > For me the advantage of msgpack is simplicity / efficiency in parsing the response

If you like that, you might want to take a look at tnetstrings as a format.

nitrogen · 13y ago

It's for speed. I write web software in Ruby that runs on embedded devices. This software needs to communicate with a C backend that is consuming much of the available CPU power, so there's not much time left to waste on parsing the messaging protocol. I did some tests and found that JSON is significantly slower than simple key-value pair parsing (implemented as a state machine in C), which itself is half as fast as msgpack. This doesn't even consider the overhead of gzip compression.

Edit: here are my results, which lack the numbers for msgpack except a mention at the end (I hate linking to Posterous, but I haven't moved my blog yet): http://nitrogen.posterous.com/164964342

duaneb · 13y ago

> I write web software in Ruby that runs on embedded devices.

Ok I gotta ask.... Why on earth would you do that? It seems like an exercise in masochism. I wasn't even aware that ruby would compile on embedded arches.

badgar · 13y ago

Have you tried using thrift or protocol buffers, which aren't self-describing (self-description introduces enormous overhead for <1MB messages)?

kiyoto · 13y ago

I think your view, while warranted, is very web programming centric. MessagePack was initially developed for RPC. That might explain some of the "oddities" when viewed as a drop-in, space-efficient replacement for JSON.

prodigal_erik · 13y ago

msgpack.org probably shouldn't be promoting it as a "perfect replacement", when valid messages which have no JSON equivalent have already been seen in practice. At best it's a superset, and if none of the implementations can confine you to a JSON-compatible subset (no byte strings, no int64, no non-string keys) which assures interop, that's a problem.

Mostly I wish people would agree on a schema and use ASN.1 PER, rather than choosing from the ever-growing list of binary type-length-value formats which put a redundant copy of the schema on the wire in every message (making them neither small nor readable). I've never had occasion to do anything useful with a message whose format wasn't already known when I wrote the code.
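As a rough illustration of what confining messages to a JSON-compatible subset could look like, here is a small Python check that could run before encoding. The function name and the 2**53 integer bound are assumptions made for this sketch (one reading of "no int64" for JSON readers that store numbers as doubles); it is not part of any msgpack implementation.

```python
import math

MAX_SAFE_INT = 2**53  # assumption: integers a double-based JSON reader keeps exact


def is_json_compatible(value) -> bool:
    """Return True if value uses only types that map cleanly onto JSON."""
    if value is None or isinstance(value, (bool, str)):
        return True
    if isinstance(value, int):
        return -MAX_SAFE_INT <= value <= MAX_SAFE_INT
    if isinstance(value, float):
        return math.isfinite(value)  # NaN and infinities have no JSON form
    if isinstance(value, (list, tuple)):
        return all(is_json_compatible(item) for item in value)
    if isinstance(value, dict):
        # JSON object keys must be strings
        return all(
            isinstance(key, str) and is_json_compatible(item)
            for key, item in value.items()
        )
    return False  # byte strings and anything else


print(is_json_compatible({"id": 7, "tags": ["a", "b"]}))
print(is_json_compatible({"blob": b"\x00\x01"}))
```

An encoder with a strict mode could call a check like this and refuse to emit anything outside the subset.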

stock_toaster · 13y ago

> I think your view, while warranted, is very web programming centric

FWIW, I do very little "web work" (by which I assume you mean front-end?). Instead it is mostly api endpoints for mobile and server-to-server backend messaging. Lots of http transport stuff. If I were doing rpc, I would either not (and use a restful or type-2 api), or use something like protobuf or thrift.

Msgpack just doesn't seem, to me, like a really great fit for anything in particular.

latchkey · 13y ago

Agreed. I tried to use msgpack and ran into all sorts of weird recursion issues (basically, it threw up on even the simplest objects). I ended up just using gzip and it works great.

stock_toaster · 13y ago

I think if you control both ends of the interaction, lz4 or snappy may be interesting to use. But for most things, I do think gzip is better supported (especially http).

kiyoto · 13y ago

Which language's binding did you use? MessagePack is a protocol, not an implementation. Anyway, if gzipping worked great for you, that sounds like the way to go =)

makmanalp · 13y ago

I think the problem with that analysis is that it ignores the gzip compression / decompression time. Also, that file is so uniform that it probably presents a best-case for gzip compression.

mistercow · 13y ago

It seems unlikely to me that the compression/decompression time would outweigh the time to transmit 382 extra KB of data, but I suppose it depends on the application and how valuable computation time is on the server side.

azov · 13y ago

The msgpack library is almost an order of magnitude smaller than protobuf lite, and almost 100 times smaller than full protobuf. This is not counting the size of generated protocol classes. There are situations when size matters :-)

drewcrawford · 13y ago · 2 in thread

Bikeshedding at its finest.

I ran into this issue a few months ago, on a cross-platform project involving four languages that each take a distinctly different view about strings from the other three. Although this situation is a common objection to supporting strings in the issue thread, it took just a couple of hours to extend msgpack to support strings in a reasonable-enough-for-me way on each platform.

The proposals in the thread are a lot better than mine. And I suppose it's pretty antisocial / arrogant for me to just roll my own implementation without consulting anybody. But in three years[0] of talking about the problem, nothing had gotten done. Meanwhile, my code shipped a long time ago.

I do this a lot--fork people's projects to solve my problems and don't merge back changes--and I feel guilty for not being more participatory with the project maintainers. But the fact is that the expected cost of getting embroiled in a flamewar like this is high (whether it is over architecture, whitespace convention, "behavior by design", "Jim's already working on that", etc.), whereas the benefit to me of getting my changes merged upstream is essentially zero. So my antisocial behavior continues to be positively reinforced.

Does anyone else have this problem? Or do people just enjoy flamewars more than I do, or have the persuasive skills to avoid them?

[0] https://github.com/msgpack/msgpack/issues/13

vidarh · 13y ago

I generally take the approach that I fix the issue, add a comment and ask if they want a pull request and don't do anything else unless the maintainer expresses interest. If they do express interest, I'll go pretty far in trying to clean up my fixes to make them suitable, as long as they still solve my problem. If they don't express any interest, oh well, my fork will be there and a comment will be there to point other people to a viable solution.

A lot of the time the response is very welcoming. E.g. I recently provided a substantial patch to Beaneater (Beanstalkd client library for Ruby) and the maintainers were all over it immediately, and we got it merged in quickly.

The benefit of taking the effort is to be able to keep up with upstream without having to reapply patches. But that benefit is limited (often I will prefer to stay with an "old" known entity rather than tracking upstream, as long as security concerns don't force me to upgrade), and so I don't spend a lot of time pursuing it.

I strongly believe code speaks louder than words in this kind of situation, and often shipping code will be more likely to get acceptance than engaging in discussions.

wallrat · 13y ago

Same here.

Last week it was needing Redis slaves to handle 'SLAVEOF NO ONE' from the master without crashing. Needed to tell all read-slaves (hundreds) to stop trying to reconnect when taking the master down.

It's a fine balance though, you don't want to be stuck with too many forks to maintain.

bengotow · 13y ago · 2 in thread

I'm fine with the fact that Msgpack does not differentiate between binary data and text strings. Sure, it requires a schema, but if you're concerned with data size and parsing speed, you should choose an encoding appropriate for your task anyway.

The bigger problem is that Msgpack is advertised as being "like JSON, but fast and small." To me, that makes it sound like I can replace JSON messages with Msgpack messages and be done, and that's not at all the case, because I need to add a schema layer. I think the "like JSON" comparison is what is really causing this frustration with the format.

chubs · 13y ago

Hi, I wrote (with another guy) the objective-c wrapper.

You might be misinformed re schema layers, as msgpack does convert to and from a dictionary in much the same way that JSON does, keeping all your dictionary keys (which are strings) intact. In fact, we originally used it as a drop-in replacement for JSON.

As for the data vs string issue, it was designed originally to be as conveniently similar to JSON as possible - which is why you don't get back a dictionary full of NSData's which you then need to convert manually to NSString's; it does that automatically for you. This was a convenience vs correctness tradeoff. People who say it's wrong are quite right. They're very welcome to fork it, or submit patches with options to return raw NSData, or create a new wrapper - it wouldn't take a competent dev very long to re-write what we did.

Now, I've not used messagepack in quite a while - I've simply found that gzipped json is usually almost as good.

bengotow · 13y ago

Thanks for replying. When I explored Msgpack, it was the Objective-C library that I tried using. Overall it was a great experience - nice work on the wrapper. I think you're right - I was trying to use Msgpack to do more than I could do with JSON (namely, to transmit NSData without having to stringify it). When I realized all the NSData objects were being automatically converted into strings when the data structure was inflated, I figured I'd need to prevent that behavior and do it on only certain keys (which would need to be specified somehow, hence my thought of a schema). Thanks for clarifying!
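The "only certain keys" idea can be sketched in a few lines of Python. This assumes a decoder that hands back every raw value as bytes, and a caller-supplied set of key names that should be treated as text. The function name and the `text_keys` parameter are hypothetical, and this is not how the Objective-C wrapper works.

```python
def apply_text_schema(decoded: dict, text_keys: set) -> dict:
    """Decode map keys, plus only the values listed in text_keys, as UTF-8 text."""
    out = {}
    for raw_key, value in decoded.items():
        # Assumption for this sketch: map keys are always meant to be text.
        key = raw_key.decode("utf-8") if isinstance(raw_key, bytes) else raw_key
        if key in text_keys and isinstance(value, bytes):
            value = value.decode("utf-8")
        out[key] = value  # everything else stays as raw bytes
    return out


message = {b"name": b"caf\xc3\xa9", b"thumbnail": b"\xff\xd8\xff\xe0"}
print(apply_text_schema(message, {"name"}))
```

The set of text keys is exactly the out-of-band schema: both ends have to agree on it, since nothing on the wire says which raw values are strings.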

jrmg · 13y ago · 1 in thread

The conflict here seems to be between people who think any arbitrary valid msgpack stream should be decodable into a specific object graph, and those who assume msgpack will be used to implement a protocol where only messages of a predefined format should be allowed - hence the decoding app will know beforehand what should be a string and what shouldn't.

The conflict is unresolvable until the participants agree on which of these two distinct things msgpack should be.

dietrichepp · 13y ago

I don't see the conflict here. Spend one bit per string encoding whether it's UTF-8 data or a binary blob.

* The people who use it to implement protocols already have to deal with types, e.g., expected a number but got a string. So one more type is not a big deal.

* The people who use it to create discoverable profiles will... use JSON no matter how good MessagePack gets.

That's not the direction I was headed when I started writing, but I don't think the first group you mentioned even exists.
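A toy Python sketch of the "one bit per string" idea, to show how little it takes. The wire layout here is invented for illustration (a 4-byte big-endian header whose top bit marks UTF-8 text and whose low 31 bits hold the length). It is not the MessagePack encoding, nor any of the proposals from the issue thread.

```python
import struct

TEXT_FLAG = 0x80000000  # hypothetical layout: top bit of the header marks UTF-8 text


def pack_value(value) -> bytes:
    """Encode a str or bytes value with a flag bit recording which one it was."""
    if isinstance(value, str):
        payload, flag = value.encode("utf-8"), TEXT_FLAG
    elif isinstance(value, (bytes, bytearray)):
        payload, flag = bytes(value), 0
    else:
        raise TypeError("only str and bytes are handled in this sketch")
    if len(payload) >= TEXT_FLAG:
        raise ValueError("payload too long for a 31-bit length")
    return struct.pack(">I", flag | len(payload)) + payload


def unpack_value(buf: bytes, pos: int = 0):
    """Decode one value; return (value, next_pos) with the original type restored."""
    (header,) = struct.unpack_from(">I", buf, pos)
    length = header & 0x7FFFFFFF
    start = pos + 4
    payload = buf[start:start + length]
    if len(payload) != length:
        raise ValueError("truncated value")
    value = payload.decode("utf-8") if header & TEXT_FLAG else payload
    return value, start + length
```

With the flag on the wire, a round trip gives back `str` for text and `bytes` for blobs without any out-of-band schema.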

Groxx · 13y ago

... by design. Because there's no "string" type. This is a bug report about high level implementations that don't encode and decode in reversible ways, contrary to the msgpack protocol. http://wiki.msgpack.org/display/MSGPACK/Format+specification

Really, for a protocol that values minimal space usage, not defining a string type is probably a good thing. Use the one that produces the fewest bytes in your application - it may not be UTF-8.

Also:

>For instance, the objective C wrapper is currently broken because it tries to decode all raw bytes into high-level strings (through UTF-8 decoding) because using a text string (NSString) is the only way to populate a NSDictionary (map).

Well there's your problem: https://github.com/msgpack/msgpack-objectivec/blob/master/Me... It's a buggy wrapper that's trying to be convenient. And NSString keys are by no means the only way to populate an NSDictionary, and it doesn't look like the Objective-C wrapper requires this: https://github.com/msgpack/msgpack-objectivec/blob/master/Me...

Confusion · 13y ago

Well, then the information of whether some sequence of bytes is a string needs to be communicated out of band. That's a perfectly acceptable design decision, but one that may lead potential users to favor alternatives that include that information in band.

The discussion is pointless if the objectives of the participants differ and none is willing to compromise.

lnanek2 · 13y ago

Reading TFA apparently there is no string type, so nothing in it is a string. It's all binary data, byte array or whatever that means in your language of choice. If an application or a library the application uses converts a string into binary data before and converts it back after, that's none of the format's business.

ambrop7 · 13y ago

The solution to these problems is for everyone to be completely ignorant of any character encoding and just deal with octets. If the characters represent UTF-8 text, then only when text needs to be presented or interpreted in some way, UTF-8 decoding happens. Any automatic encoding or decoding of UTF-8 (such as what Python3 does) is stupid.

EDIT: A common example of implicit and wrong handling of character encoding is when a file gets created with invalid characters in its name, and your Linux file manager is unable to delete it. This can happen because the file manager assumes the file names it gets from the OS are text, which it decodes incompletely. When it wants to delete the file it encodes the text back, but the result is different from the original file name bytes. The error happens because the file manager tries to decode them as text too early: it should keep the original octets as a reference to the file, and only decode them when it needs to display a file name.
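A short Python sketch of the "keep the octets, decode only for display" approach. It relies on `os.listdir` returning `bytes` names when given a `bytes` path. The three helper names are made up for this example.

```python
import os


def display_name(raw: bytes) -> str:
    """Lossy, display-only decoding; undecodable bytes become U+FFFD."""
    return raw.decode("utf-8", errors="replace")


def list_entries(directory: bytes):
    """Return (raw_name, display_name) pairs; raw_name stays the file's identity."""
    return [(raw, display_name(raw)) for raw in os.listdir(directory)]


def delete_entry(directory: bytes, raw_name: bytes) -> None:
    """Delete using the original octets, never the re-encoded display text."""
    os.unlink(os.path.join(directory, raw_name))


raw = b"bad-\xff.txt"                   # not valid UTF-8
shown = display_name(raw)
print(shown.encode("utf-8") == raw)     # re-encoding the display text loses the name
```

A file manager that stored only `shown` would ask the OS to delete a name that does not exist, which is the failure described above.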

Beltiras · 13y ago

I can very easily froth at the mouth when it comes to character encoding problems. It's one of those problems that should never even be a problem, but ends up consuming hours upon hours of consulting arcane cobwebby specs.
