JSON truly is the new XML, both in terms of advantages and disadvantages. I wish people would stop using it for everything and realise that using a "human-readable" format for data that is 99.999999% not going to be seen by a human is absolute insanity in terms of inefficiency.
And the problem here wasn't JSON in the first place anyway, it was
> The “back” stack is unlimited, meaning we’re saving more and more state until we explode
And storing the previous state as a serialized string inside the current state, instead of just storing it natively. You'd have the same problem if you tried to serialize <format> into a field in <format>.
Length-delimited formats don't need (exponentially increasing) escaping.
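To illustrate the point about length-delimited formats, here is a minimal sketch of netstring-style framing ("<length>:<payload>,"). The helper names `frame` and `unframe` are made up for this example, and it counts UTF-16 code units rather than bytes to keep things short. Each level of nesting adds only a few characters of overhead, whatever the payload contains.

```javascript
// Netstring-style framing: "<length>:<payload>,".
// frame/unframe are hypothetical helpers, not part of any library.
function frame(payload) {
  return `${payload.length}:${payload},`;
}

function unframe(framed) {
  const colon = framed.indexOf(":");
  const length = Number(framed.slice(0, colon));
  return framed.slice(colon + 1, colon + 1 + length);
}

// A payload full of characters that JSON would have to escape.
const original = 'screen "home" at /a/b\\c';

// Nest the payload ten levels deep and record the size at each level.
let framed = original;
const framedSizes = [];
for (let depth = 1; depth <= 10; depth++) {
  framed = frame(framed);
  framedSizes.push(framed.length);
}

// Unwrap all ten levels again.
let unwrapped = framed;
for (let depth = 1; depth <= 10; depth++) {
  unwrapped = unframe(unwrapped);
}

console.log(framedSizes, unwrapped === original);
```

The payload is never rewritten, only prefixed, so nothing compounds from one level to the next.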
Yes it was. They stored state in a huge JSON dump which ended up working like a zip bomb by the way they had to escape a string. This alone resulted in an exponential increase in memory that quickly became a problem.
Think about it for a second: would they experience this issue if they stored an in-memory stack per user? They wouldn't.
Heck, it seems they could have lived with their JSON choice if they had replaced '/' with any other character that never gets escaped, and converted it back to '/' when reading.
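The zip-bomb-like growth is easy to reproduce. This is only a sketch with a made-up state shape (`id` and `previous` are illustrative field names, not the article's actual schema): each step stores the previous state as a JSON string inside the new state, and the serialized size is recorded.

```javascript
// Each new state embeds the previous one as an already-serialized string,
// so every existing quote and backslash gets escaped again.
function nestedStateSizes(depth) {
  const sizes = [];
  let state = { id: 1 };
  for (let id = 2; id <= depth; id++) {
    state = { id, previous: JSON.stringify(state) };
    sizes.push(JSON.stringify(state).length);
  }
  return sizes;
}

const escapedSizes = nestedStateSizes(12);
console.log(escapedSizes);
```

After a few levels the backslashes dominate and each navigation roughly doubles the string.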
It lacks comments. Has no way to offer context. Cannot impose limitations, and normalizing needs additional, often poorly supported RFCs (JSON-LD or pointers).
It is also a poor non-human-readable format: it lacks crucial types (datetime, decimal), is rather verbose (just imagine the terabytes of bandwidth wasted globally on all these commas and curly braces), and cannot be parsed streaming.
Definitely not an improvement over XML. Hopefully the next popular standard data format will learn from the past and take the best of previous attempts.
Decimal is there, isn’t it, as part of ‘number’, which lumps together fixed-size integers, IEEE floats and doubles, bigints and bigfloats, and leaves it to the parser to make the right guess as to how to interpret them (often assisted by having the programmer specify a target data type)?
Lumping all those together as ‘number’ makes for a much simpler grammar, but I think that’s not worth it for its negative effects on interoperability. How many json parsers can (easily) read a value of 1234567890123456890E0000000001 and write back the same string to a json file?
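As a quick check of that question against one parser, here is what JavaScript's built-in `JSON` does with that exact value. It says nothing about other parsers.

```javascript
// The literal from the comment above: valid JSON, but more digits than a double can hold.
const originalText = "1234567890123456890E0000000001";

const parsed = JSON.parse(originalText);   // becomes an IEEE 754 double
const rewritten = JSON.stringify(parsed);  // written back in JavaScript's own number format

console.log(typeof parsed, rewritten, rewritten === originalText);
```

The value parses without complaint, but the string that comes back is not the one that went in.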
Apart from datetime, I’d add duration as a basic type, too, even though that has the problem of being imprecise. Users will want to specify durations such as “1m” or “1y” (is that 365 days? 365.2425 to correct for leap years? Does it depend on interpretation by the user? is it a lunar year of 354/355 days as in the Islamic calendar? Etc)
The issue is that JSON ended up being used as a configuration format too, and comments are critical there.
JSON5 is the best answer I have found.
2. I’m still not convinced it’s the file format’s job to encode every possible data type. We used to have deserialization layers for that, but for some reason, we don’t do that anymore. Nowadays we expect every file format to be self-describing, which I think is wasteful and limiting. But that’s a matter of opinion, obviously.
3. What do you mean by “cannot be parsed streaming”? Or to be perhaps more precise, what’s missing in existing streaming JSON parsers? I haven’t used any of them yet.
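On point 2, a deserialization layer for something like datetimes can be quite small. A rough sketch using the standard `reviver` argument of `JSON.parse`; the regular expression and the `reviveDates` name are assumptions for this example, and it only recognises UTC timestamps ending in "Z".

```javascript
// Matches strings such as "2024-01-02T03:04:05Z" (assumed convention for this sketch).
const ISO_DATETIME = /^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?Z$/;

// Reviver: called for every key/value pair while parsing.
function reviveDates(key, value) {
  if (typeof value === "string" && ISO_DATETIME.test(value)) {
    return new Date(value);
  }
  return value;
}

const event = JSON.parse('{"name":"deploy","at":"2024-01-02T03:04:05Z"}', reviveDates);
console.log(event.at instanceof Date, event.name);
```

The wire format stays plain JSON; the type knowledge lives in the reader, which is the trade-off the comment describes.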
Database dumps say hello. :)
And devs don't care about just config files, but also API requests and responses and any other kind of structured/semi-structured communication and data.
In the encode/decode library we use, the generated toString were accidentally so close to json, that we made them exactly so (mostly had to add quotes). Now we can use json tooling on top of efficient messaging.
And anything that calculates individual users is the wrong way to do this math.
However, what does matter is that the author is probably not estimating the average correctly, partially due to this bug. If they scaled without fixing their excessive string storage, they would probably find their estimate to be off.
Even with the fix I’d be surprised if the underlying distribution of data per user is normally distributed.
jo.put("previous", previous);
Rather than jo.put("previous", previous.toJson());
Presumably the reason they didn't do this is because they couldn't handle a recursive schema?
Recursively serialized JSON, XML, whatever is a code smell in general.
(Note that outside certain examples like this, in general BSON is not smaller than JSON. You'd need something like protobufs.)
1 \
2 \\
4 \\\\
8 \\\\\\\\
We see it grows exponentially: to escape each backslash, you need two. So I guess I’m surprised that this is a post about a 1.5GB string and not the exponentially increasing work on each navigation causing performance problems, or the session crashing from running out of memory when a user navigates too far. Maybe the service wasn’t being used much yet or there’s some other reason that strings only grew to ridiculous sizes and not oom sizes. I’m curious why it didn’t exhibit sooner. (I saw a somewhat similar problem long ago but then it was more like doing ‘this.json = this.messages_array.to_json()’ on every new message, so merely quadratic. I think it was noticed not long after we first had a lot of messages added to one of the objects)
So we see that in C, the escape character for strings delivered to the C compiler is \, while the escape character for strings delivered to printf is %.
It always boggles my mind that someone looked at this example of a working design and concluded that the escape character for a regular expression engine should be \, overlapping with the escape character for a literal string.
I guess the history wasn’t that deep.
Leaving aside the need to send a massive set of screen images for the moment, a severe issue with that protocol is the lack of compression. It doesn't need to use a quadtree[1], although that will significantly speed up the interface when sliding the tree depth to the screen action being taken. Even RLE[2], Huffman[3], or any other standard string compression[4][5] would solve this issue shown in the blog.
[1] https://en.wikipedia.org/wiki/Quadtree#Compressed_quadtrees
[2] https://en.wikipedia.org/wiki/Run-length_encoding
[3] https://en.wikipedia.org/wiki/Huffman_coding
[4] https://en.wikipedia.org/wiki/Data_compression#Image
[5] https://en.wikipedia.org/wiki/Lossless_compression#General_p...
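As a rough illustration of why even run-length encoding would tame this particular data, here is a sketch that stores runs as [character, count] pairs. The function names and the run representation are made up for the example, not a standard format.

```javascript
// Collapse consecutive repeats into [character, count] pairs.
function rleEncode(text) {
  const runs = [];
  for (const ch of text) {
    const last = runs[runs.length - 1];
    if (last && last[0] === ch) {
      last[1] += 1;
    } else {
      runs.push([ch, 1]);
    }
  }
  return runs;
}

// Expand the pairs back into the original string.
function rleDecode(runs) {
  return runs.map(([ch, count]) => ch.repeat(count)).join("");
}

// A deeply escaped quote: 1023 backslashes followed by one double quote.
const escapedQuote = "\\".repeat(1023) + '"';
const runs = rleEncode(escapedQuote);
console.log(runs.length, rleDecode(runs) === escapedQuote);
```

That hides the waste rather than removing it, since the decoded string is still the full size in memory.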
I don't think they do though?
let obj = { id: 1 }
for (let id = 2; id <= 10; id++) {
  obj = { id, previous: JSON.stringify(obj) }
}
If this sounds like a cons cell, that’s what it is! It’s also a linked list. The laziness they want to achieve (only de/serialize one screen at a time per navigation forward or back) is also straightforward.
- forward: serialize the current screen, trim the leading/trailing quote off the previous serialization, insert with a comma before the new trailing bracket
- back: parse history, head is your desired “previous” screen, tail is your now-“previous” screen’s history
There’s a tiny amount of overhead to this approach, but not nearly as much as repeatedly reserializing the same string to shoehorn it into a serialized structure that doesn’t let you cheat a little bit with well known start/end characters.
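One possible reading of the forward/back steps above, as a sketch: the history is kept as JSON text for a cons cell `[head, tail]`, with `null` for the empty list. The function names and the `[head, tail]` layout are assumptions for this example. Going forward splices the existing text in unquoted, so nothing is re-escaped.

```javascript
const EMPTY_HISTORY = "null";

// forward: wrap the new screen and the existing history text into a new cell.
// The old history is inserted as raw JSON, not as a quoted string.
function forward(historyJson, screen) {
  return "[" + JSON.stringify(screen) + "," + historyJson + "]";
}

// back: head is the screen to return to, tail is the remaining history.
function back(historyJson) {
  const [head, tail] = JSON.parse(historyJson);
  return { screen: head, historyJson: JSON.stringify(tail) };
}

let historyJson = EMPTY_HISTORY;
for (let id = 1; id <= 3; id++) {
  historyJson = forward(historyJson, { id });
}
const step = back(historyJson);
console.log(historyJson, step.screen, step.historyJson);
```

This version still parses the whole history on back; a lazier variant would locate the end of the head without parsing the tail.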
let obj = { id: 1 }
for (let id = 2; id <= 10; id++) {
  obj = { id, previous: obj }
}
While this is true, it bears mentioning that in a service handling many requests they all smear together memory-wise, so _any_ latency savings can also ease memory pressure (a memory allocation has an area: bytes × time). So for instance it can make sense to minimize memory held across an RPC boundary, or even to perform CPU optimizations if they would improve the latency,
> e.g. examining a memdump rather than just measuring overall memory utilization
A flame graph also would have highlighted this (without the need to look at what might be user data), and is usually where I start looking for memory optimization. I mean if you don't know what code is allocating lots of memory where would you even start anyway?
Additionally I've found that looking at outliers, while a good idea, can sometimes lead to wild goose chases if you're not careful. There can be pathological edge cases which are rare enough to not actually matter in practice.
A few remarks:
1. Team: Nitzan (the blog author, and an awesome dude!) was at the time the Production Engineer monitoring the top-line capacity metrics. The Java heapdump tooling was hacked together by E.A., with some tough bits by A.S., who's a force of nature. The bulk of the mitigation was carried out by Y.B. over several months. I was the person who analyzed a couple of memory dumps and framed URL strings as a worthwhile 80/20 goal.
2. The main difficulty in the mitigation project was that the pathological strings were accessed over hundreds of callsites using some semantics like
semanticallyUsefulURL = decodeURLString(urlStoredInStringForm)
What Y.B. ended up doing was
2.1. A lengthy build-up very carefully constructing an API to represent URLs in as compact a way as possible, and plugging it in where convenient,
2.2. Carrying out a massive automated rewrite at the source level ("codemod"), which is possible in Java but not for the faint of heart. I think he used some JetBrains tooling to get that going. I consider his work a tour de force.
I vaguely recall there being some modest CPU improvements in the process.
3. Organizational dynamics: the codebase was originally written by very competent people who did not work through that specific detail because it was not important for their original use case. However subsequently the code underwent several years of almost solely rewarding improvement in end-user metrics, leading to an overall inadequate state that was hurting the organization as it was hitting hard scaling limits. In fact, "engineering excellence" was not even a category for recognition at that time. In this climate I could not find my own voice as a SW professional and chose to quit after a little bit more than a year.
I have a pretty good memory, I'd be happy to give more context if appropriate, just ping me.
(Edited - added one more due prop!)
Anyway, the resulting memdump with many \\s remains my most audience-engaging slide to date, and I figured the non-company audience deserves to know. Good times.
On the surface, a java.lang.String should be a wrapper around a char[], whose maximum length is Integer.MAX_VALUE = 2 147 483 647. This is roughly how older implementations of JDK did things.
But as of JDK 9 and JEP 254, String's private field has type byte[], and each character uses either 1 byte if it's encodable in ISO 8859-1 or 2 bytes if it requires UTF-16.
This means that strings containing only ASCII characters can be up to about 2 billion in length, whereas strings with real Unicode content can only be up to about 1 billion in length. This is a bit of an unfortunate regression in functionality.
Seems like somewhere along the line reality outgrew the original requirements.
> creating a dedicated real stack with self-imposed size limits and reporting.
this is the actual problem, a clear design error if the entire screen "stack" is part of per-request or per-session state while also being unbounded
the backslashes encoding stuff is a symptom
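For what a "real stack with self-imposed size limits and reporting" might look like, a minimal sketch. The class name, the limit of 3 and the `dropped` counter are invented for illustration and say nothing about what the article's team actually built.

```javascript
// A back stack that forgets the oldest screens once a limit is reached
// and counts how many it dropped, so truncation can be reported.
class BoundedBackStack {
  constructor(limit) {
    this.limit = limit;
    this.screens = [];
    this.dropped = 0;
  }

  push(screen) {
    this.screens.push(screen);
    if (this.screens.length > this.limit) {
      this.screens.shift(); // discard the oldest entry
      this.dropped += 1;
    }
  }

  // Returns the most recent screen, or undefined when the stack is empty.
  back() {
    return this.screens.pop();
  }
}

const backStack = new BoundedBackStack(3);
for (let id = 1; id <= 5; id++) {
  backStack.push({ id });
}
console.log(backStack.screens.length, backStack.dropped);
```

With a cap like this, memory per session is bounded whatever the serialization format does.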
nothing i wrote is in any way controversial?
obviously you can't embed unbounded history in session state directly?
I read it as some service with a custom client application.
@keyframes intro {
  0% { opacity: 0 }
  100% { opacity: 1 }
}
Applied via: animation: intro 0.3s both;
animation-delay: 0.15s;
> I read a post about someone who found that their system had something like 1.2 GB strings full of backslashes because they were using JSON for internal state, and it kept escaping the " characters, so it turned into \\\\\\\\\\\\\\\\\\\\\\\\\" type of crap. That part was new to me, but the description of the rest of it seemed far too familiar.
> And I went... hey, I think I know that particular circus!
The previous session data probably needs to be deserialized from json to objects first before adding in the new session object. The whole container can then be serialized to json at one shot. This avoids the recursive escape encoding.
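A small sketch of that parse-then-serialize-once approach, assuming the session is stored as JSON text for an array of screens. `pushScreen` and the screen fields are hypothetical names.

```javascript
// Deserialize the stored session, add the new screen as a plain object,
// then serialize the whole container once.
function pushScreen(sessionJson, screen) {
  const screens = JSON.parse(sessionJson);
  screens.push(screen);
  return JSON.stringify(screens);
}

let sessionJson = "[]";
for (let id = 1; id <= 10; id++) {
  sessionJson = pushScreen(sessionJson, { id, path: "/screen/" + id });
}
console.log(sessionJson.length, sessionJson.includes("\\"));
```

Every screen is escaped exactly once, so the size grows linearly with the number of screens, at the cost of reparsing the whole session on each navigation.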
Genomics data has entered the chat.