A 1.5GB string

(blog.backslasher.net)

308 points · Backslasher · 3y ago · 138 comments

97 comments · 23 top-level

userbinator3y ago· 21 in thread

Ironically, not long ago there was an article about how hard string handling is in C, so programs tend to be written to avoid manipulating strings as much as possible. In other languages, where strings are easy to use, it encourages inefficiency like this. The term "stringly typed" also comes to mind.

JSON truly is the new XML, both in terms of advantages and disadvantages. I wish people would stop using it for everything and realise that using a "human-readable" format for data that is 99.999999% not going to be seen by a human is absolute insanity in terms of inefficiency.

ilyt3y ago

That's a massive exaggeration. XML is both much more complex (and bug-prone) to parse, and wastes more space.

And the problem here wasn't JSON in the first place anyway, it was

> The “back” stack is unlimited, meaning we’re saving more and more state until we explode

And storing previous state in serialized string in current state, instead of just having it stored natively. You'd have same problem if you tried to serialize <format> into a field in <format>

userbinator3y ago

> You'd have same problem if you tried to serialize <format> into a field in <format>

Length-delimited formats don't need (exponentially increasing) escaping.
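
To make the length-delimited point concrete, here is a minimal sketch assuming a netstring-like framing of `<length>:<payload>,`. The `frame` and `unframe` helpers are hypothetical, and the length is counted in UTF-16 code units for simplicity (real netstrings count bytes). Because the payload is never scanned for special characters, quotes and backslashes cost nothing extra at any nesting depth.

```javascript
// Hypothetical netstring-like framing: "<length>:<payload>,".
// Length is counted in UTF-16 code units here; real netstrings count bytes.
function frame(payload) {
  return `${payload.length}:${payload},`;
}

function unframe(framed) {
  const colon = framed.indexOf(':');
  const length = Number(framed.slice(0, colon));
  return framed.slice(colon + 1, colon + 1 + length);
}

// Nest a payload containing quotes and a backslash ten levels deep.
const original = 'say "hi" \\ bye';
let nested = original;
const sizes = [];
for (let depth = 0; depth < 10; depth++) {
  nested = frame(nested);
  sizes.push(nested.length);
}
console.log(sizes); // each level adds only the length prefix and delimiters
```

Each level adds a handful of characters, so growth is linear in the nesting depth rather than exponential.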

simplotek3y ago

> And the problem here wasn't JSON in the first place anyway

Yes it was. They stored state in a huge JSON dump which ended up working like a zip bomb by the way they had to escape a string. This alone resulted in an exponential increase in memory that quickly became a problem.

Think about it for a second: would they experience this issue if they stored an in-memory stack per user? They wouldn't.

Heck, it seems that they could live with their JSON choice if they replaced '/' with any other inescapable character and converted it back to '/' when reading.

deepsun3y ago

But why do you bring XML? The parent is against human-readable formats, and XML is one of them.

berkes3y ago

I'd argue that JSON is one of the least Human Readable serialization formats.

It lacks comments. Has no way to offer context. Cannot impose limitations, and normalizing needs additional, often poorly supported RFCs (JSON-LD or pointers).

It also is a poor non-human-readable format, as it lacks crucial types (datetime, decimal), is rather verbose (just imagine the terabytes of bandwidth wasted globally on all these commas and curly braces), and cannot be parsed streaming.

berniedurfee3y ago

Agreed. Lack of comments and lack of proper schema enforcement makes JSON a junk pile of a data format.

Definitely not an improvement over XML. Hopefully the next popular standard data format will learn from the past and take the best of previous attempts.

Someone3y ago

> as it lacks crucial types (datetime, decimal)

Decimal is there, isn’t it, as part of ‘number’, which lumps together fixed-size integers, IEEE floats and doubles, bigints and bigfloats, and leaves it to the parser to make the right guess as to how to interpret them (often assisted by having the programmer specify a target data type)?

Lumping all those together as ‘number’ makes for a much simpler grammar, but I think that’s not worth it for its negative effects on interoperability. How many json parsers can (easily) read a value of 1234567890123456890E0000000001 and write back the same string to a json file?

Apart from datetime, I’d add duration as a basic type, too, even though that has the problem of being imprecise. Users will want to specify durations such as “1m” or “1y” (is that 365 days? 365.2425 to correct for leap years? Does it depend on interpretation by the user? is it a lunar year of 354/355 days as in the Islamic calendar? Etc)

IshKebab3y ago

The lack of comments isn't really a big deal for serialisation. Nobody is really going to bother adding comments when data is serialised. You could do that with XML if you wanted and I have never seen a single occurrence.

The issue is that JSON ended up being used as a configuration format too, and comments are critical there.

JSON5 is the best answer I have found.

koito173y ago

EDN at least provides date literals, UUID literals, and decimals. It also does not require commas to be littered everywhere. Unfortunately EDN is only really well-supported within Clojure. I'm not sure what is meant by "context problem" for JSON, but keywords and symbols in EDN can be namespaced, so that one can distinguish e.g. :human/name and :computer/name. However, it still requires you to write a schema. And due to lack of support in other languages, protocol buffers seem to be a much better choice.

codeflo3y ago

1. I 100% agree about comments.

2. I’m still not convinced it’s the file format’s job to encode every possible data type. We used to have deserialization layers for that, but for some reason, we don’t do that anymore. Nowadays we expect every file format to be self-describing, which I think is wasteful and limiting. But that’s a matter of opinion, obviously.

3. What do you mean by “cannot be parsed streaming”? Or to be perhaps more precise, what’s missing in existing streaming JSON parsers? I haven’t used any of them yet.

kordlessagain3y ago

Then I guess the same is for all these dicts I use in Python.

timcobb3y ago

But in systems where JSON is used inefficiently, it’s rarely (with some notable exceptions [0]) the performance-limiting factor.

[0] https://news.ycombinator.com/item?id=26296339

pohuing3y ago

That one wasn't even down to JSON, but rather some C stdlib functions being linear time because of C strings (which then compounded into quadratic), right?

Culonavirus3y ago

> using a "human-readable" format for data that is 99.999999% not going to be seen by a human is absolute insanity in terms of inefficiency.

Database dumps say hello. :)

alex_sf3y ago

If you want human-readable dumps, there's no reason you can't convert the more-efficient format into something human-readable.

flandish3y ago

If you want human readable db output, I might introduce you to what I’ve spent most of my career doing: etl. :)

afloyd3y ago

Human readable formats should exist for more or less one purpose, interfacing with a human, eg configuration.

timcobb3y ago

Developing and debugging systems with JSON (vs, say, protocol buffers) is much easier… configuration is rarely interfaced with by humans vs. debug logs or dumps etc, which are always being looked at.

deepsun3y ago

But json is terrible for configuration, e.g. comments are absolutely required. Toml and Hocon are for configuration.

jmull3y ago

Developers are the humans (at least for now) that interface with JSON.

And devs don't care about just config files but API requests and responses and any other kind of structured/semi-structured communication and data.

jffhn3y ago

> using a "human-readable" format for data that is 99.999999% not going to be seen by a human is absolute insanity in terms of inefficiency.

In the encode/decode library we use, the generated toString were accidentally so close to json, that we made them exactly so (mostly had to add quotes). Now we can use json tooling on top of efficient messaging.

Vecr3y ago· 10 in thread

The formula to calculate the number of needed servers is wrong, as you can't have fractional servers. You should use the pigeonhole principle[0]

[0]: https://en.wikipedia.org/wiki/Pigeonhole_principle

nmilo3y ago

So round it, this is engineering, not math. The safety buffer factor accounts for any rounding errors anyways.

sweettea3y ago

The formula for number of needed servers seems like a fine approximation; I'm very confused how you could use the pigeonhole principle. Perhaps to say that if you have N servers, one of them has at least total_users/N users? But assuming you have a decent methodology for balancing users across servers, and can rebalance to use a new server when added, this has basically the same effects as the formula given so I'm clearly missing why you pointed this out.

Dylan168073y ago

Use it how? The actual pigeonhole principle would conclude that you have multiple users on at least one server, and nothing else.

And anything that calculates individual users is the wrong way to do this math.

tiler29150723y ago

As others have pointed out, for the author's discussion this fact doesn’t really matter.

However, what does matter is that the author probably is not estimating the average correctly, partially due to this bug. If they scaled without fixing their excessive string storage, they would probably find their estimate to be off.

Even with the fix I’d be surprised if the underlying distribution of data per user is normally distributed.

IncRnd3y ago

That can obviously be rounded. Nobody will really try to purchase 27% of a server. A different issue is that the formula should include "1+Safety Buffer". Otherwise it needs to be run twice, once with the safety buffer and once without. Even so, I didn't take the formulae on this page to be accurate so much as representational.

gvx3y ago

I assumed the "Safety Buffer" is really more of a "Safety Factor", and is equivalent to 1 + what you or I would call a Safety Buffer.

BackslasherOP3y ago

The real formula we used to order servers was more complex than this, and included regional distribution, requirement peaks etc. I can expand more on that process if this is interesting. However, for the context of this post, this crude approximation suffices IMO

sd93y ago

The only difference in this case is to take the ceiling. The formula seems fine as an approximation.
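
The thread does not reproduce the article's formula, so the following is only a sketch under an assumed form: total memory demand, scaled by `1 + safetyBuffer` as IncRnd suggests, divided by per-server memory, then rounded up with a ceiling as sd9 suggests. The function and every parameter name are hypothetical.

```javascript
// Assumed form of the estimate; all parameter names are hypothetical.
function serversNeeded({ users, memoryPerUserGB, memoryPerServerGB, safetyBuffer }) {
  const raw = (users * memoryPerUserGB * (1 + safetyBuffer)) / memoryPerServerGB;
  // You cannot order a fraction of a server, so round up.
  return Math.ceil(raw);
}

const estimate = serversNeeded({
  users: 1000,
  memoryPerUserGB: 0.5,
  memoryPerServerGB: 64,
  safetyBuffer: 0.2,
});
console.log(estimate); // 9.375 before rounding, so 10 servers
```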

cowthulhu3y ago

The formula/metric is clearly an approximation?

ilyt3y ago

It relies on common sense, something that you lack.

tgsovlerkhgsel3y ago· 7 in thread

A better fix could have been switching to some other (binary) serialization protocol (BSON or similar) - the problem here doesn't seem to have been the size of the history itself, but the exponential growth of the escaping.

crdrost3y ago

Or, y'know, just

    jo.put("previous", previous);

Rather than

    jo.put("previous", previous.toJson());

Presumably the reason they didn't do this is because they couldn't handle a recursive schema?

btown3y ago

This is reminiscent of some APIs I've had the great pleasure of needing to integrate with, where a SOAP API would take XML containing Username, Password, Method, Data... where Data was a string, itself containing barely-documented XML, but serialized as a string before being placed in the XML. You'd think that if they knew how to stand up a SOAP API, they'd know that XML is literally designed to be nested. Alas, such was not the case. I shudder to think about their security under the hood...

alephaleph3y ago

In the post they say they didn’t do this bc they’d just be trading long strings for deeply nested objects.

paulddraper3y ago

That would have also fixed it, though IDK what is necessarily "better."

Recursively serialized JSON, XML, whatever is a code smell in general.

(Note that outside certain examples like this, in general BSON is not smaller than JSON. You'd need something like protobufs. )

kccqzy3y ago

Recursively serialized data isn't a smell. How would you even serialize a tree structure without recursion?

dan-robertson3y ago

You’d still be doing quadratic work over a user’s session but perhaps navigation is infrequent enough for it to not matter.

benatkin3y ago

sighs in bencode

dan-robertson3y ago· 5 in thread

How does the number of backslashes grow as the string is repeatedly escaped?

  1 \
  2 \\
  4 \\\\
  8 \\\\\\\\

We see it grows exponentially: to escape each backslash, you need two. So I guess I’m surprised that this is a post about a 1.5GB string and not the exponentially increasing work on each navigation causing performance problems, or the session crashing from running out of memory when a user navigates too far. Maybe the service wasn’t being used much yet or there’s some other reason that strings only grew to ridiculous sizes and not oom sizes. I’m curious why it didn’t exhibit sooner. (I saw a somewhat similar problem long ago but then it was more like doing ‘this.json = this.messages_array.to_json()’ on every new message, so merely quadratic. I think it was noticed not long after we first had a lot of messages added to one of the objects)
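
The doubling in the table above can be reproduced with a small sketch. It assumes an escaper that only doubles backslashes (the `escapeBackslashes` helper is hypothetical; a real JSON serializer also escapes quotes and adds wrapping quotes, which makes things slightly worse).

```javascript
// Escape only backslashes, as a string escaper would on each re-serialization.
const escapeBackslashes = (s) => s.replace(/\\/g, '\\\\');

let s = '\\'; // a single backslash
const lengths = [s.length];
for (let round = 0; round < 10; round++) {
  s = escapeBackslashes(s);
  lengths.push(s.length);
}
console.log(lengths.join(' ')); // 1 2 4 8 16 32 64 128 256 512 1024
```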

thaumasiotes3y ago

The solution to this problem has always been known: don't use the same escape character in different formats. With non-overlapping escape characters, you don't need to escape more than once.

So we see that in C, the escape character for strings delivered to the C compiler is \, while the escape character for strings delivered to printf is %.

It always boggles my mind that someone looked at this example of a working design and concluded that the escape character for a regular expression engine should be \, overlapping with the escape character for a literal string.

MayeulC3y ago

This is also why I like the sed syntax, that allows you to pick any delimiter, though escapes are still \ for regexes.

MBCook3y ago

It only takes about 30 doublings to get to 1.5 billion from an initial \\ if I have my math right.

I guess the history wasn’t that deep.

MichaelZuo3y ago

I don't understand how this made it into production without anyone noticing the exponential if it's really a straight doubling each time.

donio3y ago

\ C-x ( C-a C-k C-y C-y C-x ) C-u 30 C-x e

usr11063y ago· 5 in thread

Did they buy this domain just for this blog posting about backslashes? No, the first HN submission is from 2017. Coincidence or an agenda?

rsstack3y ago

Coincidence :) I know him in real life and he has used that nickname for a decade+.

BackslasherOP3y ago

Ha, this is a nice coincidence. I was looking for a cool domain name a billion years ago and came up with Backslasher since I was just learning text processing in Bash/Perl. My lifelong goal is beating the same-named horror film in search results.

augusto-moura3y ago

Maybe the story is not that recent? I will bet on a coincidence though, pretty interesting nonetheless

ehPReth3y ago

what?

stingraycharles3y ago

The story is about backslashes. The domain is backslasher.net. Seems like a fun coincidence to mention.

IncRnd3y ago· 4 in thread

That's a case of not solving the problem.

Leaving aside the need to send a massive set of screen images for the moment, a severe issue with that protocol is the lack of compression. It doesn't need to use a quadtree[1], although that will significantly speed up the interface when sliding the tree depth to the screen action being taken. Even RLE[2], Huffman[3], or any other standard string compression[4][5] would solve this issue shown in the blog.

[1] https://en.wikipedia.org/wiki/Quadtree#Compressed_quadtrees

[2] https://en.wikipedia.org/wiki/Run-length_encoding

[3] https://en.wikipedia.org/wiki/Huffman_coding

[4] https://en.wikipedia.org/wiki/Data_compression#Image

[5] https://en.wikipedia.org/wiki/Lossless_compression#General_p...

Genbox3y ago

Efficient data handling is better than spending cycles on generating a lot of backslashes and then burning cycles on compressing them.

IncRnd3y ago

It's also good to properly manage transmission of that list of screens, even after you've fixed the backslashes. The article stated that they degraded the user experience by artificially limiting the string length instead of fixing the issue.

duskwuff3y ago

"Screens" is referring to the state of a multi-step UI (similar to the back/forward cache in a browser), not a bitmapped image.

IncRnd3y ago

Yes. In what way does that mean that the transmission size should not be compressed? Bitmapped images are not the only data structure whose size will shrink by removing redundancy.

NegativeLatency3y ago· 3 in thread

> Since JSON fields have quotes, and those quotes need to be escaped when stored in a string, they are stored as \"

I don't think they do though?

andrewmackrodt3y ago

They were serialising the previous object as a string (which may also contain a key called previous which is a string) so quoted values, backslashes and any other escape characters would be escaped. Over multiple screens this grows quickly, e.g. this simple struct which has only an id column grows to 2KB in only 9 steps:

    let obj = { id: 1 }

    for (let id = 2; id <= 10; id++) {
      obj = { id, previous: JSON.stringify(obj) }
    }

eyelidlessness3y ago

The problem as stated is more complicated than it needs to be, due to forcing the serialized structure to match the in memory structure. Their in memory structure is a singly linked list. With JSON as a serialization format, that’s probably better modeled as an array with the “previous” link stored as the head (history[0]), and its “previous” stored as the tail (history[1]).

If this sounds like a cons cell, that’s what it is! It’s also a linked list. The laziness they want to achieve (only de/serialize one screen at a time per navigation forward or back) is also straightforward.

- forward: serialize the current screen, trim the leading/trailing quote off the previous serialization, insert with a comma before the new trailing bracket

- back: parse history, head is your desired “previous” screen, tail is your now-“previous” screen’s history

There’s a tiny amount of overhead to this approach, but not nearly as much as repeatedly reserializing the same string to shoehorn it into a serialized structure that doesn’t let you cheat a little bit with well known start/end characters.
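
One possible reading of the cons-cell idea, as a rough sketch: the history is serialized as `[head, tail]`, where `tail` is itself a history or `null`. The names `pushScreen`, `popScreen` and `EMPTY` are hypothetical. Going forward splices the new screen in front of the existing serialization, so the old history is embedded as JSON rather than as a string and nothing gets re-escaped.

```javascript
// History is serialized as [head, tail]; an empty history is "null".
const EMPTY = 'null';

// forward: put the new screen in front of the existing serialization.
function pushScreen(historyJson, screen) {
  return `[${JSON.stringify(screen)},${historyJson}]`;
}

// back: head is the previous screen, tail is that screen's own history.
function popScreen(historyJson) {
  const cell = JSON.parse(historyJson);
  if (cell === null) return null;
  const [screen, tail] = cell;
  return { screen, historyJson: JSON.stringify(tail) };
}

let history = EMPTY;
for (let id = 1; id <= 3; id++) {
  history = pushScreen(history, { id, title: `screen "${id}"` });
}
const step = popScreen(history);
console.log(step.screen.id, step.historyJson.length);
```

Note that this simple `popScreen` parses the whole history rather than only the head, and the nesting depth still grows by one per screen, so it addresses the escaping blow-up but not the unbounded stack.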

NegativeLatency3y ago

Right, I was suggesting something like this (since we're already in JSON land, why leave it for string land):

    let obj = { id: 1 }

    for (let id = 2; id <= 10; id++) {
      obj = { id, previous: obj }
    }
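
To compare the two approaches side by side, here is a small sketch that builds both structures and measures the final serialized size. The `buildStringified` and `buildNested` helper names are made up for illustration.

```javascript
// previous stored as a JSON string: every level re-escapes the one before it.
function buildStringified(steps) {
  let obj = { id: 1 };
  for (let id = 2; id <= steps; id++) {
    obj = { id, previous: JSON.stringify(obj) };
  }
  return JSON.stringify(obj);
}

// previous stored as a nested object: serialized once, escaped never.
function buildNested(steps) {
  let obj = { id: 1 };
  for (let id = 2; id <= steps; id++) {
    obj = { id, previous: obj };
  }
  return JSON.stringify(obj);
}

for (const steps of [2, 5, 10]) {
  console.log(steps, buildStringified(steps).length, buildNested(steps).length);
}
```

At 10 steps the nested form stays under 200 characters while the stringified form is already in the thousands. The trade-off, as noted elsewhere in the thread, is deeper object nesting.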

missblit3y ago· 2 in thread

> Understanding we are memory-bound is crucial, as it directs our efforts towards reducing memory consumption in order to accommodate more users on the server.

While this is true, it bears mentioning that in a service handling many requests they all smear together memory-wise, so _any_ latency savings can also ease memory pressure (a memory allocation has an area: bytes × time). So for instance it can make sense to minimize memory held across an RPC boundary, or even to perform CPU optimizations if they would improve the latency.

> e.g. examining a memdump rather than just measuring overall memory utilization

A flame graph also would have highlighted this (without the need to look at what might be user data), and is usually where I start looking for memory optimization. I mean if you don't know what code is allocating lots of memory where would you even start anyway?

Additionally I've found that looking at outliers, while a good idea, can sometimes lead to wild goose chases if you're not careful. There can be pathological edge cases which are rare enough to not actually matter in practice.

thedufer3y ago

This service keeps around a lot of long-lived per-session state, which leads to a very different memory profile from a typical web service, like what you're describing. Your point is valid in general, but doesn't seem to be relevant here.

missblit3y ago

Yeah, but any excuse to talk about memory profiling! It's one of my favorite pastimes.

YouWhy3y ago· 2 in thread

Wow! So this is actually a drama I had a secondary role in.

A few remarks:

1. Team: Nitzan (the blog author, and an awesome dude!) was at the time the Production Engineer monitoring the top-line capacity metrics. The Java heapdump tooling was hacked together by E.A., with some tough bits by A.S., who's a force of nature. The bulk of the mitigation was carried out by Y.B. over several months. I was the person who analyzed a couple of memory dumps and framed URL strings as a worthwhile 80/20 goal.

2. The main difficulty in the mitigation project was that the pathological strings were accessed over hundreds of callsites using some semantics like

    semanticallyUsefulURL = decodeURLString(urlStoredInStringForm)

What Y.B. ended up doing was

2.1. A lengthy build-up very carefully constructing an API to represent URLs in as compact a way as possible, and plugging it in where convenient,

2.2. Carrying out a massive automated rewrite at the source level ("codemod"), which is possible in Java but not for the faint of heart. I think he used some JetBrains tooling to get that going. I consider his work a tour de force.

I vaguely recall there being some modest CPU improvements in the process.

3. Organizational dynamics: the codebase was originally written by very competent people who did not work through that specific detail because it was not important for their original use case. However subsequently the code underwent several years of almost solely rewarding improvement in end-user metrics, leading to an overall inadequate state that was hurting the organization as it was hitting hard scaling limits. In fact, "engineering excellence" was not even a category for recognition at that time. In this climate I could not find my own voice as a SW professional and chose to quit after a little bit more than a year.

I have a pretty good memory, I'd be happy to give more context if appropriate, just ping me.

(Edited - added one more due prop!)

BackslasherOP3y ago

A.S --> E.A, no? :) I decided to abstract away the URL part as it wasn't easily explainable and not important to the story. I figured the JSON-in-JSON part is interesting enough.

Anyway, the resulting memdump with many \\s remains my most audience-engaging slide to date, and I figured the non-company audience deserves to know. Good times.

YouWhy3y ago

You're fully right! I added them both - E.A. was the original toolmaker, and A.S. worked more down the chain. He definitely was the person who walked me through the finer points of retrieving the data.

nayuki3y ago· 2 in thread

I wrote some stress-test cases for a Java library recently which involved 1-GB and 2-GB strings.

On the surface, a java.lang.String should be a wrapper around a char[], whose maximum length is Integer.MAX_VALUE = 2 147 483 647. This is roughly how older implementations of JDK did things.

But as of JDK 9 and JEP 254, String's private field has type byte[], and each character uses either 1 byte if it's encodable in ISO 8859-1 or 2 bytes if it requires UTF-16.

This means that strings containing only ASCII characters can be up to about 2 billion in length, whereas strings with real Unicode content can only be up to about 1 billion in length. This is a bit of an unfortunate regression in functionality.

ungamedplayer3y ago

Naive question, why not just do the java equivalent of mmap and read the data from storage?

Seems like somewhere along the line reality outgrew the original requirements.

nayuki3y ago

https://docs.oracle.com/javase/8/docs/api/java/nio/MappedByt...

Phelinofist3y ago· 2 in thread

Why not just restrict the history to a configurable number of entries? Having like 20-50 is probably enough. It might not be 100% correct but the effort:use ratio seems good.

BackslasherOP3y ago

We ended up doing that. We decided that while we have to store the stack, it can't hold 20 years of history.

> creating a dedicated real stack with self-imposed size limits and reporting.
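
The thread gives no details of the stack that was actually built, so this is only a minimal sketch of what a size-limited "back" stack with reporting might look like. The `BoundedHistory` class, its `limit` parameter and the `onEvict` callback are all hypothetical.

```javascript
// Hypothetical bounded "back" stack: keeps at most `limit` screens and
// reports through a callback whenever the oldest entry is dropped.
class BoundedHistory {
  constructor(limit, onEvict = () => {}) {
    this.limit = limit;
    this.onEvict = onEvict;
    this.screens = [];
  }

  push(screen) {
    this.screens.push(screen);
    if (this.screens.length > this.limit) {
      this.onEvict(this.screens.shift()); // drop the oldest screen
    }
  }

  back() {
    return this.screens.pop(); // undefined when there is nothing to go back to
  }

  get size() {
    return this.screens.length;
  }
}

const evicted = [];
const history = new BoundedHistory(3, (screen) => evicted.push(screen.id));
for (let id = 1; id <= 5; id++) {
  history.push({ id });
}
console.log(history.size, evicted); // 3 [ 1, 2 ]
```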

scotty793y ago

With this escaping of escaped strings they are doubling the number of escapes each time which means they can reach 1GB with just 30 steps.

Dwedit3y ago· 2 in thread

Meanwhile, Firefox has big problems when you give it a 6MB Data URL, you get crashes or graphical glitches throughout the browser while it tries to draw the URL bar.

TechBro86153y ago

That seems like a weird oversight, since Firefox is normally quite good at rendering large assets. For example, I prefer viewing large SVG images (like those generated by dependency graphing tools) in Firefox, because unlike Chrome, it allows me to pan and zoom within the image.

chrisweekly3y ago

My kneejerk rxn is, this seems more about URL length than asset rendering.

gumby3y ago· 2 in thread

Why are they saving the JSON as a string in the first place, and not as a datastructure?

Scarbutt3y ago

It's a networked application.

iudqnolq3y ago

But they still have to serialize the JSON object containing the string.

I wonder if it was an "optimization" to avoid serializing full object every time it was transferred?

preseinger3y ago· 2 in thread

> So each screen has the previous screen the user visited, to allow the user to go “back” and get the exact screen they were in before (state, scrolling position, validation notices etc).

this is the actual problem, a clear design error if the entire screen "stack" is part of per-request or per-session state while also being unbounded

the backslashes encoding stuff is a symptom

preseinger3y ago

downvotes? what?

nothing i wrote is in any way controversial?

obviously you can't embed unbounded history in session state directly?

mkl3y ago

You can store a heck of a lot more states if you don't double the storage required with each additional state. That's the actual problem.

sour-taste3y ago· 2 in thread

It seems like just using the browser to store forward/backwards state with window.history would be easier than worrying about this server side. Maybe it was a legacy site that didn't work well with forward/back or a native app?

fake-name3y ago

Nothing in the article indicates that the client here is a browser.

I read it as some service with a custom client application.

BackslasherOP3y ago

This is correct :)

IngvarLynn3y ago· 2 in thread

Assuming 3 backslashes per iteration, one iteration per second - it would take 16 years to get to 1.5GB. This code problem looks deeper to me.

NieDzejkob3y ago

The growth is exponential - each backslash becomes two in the next iteration. Thus after n iterations we have 2^n - 1 backslashes, and we only need 30 iterations to hit a gigabyte (and that's assuming only one quotation mark in the original JSON).
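
The 2^n - 1 figure can be checked on small inputs with a sketch like the one below. It assumes an escaper that prefixes every quote and backslash with a backslash and, unlike a full JSON serializer, adds no wrapping quotes, so the counts are a lower bound. The helper names are made up.

```javascript
// Escape quotes and backslashes once, without adding wrapping quotes.
const escapeOnce = (s) => s.replace(/["\\]/g, (c) => '\\' + c);
const countBackslashes = (s) => (s.match(/\\/g) || []).length;

let s = '"'; // a single quotation mark
const counts = [];
for (let round = 1; round <= 10; round++) {
  s = escapeOnce(s);
  counts.push(countBackslashes(s)); // expected to equal 2^round - 1
}
console.log(counts.join(' '));

// Smallest number of iterations for which 2^n - 1 reaches 10^9 backslashes.
let needed = 0;
while (2 ** needed - 1 < 1e9) needed++;
console.log(needed); // 30
```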

cyanydeez3y ago

Every quote in a jsonstring requires backslashes. This is geometric explosion. You're assuming very simple state objects.

wadefletch3y ago· 1 in thread

What is the CSS effect as you enter the page? Are they streaming the CSS itself, or just an animation in the browser? Very cool effect.

pkage3y ago

It's a CSS animation from the Jekyll theme. Pulled from the minified CSS:

    @keyframes intro {
        0%   { opacity: 0 }
        100% { opacity: 1 }
    }

Applied via:

    animation: intro 0.3s both;
    animation-delay: 0.15s;

cratermoon3y ago

I'm currently consulting with a company in the aviation industry that is struggling to adapt a legacy-bound culture to modern software engineering techniques. Among other things they have no sense of systems-level concerns. I've recently identified that one of the most common user activities ends up executing the same expensive end-to-end query at least 4 times for each interaction, never bothering to memoize the results. While there is a rudimentary cache in place, it has not yet struck them as a problem that the same interaction involves at least 3 network requests per query even when the cache is warm. As this system is currently only handling a fraction of the total traffic it will be expected to manage when it's fully live, it's clearly a slow-burning fuse that will explode in their faces when they try to make the switchover complete.

anonymoushn3y ago

It seems like the memory used will grow only linearly if you escape " and \ as "\u0022" and "\u005c" respectively.
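
A quick sketch of why this would grow linearly: each `\uXXXX` escape contains exactly one backslash, so re-escaping replaces that one backslash with another six-character escape instead of doubling anything. The `escapeAsUnicode` helper is hypothetical, not an option of any particular JSON library.

```javascript
// Escape quotes and backslashes as \u0022 and \u005c respectively.
const escapeAsUnicode = (s) =>
  s.replace(/["\\]/g, (c) => (c === '"' ? '\\u0022' : '\\u005c'));

let s = '"'; // a single quotation mark
const lengths = [s.length];
for (let round = 0; round < 10; round++) {
  s = escapeAsUnicode(s);
  lengths.push(s.length);
}
console.log(lengths.join(' ')); // 1 6 11 16 21 26 31 36 41 46 51

// The result is still valid JSON string content and decodes back one level.
const oneLevelUp = JSON.parse(`"${s}"`);
console.log(oneLevelUp.length); // 46
```

Each round adds five characters per original special character, at the cost of a larger constant factor for shallow nesting.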

jldugger3y ago

https://rachelbythebay.com/w/2023/04/09/note/

> I read a post about someone who found that their system had something like 1.2 GB strings full of backslashes because they were using JSON for internal state, and it kept escaping the " characters, so it turned into \\\\\\\\\\\\\\\\\\\\\\\\\" type of crap. That part was new to me, but the description of the rest of it seemed far too familiar.

> And I went... hey, I think I know that particular circus!

ww5203y ago

Good analysis.

The previous session data probably needs to be deserialized from json to objects first before adding in the new session object. The whole container can then be serialized to json at one shot. This avoids the recursive escape encoding.

ben0x5393y ago

Wonder if it's worth putting special logic into json serialization libraries that politely raises some alarms when writing like a few hundred \ in a row.
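
As a rough sketch of what such an alarm could look like, here is a hypothetical wrapper around `JSON.stringify` (not a feature of any real JSON library) that warns when the output contains a long run of consecutive backslashes. The `maxRun` threshold and `warn` callback are assumptions.

```javascript
// Hypothetical wrapper: serialize, then warn on a suspicious backslash run.
function stringifyWithAlarm(value, { maxRun = 100, warn = console.warn } = {}) {
  const json = JSON.stringify(value);
  const longRun = new RegExp(`\\\\{${maxRun},}`); // maxRun or more backslashes in a row
  if (longRun.test(json)) {
    warn(`serialized output contains ${maxRun}+ consecutive backslashes; nested escaping?`);
  }
  return json;
}

// A string that has been re-serialized many times trips the alarm.
const warnings = [];
let value = 'a"b';
for (let round = 0; round < 8; round++) {
  value = JSON.stringify(value);
}
stringifyWithAlarm(value, { warn: (msg) => warnings.push(msg) });
console.log(warnings.length); // 1
```

Scanning the output costs an extra pass over the string, so a real implementation would more likely count runs while writing.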

sxv3y ago

> 1.5GB string

Genomics data has entered the chat.

j / k navigate · click thread line to collapse

138 comments

97 comments · 23 top-level

userbinator3y ago· 21 in thread

ilyt3y ago

That's a massive exaggeration. XML is both much more complex (and bug-prone) to parse, and wastes more space.

And the problem here wasn't JSON in the first place anyway, it was

> The “back” stack is unlimited, meaning we’re saving more and more state until we explode

And storing previous state in serialized string in current state, instead of just having it stored natively. You'd have same problem if you tried to serialize <format> into a field in <format>

userbinator3y ago

You'd have same problem if you tried to serialize <format> into a field in <format>

Length-delimited formats don't need (exponentially increasing) escaping.

simplotek3y ago

> And the problem here wasn't JSON in the first place anyway

Think about it for a second: would they experience this issue if they stored an in-memory stack per user? They wouldn't.

Heck, it seems that they could live with their JSON choice if they replaced '/' with any other inescapable character and converted it back to '/' when reading.

1 more reply

deepsun3y ago

But why do you bring XML? The parent is against human-readable formats, and XML is one of them.

berkes3y ago

I'd argue that JSON is one of the least Human Readable serialization formats.

It lacks comments. Has no way to offer context. Cannot impose limitations, and normalizing needs additional, often poorly supported RFCs (JSON-LD or pointers).

berniedurfee3y ago

Agreed. Lack of comments and lack of proper schema enforcement makes JSON a junk pile of a data format.

Definitely not an improvement over XML. Hopefully the next popular standard data format will learn from the past and take the best of previous attempts.

2 more replies

Someone3y ago

> as it lacks crucial types (datetime, decimal)

IshKebab3y ago

The issue is that JSON ended up being used as a configuration format too, and comments are critical there.

JSON5 is the best answer I have found.

koito173y ago

1 more reply

codeflo3y ago

1. I 100% agree about comments.

3. What do you mean by “cannot be parsed streaming”? Or to be perhaps more precise, what’s missing in existing streaming JSON parsers? I haven’t used any of them yet.

2 more replies

kordlessagain3y ago

Then I guess the same is for all these dicts I use in Python.

timcobb3y ago

But in systems where JSON is used inefficiently, it’s rarely (with some notable exceptions [0]) the a performance limiting factor.

[0] https://news.ycombinator.com/item?id=26296339

pohuing3y ago

That one wasn't even down to json, but rather some c stdlib functions being in linear time because c strings(which then compounded into quadratic) right?

Culonavirus3y ago

> using a "human-readable" format for data that is 99.999999% not going to be seen by a human is absolute insanity in terms of inefficiency.

Database dumps say hello. :)

alex_sf3y ago

If you want human-readable dumps, there's no reason you can't convert the more-efficient format into something human-readable.

flandish3y ago

If you want human readable db output, I might introduce you to what I’ve spent most of my career doing: etl. :)

afloyd3y ago

Human readable formats should exist for more or less one purpose, interfacing with a human, eg configuration.

timcobb3y ago

deepsun3y ago

But json is terrible for configuration, e.g. comments are absolutely required. Toml and Hocon are for configuration.

jmull3y ago

Developers are the humans (at least for now) that interface with JSON.

And devs don't just care about config files but also API requests and responses and any other kind of structured/semi-structured communication and data.

jffhn3y ago

>using a "human-readable" format for data that is 99.999999% not going to be seen by a human is absolute insanity in terms of inefficiency.

Vecr3y ago· 10 in thread

The formula to calculate the number of servers needed is wrong, as you can't have fractional servers. You should use the pigeonhole principle[0]

[0]: https://en.wikipedia.org/wiki/Pigeonhole_principle

nmilo3y ago

So round it, this is engineering, not math. The safety buffer factor accounts for any rounding errors anyways.

sweettea3y ago

Dylan168073y ago

Use it how? The actual pigeonhole principle would conclude that you have multiple users on at least one server, and nothing else.

And anything that calculates individual users is the wrong way to do this math.

tiler29150723y ago

As others have pointed out, for the author’s discussion this fact doesn’t really matter.

Even with the fix I’d be surprised if the underlying distribution of data per user is normally distributed.

IncRnd3y ago

gvx3y ago

I assumed the "Safety Buffer" is really more of a "Safety Factor", and is equivalent to 1 + what you or I would call a Safety Buffer.

BackslasherOP3y ago

sd93y ago

The only difference in this case is to take the ceiling. The formula seems fine as an approximation.
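
To make the rounding point concrete, here is a rough sketch. The function name, parameters, and numbers are made up for illustration and are not the formula from the post; the safety factor is assumed to be a plain multiplier (1 + buffer), as suggested elsewhere in this thread.

```javascript
// Hypothetical capacity estimate; every name and number here is an
// illustrative assumption, not the formula from the article.
function serversNeeded(users, memPerUserMb, memPerServerMb, safetyFactor) {
  const totalMb = users * memPerUserMb * safetyFactor;
  // You cannot run a fraction of a server, so round up.
  return Math.ceil(totalMb / memPerServerMb);
}

console.log(serversNeeded(1000, 3, 1024, 1.5)); // 5
```

Without the ceiling the same inputs give about 4.39, which is fine as an approximation but not something you can order.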

cowthulhu3y ago

The formula/metric is clearly an approximation?

ilyt3y ago

It relies on common sense, thing that you lack.

tgsovlerkhgsel3y ago· 7 in thread

crdrost3y ago

Or, y'know, just

    jo.put("previous", previous);

Rather than

    jo.put("previous", previous.toJson());

Presumably the reason they didn't do this is because they couldn't handle a recursive schema?

btown3y ago

alephaleph3y ago

In the post they say they didn’t do this bc they’d just be trading long strings for deeply nested objects.

paulddraper3y ago

That would have also fixed it, though IDK what is necessarily "better."

Recursively serialized JSON, XML, whatever is a code smell in general.

(Note that outside certain examples like this, in general BSON is not smaller than JSON. You'd need something like protobufs.)

kccqzy3y ago

Recursively serialized data isn't a smell. How would you even serialize a tree structure without recursion?

dan-robertson3y ago

You’d still be doing quadratic work over a user’s session but perhaps navigation is infrequent enough for it to not matter.
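
A rough way to see the quadratic cost, assuming the whole history gets re-serialized on every navigation. The state shape and the function name are made up for this sketch, and it uses plain nested objects, so no escaping is involved at all.

```javascript
// Assumption: each navigation wraps the previous state and the entire
// history is serialized again. Sum up how many characters get written.
function totalSerializedChars(navigations) {
  let state = { screen: "home" };
  let total = 0;
  for (let i = 0; i < navigations; i++) {
    state = { screen: "page", previous: state };
    total += JSON.stringify(state).length; // size grows linearly per step
  }
  return total;
}

// Doubling the number of navigations roughly quadruples the total work.
console.log(totalSerializedChars(100), totalSerializedChars(200));
```

Each individual payload stays small, so this only hurts if sessions get very long.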

benatkin3y ago

sighs in bencode

dan-robertson3y ago· 5 in thread

How does the number of backslashes grow as the string is repeatedly escaped?

  1 \
  2 \\
  4 \\\\
  8 \\\\\\\\
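
If you want to check the table yourself, something like this should do it. It uses JSON.stringify as a stand-in for whatever serializer the app used (an assumption), and the helper names are made up.

```javascript
// Escape a string the way JSON does when embedding it as a string value,
// minus the surrounding quotes.
function escapeOnce(s) {
  return JSON.stringify(s).slice(1, -1);
}

// Count the backslashes before each of `rounds` rounds of escaping.
function backslashGrowth(seed, rounds) {
  const counts = [];
  let s = seed;
  for (let i = 0; i < rounds; i++) {
    counts.push((s.match(/\\/g) || []).length);
    s = escapeOnce(s);
  }
  return counts;
}

console.log(backslashGrowth("\\", 4)); // [ 1, 2, 4, 8 ]
console.log(backslashGrowth('"', 5));  // [ 0, 1, 3, 7, 15 ]
```

A lone backslash doubles every round; a quote picks up 2^n - 1 backslashes after n rounds, which is the same growth rate.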

thaumasiotes3y ago

The solution to this problem has always been known: don't use the same escape character in different formats. With non-overlapping escape characters, you don't need to escape more than once.

So we see that in C, the escape character for strings delivered to the C compiler is \, while the escape character for strings delivered to printf is %.

MayeulC3y ago

This is also why I like the sed syntax, that allows you to pick any delimiter, though escapes are still \ for regexes.

MBCook3y ago

It only takes about 30 doublings to get to 1.5 billion from an initial \\ if I have my math right.

I guess the history wasn’t that deep.
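
Quick sanity check of that math, assuming a clean doubling per level and nothing else in the string (the function name is made up):

```javascript
// Count how many doublings it takes for `start` to reach at least `target`.
function doublingsToReach(start, target) {
  let size = start;
  let doublings = 0;
  while (size < target) {
    size *= 2;
    doublings++;
  }
  return doublings;
}

console.log(doublingsToReach(2, 1.5e9)); // 30
```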

MichaelZuo3y ago

I don't understand how this made it into production without anyone noticing the exponential if it's really a straight doubling each time.

donio3y ago

\ C-x ( C-a C-k C-y C-y C-x ) C-u 30 C-x e

usr11063y ago· 5 in thread

Did they buy this domain just for this blog posting about backslashes? No, the first HN submission is from 2017. Coincidence or an agenda?

rsstack3y ago

Coincidence :) I know him in real life and he has used that nickname for a decade+.

BackslasherOP3y ago

augusto-moura3y ago

Maybe the story is not that recent? I will bet on a coincidence though, pretty interesting nonetheless

ehPReth3y ago

what?

stingraycharles3y ago

The story is about backslashes. The domain is backslasher.net. Seems like a fun coincidence to mention.

IncRnd3y ago· 4 in thread

That's a case of not solving the problem.

[1] https://en.wikipedia.org/wiki/Quadtree#Compressed_quadtrees

[2] https://en.wikipedia.org/wiki/Run-length_encoding

[3] https://en.wikipedia.org/wiki/Huffman_coding

[4] https://en.wikipedia.org/wiki/Data_compression#Image

[5] https://en.wikipedia.org/wiki/Lossless_compression#General_p...

Genbox3y ago

Efficient data handling is better than spending cycles on generating a lot of backslashes and then burn cycles on compressing it.

IncRnd3y ago

duskwuff3y ago

"Screens" is referring to the state of a multi-step UI (similar to the back/forward cache in a browser), not a bitmapped image.

IncRnd3y ago

Yes. In what way does that mean that the transmission size should not be compressed? Bitmapped images are not the only data structure whose size will shrink by removing redundancy.

NegativeLatency3y ago· 3 in thread

> Since JSON fields have quotes, and those quotes need to be escaped when stored in a string, they are stored as \"

I don't think they do though?

andrewmackrodt3y ago

    let obj = { id: 1 }

    for (let id = 2; id <= 10; id++) {
      obj = { id, previous: JSON.stringify(obj) }
    }

eyelidlessness3y ago

- forward: serialize the current screen, trim the leading/trailing quote off the previous serialization, insert with a comma before the new trailing bracket

- back: parse history, head is your desired “previous” screen, tail is your now-“previous” screen’s history

NegativeLatency3y ago

Right, I was suggesting something like this (since we're already in JSON land, why leave it for string land):

    let obj = { id: 1 }

    for (let id = 2; id <= 10; id++) {
      obj = { id, previous: obj }
    }
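
To compare the two variants side by side, here is a small sketch that builds the same history both ways and prints the serialized sizes. The function name and the flag are made up for illustration.

```javascript
// Build a history of `depth` screens and serialize it. With embedAsString
// set, each previous state is stored as a JSON string inside the next one;
// otherwise it is stored as a nested object.
function serializedHistory(depth, embedAsString) {
  let obj = { id: 1 };
  for (let id = 2; id <= depth; id++) {
    obj = { id, previous: embedAsString ? JSON.stringify(obj) : obj };
  }
  return JSON.stringify(obj);
}

for (const depth of [5, 10, 15]) {
  console.log(
    depth,
    serializedHistory(depth, true).length,  // string-in-string: blows up
    serializedHistory(depth, false).length  // nested objects: grows linearly
  );
}
```

The nested version does have the deep-nesting trade-off mentioned elsewhere in the thread, so a parser with a low depth limit may still object.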

missblit3y ago· 2 in thread

> Understanding we are memory-bound is crucial, as it directs our efforts towards reducing memory consumption in order to accommodate more users on the server.

> e.g. examining a memdump rather than just measuring overall memory utilization

thedufer3y ago

missblit3y ago

Yeah, but any excuse to talk about memory profiling! It's one of my favorite pastimes.

YouWhy3y ago· 2 in thread

Wow! So this is actually a drama I had a secondary role in.

A few remarks:

2. The main difficulty in the mitigation project was that the pathological strings were accessed over hundreds of callsites using some semantics like

semanticallyUsefulURL = decodeURLString(urlStoredInStringForm)

What Y.B. ended up doing was

2.1. A lengthy build-up very carefully constructing an API to represent URLs in as compact a way as possible, and plugging it in where convenient,

I vaguely recall there being some modest CPU improvements in the process.

I have a pretty good memory, I'd be happy to give more context if appropriate, just ping me.

(Edited - added one more due prop!)

BackslasherOP3y ago

A.S --> E.A, no? :) I decided to abstract away the URL part as it wasn't easily explainable and not important to the story. I figured the JSON-in-JSON part is interesting enough.

Anyway, the resulting memdump with many \\s remains my most audience-engaging slide to date, and I figured the non-company audience deserves to know. Good times.

YouWhy3y ago

nayuki3y ago· 2 in thread

I wrote some stress-test cases for a Java library recently which involved 1-GB and 2-GB strings.

On the surface, a java.lang.String should be a wrapper around a char[], whose maximum length is Integer.MAX_VALUE = 2 147 483 647. This is roughly how older implementations of JDK did things.

But as of JDK 9 and JEP 254, String's private field has type byte[], and each character uses either 1 byte if it's encodable in ISO 8859-1 or 2 bytes if it requires UTF-16.

ungamedplayer3y ago

Naive question, why not just do the java equivalent of mmap and read the data from storage?

Seems like somewhere along the line reality outgrew the original requirements.

nayuki3y ago

https://docs.oracle.com/javase/8/docs/api/java/nio/MappedByt...

Phelinofist3y ago· 2 in thread

Why not just restrict the history to a configurable number of entries? Having like 20-50 is probably enough. It might not be 100% correct but the effort:use ratio seems good.
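
A capped history along these lines could look roughly like this. The class name, the default of 50, and the reporting hook are all assumptions for the sketch, not anything from the post.

```javascript
// Hypothetical capped "back" stack: keeps at most maxEntries screens and
// reports whenever it has to forget the oldest one.
class BoundedHistory {
  constructor(maxEntries = 50, report = console.warn) {
    this.maxEntries = maxEntries;
    this.report = report;
    this.entries = [];
    this.dropped = 0;
  }

  push(screen) {
    this.entries.push(screen);
    if (this.entries.length > this.maxEntries) {
      this.entries.shift(); // forget the oldest screen
      this.dropped++;
      this.report(`history full: dropped ${this.dropped} old screen(s)`);
    }
  }

  back() {
    return this.entries.pop(); // undefined when there is nothing left
  }
}

const stack = new BoundedHistory(3, () => {});
["a", "b", "c", "d"].forEach((s) => stack.push(s));
console.log(stack.back(), stack.dropped); // d 1
```

The user loses the ability to go back past the cap, which is the "not 100% correct" part.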

BackslasherOP3y ago

We ended up doing that. We decided that while we have to store the stack, it can't hold 20 years of history.

> creating a dedicated real stack with self-imposed size limits and reporting.

scotty793y ago

With this escaping of escaped strings they are doubling the number of escapes each time which means they can reach 1GB with just 30 steps.

Dwedit3y ago· 2 in thread

Meanwhile, Firefox has big problems when you give it a 6MB Data URL, you get crashes or graphical glitches throughout the browser while it tries to draw the URL bar.

TechBro86153y ago

chrisweekly3y ago

My kneejerk rxn is, this seems more about URL length than asset rendering.

gumby3y ago· 2 in thread

Why are they saving the JSON as a string in the first place, and not as a datastructure?

Scarbutt3y ago

It's a networked application.

iudqnolq3y ago

But they still have to serialize the JSON object containing the string.

I wonder if it was an "optimization" to avoid serializing full object every time it was transferred?

preseinger3y ago· 2 in thread

> So each screen has the previous screen the user visited, to allow the user to go “back” and get the exact screen they were in before (state, scrolling position, validation notices etc).

this is the actual problem, a clear design error if the entire screen "stack" is part of per-request or per-session state while also being unbounded

the backslashes encoding stuff is a symptom

preseinger3y ago

downvotes? what?

nothing i wrote is in any way controversial?

obviously you can't embed unbounded history in session state directly?

mkl3y ago

You can store a heck of a lot more states if you don't double the storage required with each additional state. That's the actual problem.

sour-taste3y ago· 2 in thread

fake-name3y ago

Nothing in the article indicates that the client here is a browser.

I read it as some service with a custom client application.

BackslasherOP3y ago

This is correct :)

IngvarLynn3y ago· 2 in thread

Assuming 3 backslashes per iteration, one iteration per second - it would take 16 years to get to 1.5GB. This code problem looks deeper to me.

NieDzejkob3y ago

cyanydeez3y ago

Every quote in a JSON string requires backslashes. This is geometric explosion. You're assuming very simple state objects.

wadefletch3y ago· 1 in thread

What is the CSS effect as you enter the page? Are they streaming the CSS itself, or just an animation in the browser? Very cool effect.

pkage3y ago

It's a CSS animation from the Jekyll theme. Pulled from the minified CSS:

    @keyframes intro {
        0%   { opacity: 0 }
        100% { opacity: 1 }
    }

Applied via:

    animation: intro 0.3s both;
    animation-delay: 0.15s;

cratermoon3y ago

anonymoushn3y ago

It seems like the memory used will grow only linearly if you escape " and \ as "\u0022" and "\u005c" respectively.
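
A quick sketch of the idea, with made-up helper names. The escaper below is a hand-rolled assumption, not an option of JSON.stringify; it writes quotes, backslashes, and control characters as \uXXXX so a backslash never turns into two backslashes.

```javascript
// Escape ", \ and control characters as \uXXXX escapes only.
function escapeWithUnicode(s) {
  return s.replace(/["\\\u0000-\u001f]/g, (c) =>
    "\\u" + c.charCodeAt(0).toString(16).padStart(4, "0"));
}

// Record the string length before and after each round of escaping.
function lengthsAfterRounds(seed, rounds) {
  const lengths = [seed.length];
  let s = seed;
  for (let i = 0; i < rounds; i++) {
    s = escapeWithUnicode(s);
    lengths.push(s.length);
  }
  return lengths;
}

console.log(lengthsAfterRounds('"', 3)); // [ 1, 6, 11, 16 ]
```

Each round adds a constant 5 characters per original quote, and the result still parses back with a normal JSON parser once wrapped in quotes.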

jldugger3y ago

https://rachelbythebay.com/w/2023/04/09/note/

> And I went... hey, I think I know that particular circus!

ww5203y ago

Good analysis.

ben0x5393y ago

Wonder if it's worth putting special logic into json serialization libraries that politely raises some alarms when writing like a few hundred \ in a row.
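
Something like this wrapper, maybe. It is only a sketch: the function names and the threshold of 100 are made up, and no JSON library I know of ships such a check.

```javascript
// Length of the longest run of consecutive backslashes in a string.
function longestBackslashRun(s) {
  let longest = 0;
  let current = 0;
  for (const ch of s) {
    if (ch === "\\") {
      current++;
      if (current > longest) longest = current;
    } else {
      current = 0;
    }
  }
  return longest;
}

// Serialize as usual, but complain if the output looks like escaped escapes.
function stringifyWithAlarm(value, threshold = 100, warn = console.warn) {
  const out = JSON.stringify(value);
  const run = longestBackslashRun(out);
  if (run >= threshold) {
    warn(`suspicious JSON: ${run} backslashes in a row`);
  }
  return out;
}

let nested = '"';
for (let i = 0; i < 8; i++) nested = JSON.stringify(nested);
stringifyWithAlarm({ previous: nested }); // triggers the warning
```

Scanning the whole output costs an extra pass, so in practice you would probably sample it or only run it in staging.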

sxv3y ago

> 1.5GB string

Genomics data has entered the chat.
