The fastest tool for querying large JSON files is written in Python (benchmark) (opens in new tab)

(colab.research.google.com)

175 pointsdmoura4y ago84 comments

84 comments

72 comments · 22 top-level

orlp4y ago· 22 in thread

Allow me a moment to hijack this thread. I'm a PhD student looking to make analytics of JSON data a first-class citizen in https://duckdb.org/. I'm still very much in a literature study stage, so I'd love to hear your success (or failure) stories about analyzing data stored in JSON. E.g. I'd love to hear about:

* Which dialect for querying worked the best for you?

* Which tools do you enjoy the most? This could be for any reason: convenience of install/availability, expressivity, raw speed, other niche features, etc.

* What sort of JSON "meta-formats" are the most important/common for you? E.g. in a file you could have object-per-line, object-of-arrays, array-of-objects, or in an SQL context you could have object-per-row or object-of-arrays-as-table, etc). I'd love to hear about others that are important to you.

* What do you feel is the biggest gap/missing feature in currently available tools?

sandGorgon4y ago

Store the json in postgresql as JSONB.

Not trying to flame - I do realise that duckdb is an alternative, but everyone in this space does this

Especially for data scientists, etc. The tooling flow around dbt, pandas, jupyter, postgresql and json is rock solid.

boredumb4y ago

I can second that using JSONB with postgres is fast and easy, the features and operators to query json structures are also about as extensive as you'll ever need.

1 more reply

orlp4y ago

> Not trying to flame

Please, flame away. I'm not here to put up a fight for one approach or another, simply interested in hearing what people think works well and what doesn't; what they'd expect/want.

1 more reply

bachmeier4y ago

Wouldn't that require setting up and running a postgres server? DuckDB is an sqlite replacement.

1 more reply

PaulHoule4y ago

arangodb for the win. I've done data analytics in it and OLTP.

You never hear hype about arangodb because arangodb users see it as a "secret weapon" and want their competitors to use mongo.

1 more reply

klabb34y ago

> * What sort of JSON "meta-formats" are the most important/common for you? E.g. in a file you could have object-per-line, [...]

Yes (also known as jsonlines or line-separated JSON - .jsonl is the file extension).

I think JSON should be thought of as a message format, and messages should be small. Messages can be sent over a stream, without having the "whole" data-set. You also get append-only semantics on files (saving disk-space and virtually creating a stupid write-ahead log become as cheap & hassle-free as it gets), and if you need it you get resumability (say binary search on a file with monotonically increasing ids/timestamps etc).

Additionally, this format will work with minimal hickups when interoperating with other systems. YOU may have an advanced, efficient parser, but others may not. Most environments have a line-parser and a json deserializer. Large JSON-files become hard-to-manage and prompts for more and more complex query-like tools.

And all of these good properties from such a ridiculously simple tech that it can be explained in a short sentence.

Downsides:

- Compression at the file level removes resumability (unless your compression is also resumable).

- You may have to implement your own log compaction if you have many updates/deletes.

gavinray4y ago

  > * Which dialect for querying worked the best for you?

If this is SQL dialects, I am partial to both the Postgres JSON/JSONB operators, SQLite has a solid JSON implementation too.

  > * Which tools do you enjoy the most? This could be for any reason: convenience of install/availability, expressivity, raw speed, other niche features, etc.

"Datasette" (from Django co-creator) can take tabular data (SQLite, CSV, JSON, etc) and generate a REST/GraphQL API with visualization tools from it:

https://github.com/simonw/datasette

From the same author, "sqlite-utils" generate SQLite table definitions and rows from similar:

https://github.com/simonw/sqlite-utils

I find this useful outside of SQLite because the syntax is similar across SQL databases. Great way to bootstrap a DB definition from existing datasets.

"Pipe JSON (or CSV or TSV) directly into a new SQLite database file, automatically creating a table with the appropriate schema"

  > * What sort of JSON "meta-formats" are the most important/common for you? E.g. in a file you could have object-per-line, object-of-arrays, array-of-objects, or in an SQL context you could have object-per-row or object-of-arrays-as-table, etc). I'd love to hear about others that are important to you.

For file, either JSONL/NDJSON (JSON object per line) if it's large, or a single array of objects if it's small.

In a SQL context, object-per-row undoubtedly

  > * What do you feel is the biggest gap/missing feature in currently available tools?

Don't have anything useful on this one unfortunately, but I'm not an analytics/data science person =(

dimatura4y ago

I find myself analyzing a piles of json files, or large json files now and then, usually for one-off or infrequent analyses. If I'll never touch this data again, then usually the first thing I do is collect all the data and do whatever is needed to put it in a pandas dataframe, and then throw away the original data. Thankfully I haven't had an issue with the RAM required for this in the last few years. (working with 64GB ram).

If I think I'll have to do at least a few more analyses in the future, perhaps with a growing dataset, I'll usually put the data into Sqlite. If possible I try to keep it simple, with a single table, even if it means a non-normalized schema. As for tooling I typically go with `dataset`, an sqlalchemy wrapper that's super simple to use, and makes it possible to also use raw sql if I need to. I haven't fully explored the JSON capabilities of sqlite itself, but have been meaning to. If duckdb gets similar features that would be certainly worth looking into.

In terms of "meta-format", I usually like object-per-line and array of objects. Easy to add more records, pretty self-explanatory. Maybe inefficient but if that becomes an issue then it's time to move away from just JSON.

In regards to querying language, I usually don't do anything too complicated so I don't think much about it. Having SQL (as when using dataset) is nice. I also have to use mongodb for some tasks, and I also find that query language good enough for most things I need to do, since I'm not usually dealing with highly relational data.

ellimilial4y ago

If it fits on a single machine - jq, flat files, JSON lines / avro if relatively flat. Change to a tabular format if when nesting not required.

Postgres JSONB works, but it requires maintaining a heavy server process. So does Lucene/elasticsearch.

I have been yearning for embeddable store (in line with SQLite the support that both works and also keeps the data compressed like JSONB). I know there were some attempts, tried some of it those, mostly monstrosities).

cryptonector4y ago

JSONB is incredibly awesome, and should be extracted from PG and made usable on its own.

For those who don't know, JSONB is a binary JSON encoding that is specifically optimized for data at rest and compression thereof.

The key feature in JSONB is that most internal pointers [from arrays and objects] to values are in the form of lengths, with every 32nd pointer being an offset. This comes from the observation that offsets will not repeat, therefore are difficult to compress w/ off the shelf compression algorithms, but length values will often be the same and thus be compressible. This means that iterating an array (say) requires 31 additions for every 32 elements to recover the offsets to those 31 elements' values.

The story of how they came to this optimization for compression is fascinating. IIRC they implemented an offsets-only JSONB and were very happy with it until they discovered that that form of JSONB did not compress anywhere near as well as expected, and since PG was close to shipping, a feverish hunt for the cause ensued that culminated in the fix of mostly-using-lengths-instead-of-offsets.

1 more reply

galdosdi4y ago

I cannot describe how much I love jq. Best new (to me) tool I discovered in all of 2020.

Once you get the hang of it the syntax feels extremely powerful. The only other thing it reminded me of is the first time I learned enough SQL to be dangerous.

You really, really are going to want to check jq out at least a little if you want to improve the state of the art in this area. It has an excellent manual btw.

Edit: you ask about "metaformats" such as object per newline. jq handles this well too

carlineng4y ago

Snowflake’s handling of JSON data via the “variant” data type is quite nice to work with, both in terms of speed and convenience of the access methods.

geoduck144y ago

I'll add to this:

The variant column is useful for "let's dump this data into this column and separate it out in the nextstep"

Snowflake comes with a couple of methods of parsing semi structured data like JSON, XML, and even "custom" formats

We are playing around with Snowflake at work right now and I quite like it

iknownothow4y ago

I decided not to introduce DuckDB for my org, 6 team members and <10TB of data solely based on lack of full support for Postgres like JSONB. Currently, we are happy using Postgresql JSONB with a small performance penalty . I would be loathe to introduce a tool which requires a new syntax for handling JSONs.

I know this is a big ask, but if DuckDB can be in lock step with Postgres on JSONB handling :french-chef-kiss:

As far as I am aware (haven't researched), the standard way to get logs or metadata out of AWS Cloudwatch is in a JSON lines [1] format. Therefore, even if I were to write a custom JSON output format for my custom service, I would try my best to stick to the JSON lines format since I would have already written parsers and data connectors for AWS Cloudwatch. That's just me though.

[1] https://jsonlines.org/

peterohler4y ago

For me OjG (https://github.com/ohler55/ojg) has been great. I regularly use it on files that can not be loaded into memory. The best JSON file format for multiple record is one JSON document per record all in the same file. OjG doesn't care if they are on different lines. It is fast (https://github.com/ohler55/compare-go-json) and uses a fairly complete JSONPath implementation for searches. Similar to jq but using JSONPath instead of a proprietary query language.

I am biased though as I wrote OjG to handle what other tools were not able to do.

dmouraOP4y ago

Are you able to calculate aggregates, like an average?

jamal-kumar4y ago

I've written programs for digesting, parsing, and querying data out of many gigabytes of JSON REALLY REALLY fast by not doing any unnecessary memory allocation at all. Keeping everything into streaming APIs and out of that particular bottleneck is the key to doing this, really just no malloc, and that's what this article seems to be about kind of... When you manage to avoid that you can do it in pretty much any language. AWK/SED is perfect for when it comes to field-seperated data like CSV, these tricks are really old! When it comes to JSON I've just written my own stuff, though, because that nested-tree-structure is bound to be different for each data set.

jimsparkman4y ago

Since duckdb is OLAP, it would directly compare with other columnar database technologies like Redshift, Presto/Athena, etc.

Most of these systems strongly encourage or outright enforce JSONL, so that’s the defacto standard, and most tooling or pipelines are going to generate that nowadays.

You can obviously still have a row of arrays, and different systems have slightly different approaches on how to deal with those. In Spark, this is referred to as “exploding”, in Presto you would cross join to unnest an array, in Redshift you can glob on the super type.

I’m not sure I have a particular favorite, only that the database support such an operation since it is a common occurrence.

infinite8s4y ago

I have worked with JSON in SQLServer, and I agree with the sibling comment that JSONB is the ideal storage format. SLQServer's JSON querying functions are pretty straightforward, although they do expect that you know the schema of the json data you are querying.

As to JSON metaformat (I assume you mean for ingestion), a format of an array of objects works best.

For the last question, the main gap I find is trying to understand the 'structure' of json data that I didn't generate directly. There aren't great tools to show whether all the JSON records have the same nested structure with similar key names, etc.

zX41ZdbW4y ago

That's great, but you don't have to finish your PhD. The task is already solved in ClickHouse: https://colab.research.google.com/drive/1pGmfpNdGyzE1KEXDnSS...

dorianmariefr4y ago

jq for quick lookup, jsonb for actual storage, indexing jsonb is a pain though

cryptonector4y ago

JSONB is optimized for compression, but also pessimized for writing.

What we should do is teach jq to read and write various binary JSON formats including JSONB.

dmouraOP4y ago· 7 in thread

Please take this claim and these results with a pinch of salt. spyql was not created with the goal of being the fastest tool for querying data, and it might be the case that the same tools with different datasets or in different use-cases outperform spyql. There might also be other tools that I was not aware when I wrote the benchmark (I just learned about a new one that we will be adding to the benchmark).

For me the lesson was that in certain problems (e.g. I/O intensive) the architecture/design might have a higher impact than the choice of the programming language.

spyql can both leverage the python standard lib for parsing json (written in C) as well as orjson (written in Rust). In this benchmark we used the later, which shows considerable performance improvements. Still, query processing (expression evaluation, filtering, aggregations, etc) are implemented in Python. I guess it's in the nature of Python to leverage internal/external modules written in a statically-typed compiled language to deliver high perfomance on core functionalities.

Here is a simple experiment with a 1GB file that shows that JSON decoding takes less than 40% of the processing time:

    !spyql "SELECT avg_agg(json->overall) FROM orjson" < books.json

    avg_agg_overall
    4.31181166791025 
    time: 11.7 s (started: 2022-04-13 23:37:07 +00:00)


    import orjson as json
    acc = 0 
    cnt = 0 
    with open('books.json') as f: 
      for line in f: 
        acc += json.loads(line)['overall'] 
        cnt += 1 
    print(acc/cnt)

    4.31181166791025
    time: 4.55 s (started: 2022-04-13 23:37:19 +00:00)

Olreich4y ago

simdjson can load the json into memory in a queryable form in ~1/3 of a second. So you can save yourself basically 40% of the runtime right there. Computing average should take less than 1/2 a second on modern hardware (assumes <10 million books). So back-of-envelope target speed should be less than 1 second for this benchmark.

11.7s puts you at one order of magnitude off, which could be a fair price to pay if you never need this for large datasets (100s of GB or TB of data you want to query).

And the reason we use wrapped libraries in Python so often is because it’s abysmally slow to do anything in the interpreter. The average loop is 100x slower than it should be. The more math you do the worse it gets too. Most pure Python code is 1000x slower than it should be.

iknownothow4y ago

The example of processing and querying a 1GB "JSON Lines" [1] file, where each line is a json document 0.1-10KB in size with a varying schema on every line is a very common use case in data engineering. On top of that, there are additional constraints where we might only be allowed to allocate 1vCPU to the task, there's additional IO overhead of downloading the file from S3 and finally, even though there's TBs of the same data we only ever need to process a few GBs per hour or day. How well can simdjson perform under these circumstances [2]? Probably quite well but not as fast as having to serialize a single 1TB json file.

So my metrics of success in this scenario are based on that fact that I have to deal with 10-100 such queries in a project in my day job, so I would choose SpyQL to write and maintain a simple and readable 5 line query in under 5 minutes with decent performance to solve a trivial use case of computing an average.

P.S. I know the article is about performance and your response about Python being is slow is beyond accurate and yet I will always choose to use it because it is not ashamed to sit on the shoulders of the fast and ugly.

[1] https://jsonlines.org/

[2] https://github.com/simdjson/simdjson/blob/master/doc/iterate...

1 more reply

blumomo4y ago

psyql is amazing, it's readme helped me to quickly grasping its value proposition and usage. immediately recommended to a colleague who is an SQL and CLI aficionado. the psyql repo and matplotcli should have many more github stars.

dmouraOP4y ago

Thank you!! :-D

jazzyjackson4y ago

> I guess it's in the nature of Python to leverage internal/external modules written in a statically-typed compiled language to deliver high performance on core functionalities.

I'm aware of python leveraging compiled C or even FORTRAN to run certain functions, does it do message passing with a forked process or something else? I'm having a hard time googling for how it works since I don't know what it's called.

Tarq0n4y ago

Are you familiar with foreign function interfaces? [0] Threading can be handled Either by the host program or the C / Fortran code.

[0] https://docs.python.org/3/extending/extending.html

IshKebab4y ago

Feels a little disingenuous to highlight that it's written in Python when actually Rust is doing the heavy lifting.

davidatbu4y ago· 4 in thread

I asked about this on the Github issue regarding these benchmarks as well.

I'm curious as to why libraries like ultrajson[0] and orjson[1] weren't explored. They aren't command line tools, but neither is pandas (which is benchmarked) right? Is it perhaps because the code required to implement the challenges is large enough that they are considered too inconvenient to use through the same way pandas was used (ie, `python -c "..."`)?

[0] https://github.com/ultrajson/ultrajson

[1] https://github.com/ijl/orjson

dmouraOP4y ago

The idea was to focus on querying tools. ujson and orjson (as well as the json module from python's standard library) offer json decoding and decoding but not a querying language: you need to implement the query logic in Python, resulting in large programs with lots of boilerplate. Still, I agree that Pandas is an outlier... it was included due to its popularity for querying datasets.

I should mention that spyql leverages orjson, which has a considerable impact on performance. spyql supports both the json module from the standard library as well as orjson as json decoder/encoder. Performance wise, for 1GB of input data, orjson allows to decrease processing time by 20-30%. So, orjson is part of the reason why a python-based tool outperforms tools written in C, Go, etc and deserves credit.

davidatbu4y ago

> resulting in large programs with lots of boilerplate

That was what I was trying to say when I said "the code required to implement the challenges is large enough that they are considered too inconvenient to use". This makes sense to me.

Thank you for this benchmark! I'll probably switch to spyql now from jq.

> So, orjson is part of the reason why a python-based tool outperforms tools written in C, Go, etc and deserves credit.

Yes, I definitely think this is worth mentioning upfront in the future, since, IIUC, orison's core uses Rust (the serde library, specifically). The initial title gave me the impression that a pure-Python json parsing-and-querying solution was the fastest out there.

A parallel I think is helpful to think about is saying something like "the fastest BERT implementation is written Python[0]". While the linked implementation is written in Python, it offloads the performance critical parts to C/C++ through TensorFlow.

I'm not sure how such claims advance our understanding of the tradeoffs of programming languages. I initially thought that I was going to change my mind about my impression that "python is not a good tool to implement fast parsing/querying", but now I haven't, so I do think the title is a bit misleading.

[0] https://github.com/google-research/bert

1 more reply

jammycrisp4y ago

> I should mention that spyql leverages orjson, which has a considerable impact on performance

Even with orjson, you're still paying the cost of creating a new PyObject for every node in the JSON blob. orjson is well engineered (as is the backing serde-json decoder), but any JSON decoder that isn't using naive algorithms is mostly bound by the cost of creating PyObjects. Allocating in Python is _slow_.

I wrote a quick benchmark (https://gist.github.com/jcrist/de29815389eaed4eaf5b24fbcfdab...) showing a handwritten query that accesses only a few fields in a 13 MiB JSON file. The same query is repeated with a number of different Python JSON libraries. Results:

    $ python bench_repodata_query.py 
    msgspec: 45.018014032393694 ms
    simdjson: 61.94157397840172 ms
    orjson: 105.34720402210951 ms
    ujson: 121.9699690118432 ms
    json: 113.79130696877837 ms

While `orjson`, is faster than `ujson`/`json` here, it's only ~6% faster (in this benchmark). `simdjson` and `msgspec` (my library, see https://jcristharif.com/msgspec/) are much faster due to them avoiding creating PyObjects for fields that are never used.

If spyql's query engine can determine the fields it will access statically before processing, you might find using `msgspec` for JSON gives a nice speedup (it'll also type check the JSON if you know the type of each field). If this information isn't known though, you may find using `pysimdjson` (https://pysimdjson.tkte.ch/) gives an easy speed boost, as it should be more of a drop-in for `orjson`.

the_duke4y ago

When a Python tool is unexpectedly fast the answer is almost always: because the expensive part is implemented externally in a low level language.

flurly4y ago· 3 in thread

I personally tend to put all my data for analysis into clickhouse (which is written in C++ FWIW). My side project actually is all about ingesting JSON data into clickhouse and making it super easy to query, visualize, monitor and alert on JSON data.

You certainly pay a computation cost everytime you want to scan the data, but I think the flexibility more than makes up for it. You can take a read about how and why I built GraphJSON here if you're interested https://www.graphjson.com/guides/about

mritchie7124y ago

This is awesome. Assuming you saw that 22.3 has a JSON column type?

https://clickhouse.com/blog/clickhouse-22-3-lts-released/

ps - we're built on Clickhouse at Luabase (https://luabase.com/) and hiring across all role, send me a note (mike@luabase.com) if you're into web3+analytics.

qoega4y ago

Have you tried ClickHouse for dataset above? Interesting if it is faster or not

zX41ZdbW4y ago

It is appeared that clickhouse-local is faster than all of the rest, with a huge margin: https://colab.research.google.com/drive/1pGmfpNdGyzE1KEXDnSS...

1 more reply

wdroz4y ago· 3 in thread

Why they didn't compare with Polars [0]? According to the h2oai benchmark [1], Polars is fast.

[0] -- https://www.pola.rs/ [1] -- https://h2oai.github.io/db-benchmark/

dmouraOP4y ago

The initial idea was to focus on cmd line tools... I added pandas for comparison, as it is one of the most used libs to work with datasets. I will either remove Pandas from the equation or add Polars. By the way, I run some benchmarks and polars seems a bit faster than spyql for the aggregation challenge, but does not scale (loads everything into memory)

Tarq0n4y ago

Polars is for tabular data, a different kind of task.

wdroz4y ago

But you can read json from polars (or from pandas first then polars) then you can do the map/reduce/filter like in the benchmark.

cube22224y ago· 2 in thread

Hey, OctoSQL author here.

Great to see such benchmarks and another dataset to test with!

Just wanted to add a small explanation for the interested to why at least OctoSQL[0] falls behind in the first and last test.

OctoSQL supports retractions. That means an input for OctoSQL can not only be producing records, but also saying "I'm taking this record back", and OctoSQL will calculate the correct output based on that with minimal processing overhead required. It's a dataflow system in that sense.

To achieve that, it has to buffer data at the output to be able to display the current output state (which might have rows removed from it). It’s the same case for LIMIT which is right now very unoptimal in OctoSQL. Because the top 100 rows can actually change if you have an ORDER BY clause specified and new rows/retractions arrive - it buffers the whole output, and then displays the top 100 rows.

Running with the stream_native output format should drastically reduce the memory overhead and possibly processing time (I’ll post an update when the data is downloaded and I can compare), at least in the first case.

Obviously, OctoSQL could be much smarter about detecting whether a source can even produce retractions, and optimize it’s behavior based on that. No point in buffering if retractions are not possible (which is the case for i.e. a JSON file). This would make LIMIT and the output table much faster and lighter.

It’s definitely a big priority for me for future development to fix this (and make LIMIT and the table output format usable with big outputs), so stay tuned! :)

And congrats to SPyQL for the first place here, its design is awesome.

EDIT: Running the first test with `--output stream_native` (which just sends the productions and retractions as they come to the output, instead of buffering and sorting them) does indeed run 2-3x faster and use 50 MB RAM - instead of GB's.

[0]:https://github.com/cube2222/octosql

dmouraOP4y ago

Thanks @cube2222! Great! Sorry for overlooking that option. I will definitely add that option to octosql to make the comparison fair. Should I add it to the 3 queries?

cube22224y ago

No worries! I think it'll be most fair if I just let you know when these are fixed, and you can properly use them, performantly, with a sane output option (that's not stream native, but something like JSON). I.e. LIMIT won't work with stream_output.

If you want to add it to the first one, I won't oppose, but don't feel pressured to do it. Especially the third query does highlight a very realistict limitation of OctoSQL as it works right now, so it's only fair as it is.

1 more reply

procrastinatus4y ago· 2 in thread

Daniel Lemire’s simdjson probably belongs in this discussion and I would be surprised if it is not the fastest tool by some margin.

https://github.com/simdjson/simdjson

peterohler4y ago

For some specific cases simdjson might be the fastest but it does fare as well in other cases as seen here: https://github.com/ohler55/compare-go-json or here: https://github.com/ohler55/ojc/blob/master/misc/compete.md. Parsing the JSON file is only part of the solution though. There needs to be an efficient and easy way to extract the data of interest. OjG (https://github.com/ohler55/ojg) has a JSONPath implementation for that. Written in go is a high performer.

dmouraOP4y ago

Wow, I will check it out, thanks!!

gabrielsroka4y ago· 2 in thread

Focusing on ease of use instead of performance (I'm usually looking at small files) but I found jq's syntax to be really difficult, so I wrote my own teeny tiny tool in Python:

https://replit.com/@gabrielsroka/Bash#pj.py

Obviously it only does a fraction of what jq does, but it does what I need and I can expand it in a fraction of the time it would take to learn jq.

Inspired by https://docs.python.org/3/library/json.html#module-json.tool

soheilpro4y ago

My main problem with jq was finding the paths to the nodes, so I wrote catj (https://github.com/soheilpro/catj) to help with that.

dmouraOP4y ago

Loved the simplicity of your solution :-) Congrats!

chewbacha4y ago· 1 in thread

Not sure the HN title is that great. The actual title is “ The fastest tools for querying large JSON datasets” and the bit about it being written in python is editorialized. The analysis includes products written in go, python, and Java but doesn’t appear to care about the implementation language at all as you must click through to the GitHub to see the language.

I’m also not confident that the list was exhaustive.

dmouraOP4y ago

Yes, I agree with you. I guess the title is a bit too provocative... still trying to understand what tools I might have missed, and this seemed a good way of doing that... I do not want to give wrong impressions to people that only read the title, so I might have been too impulsive when choosing the title...

eatonphil4y ago· 1 in thread

Thanks for including dsq! Its focus is always going to be flexibility and UX over performance since it's just a thin wrapper over the internals of a GUI data integration tool [0]. For example many other tools operate on only one or two file types and have optimized for querying on those types.

Some like octosql and spyql IIRC have implemented their own sql parsing layer and engine. So if full sql support is necessary you may not be able to use them.

In contrast dsq supports excel, avro, parquet, json, csv/tsv, nginx logs, apache logs, json newline, concatenated json, etc. And this list keeps growing. Dsq stores data in sqlite under the hood so you have access to the full sqlite dialect. Textql and q also do it this way.

That said, despite performance not being the major I goal I think it will get better over time. There's already an OSS contributor who happens to be thinking about perf.

If you're interested in getting started with OSS contributions btw I keep a page of good first projects [0]. If you know Go you're in a good place to contribute. I've already had a number of people take me up on this and merged a number of PRs by first-time OSS contributors.

[0] https://github.com/multiprocessio/datastation

[1] https://github.com/multiprocessio/datastation/blob/main/GOOD...

dmouraOP4y ago

Thank you Eaton! The truth is, if I was seeking to have the best performant tool I would not choose Python to start with... so performance is only a part of tools like the ones we are writing offer. Thank you for you tip, I will look into it ;-)

benob4y ago· 1 in thread

orjson, the fast underlying json parser, is written in rust.

shepardrtc4y ago

As soon as I saw Python, I knew they were using orjson.

The title of this post is incorrect. It should simply be the title of the Colab notebook, "The fastest tools for querying large JSON datasets". While SPyQL is the fastest tool, it's not written entirely in Python.

irrational4y ago· 1 in thread

Just this week I has to query a JSON file that is 13-14 MB in size. I already had Node installed, so that it what I used. I was surprised that it was able to do the processing so fast that it was done as soon as I hit the enter key in the terminal window. And the processing wasn't trivial, it had to touch every single element of the entire JSON file and put the entire thing into a new data structure. I was quite impressed at how fast it was. It took my IDE about 5 minutes just to open the JSON file so I could take a look at it to see how it was structured.

jamal-kumar4y ago

Sounds like your IDE is a bit bloated. Not to tell you to stop using it or anything, that's a tall order - but next time you have a sufficiently large file (15mb shouldn't choke an editor?), try head/tail/less/more and you'll save those 5 minutes too

pmdulaney4y ago· 1 in thread

Cool! Let's port it to C and make it really fast!

dmouraOP4y ago

:-D out of curiosity, jq, one of the tools in the benchmark, is written in C

dmouraOP4y ago

Thank you all for your feedback. The benchmark was updated and the fastest tool is NOT written in Python. Here are the highlights:

* Added ClickHouse (written in C++) to the benchmark: I was unaware that the clickhouse-local tool would handle these tasks. ClickHouse is now the fastest (together with OctoSQL);

* OctoSQL (written in Go) was updated as a response to the benchmark: updates included switching to fastjson, short-circuiting LIMIT, and eagerly printing when outputting JSON and CSV. Now, OctoSQL is one of the fastest and memory is stable;

* SPyQL (written in Python) is now third: SPyQL leverages orjson (Rust) to parse JSONs, while the query engine is written in Python. When processing 1GB of input data, SPyQL takes 4x-5x more time than the best, while still achieving up to 2x higher performance than jq (written in C);

* I removed Pandas from the benchmark and focused on command-line tools. I am planning a separate benchmark on Python libs where Pandas, Polars and Modin (and eventually others) will be included.

This benchmark is a living document. If you are interested in receiving updates, please subscribe to the following issue: https://github.com/dcmoura/spyql/issues/72

Thank you!

PaulHoule4y ago

Tools like

https://www.alteryx.com/

are really limited in my view because they are based on a relational model that requires joins to do things that are easily expressed without joins in the object-relational (basically JSON) model.

"Boxes and lines" data pipelines that involve joins are a bear to maintain because of the structural instability: what looks like a little change to management can turn a very simple pipeline that goes in one direction to a pipeline that splits into four streams that have to get reassembled at the end.

If you were passing JSON documents between the nodes you can keep the stuff that would have been split out and then joined inside the document and the pipelines are a lot easier to maintain.

The people who develop those tools, however, are in love with the relational model because it is SIMDable, not branchy, easy to implement at hyper-speed so there is going to be an opportunity for people to make one that gets much better customer satisfaction.

To do that though somebody has to ask the question of "what algebra does this thing work over?"

PaulHoule4y ago

That SpyQL is pretty cute... I like how it imports Python modules directly.

It's a lot like list comprehensions on steroids but it does lack the high end features of SQL like joins and it has a simple query processing strategy that doesn't support query optimization like SQL.

It is convenient to incorporate the Python ecosystem (functions) but in the end it is not that strong because the query language doesn't understand Python. It reminds me of the Java Streams in the API which can represent all the blocks of the processing pipeline before you collect it, yet it doesn't buy very much because the map, filter, group, etc. functions are black boxes that it can't think globally about.

The Expression Tree version of LINQ in C# can do better (like compile what looks like a C# function to SQL) but it's disappointing that very few people really did anything with expression trees. (Give programmers the power of LISP and they say... Meh.)

fbwaters4y ago

So for large files the bottleneck is encoding/decoding, which is done by orjson and not Python. If you wrap orjson in C++, the results would be even faster.

gourneau4y ago

TIL that time can also output max memory use, and IO use very useful!

Like they do here:

/usr/bin/time -o performance.log -a -f"%e\t%M\t%x\t%C"

%e (Not in tcsh.) Elapsed real time (in seconds).

%M Maximum resident set size of the process during its lifetime, in Kbytes.

%x (Not in tcsh.) Exit status of the command.

%C (Not in tcsh.) Name and command-line arguments of the command being timed.

bob10294y ago

Potentially relevant here...

https://lemire.me/blog/2018/05/03/how-fast-can-you-parse-jso...

http://www.vldb.org/pvldb/vol10/p1118-li.pdf

"We experimentally evaluate Mison using representative real-world JSON datasets and the TPC-H benchmark, and show that Mison produces significant performance benefits over the best existing JSON parsers; in some cases, the performance improvement is over one order of magnitude."

Dowwie4y ago

Actually, it's written in Rust and wrapped by python.

benibela4y ago

You could also try my Xidel. With the dev build 0.9.9, not the last release 0.9.8, because recently I have implemented my own hash map and JSON parser to make it faster than before.

Although it is not fast nor supports large files (it loads everything in memory)

1vuio0pswjnm74y ago

I think this title says more about JSON than it does about Python.

j / k navigate · click thread line to collapse

84 comments

72 comments · 22 top-level

orlp4y ago· 22 in thread

* Which dialect for querying worked the best for you?

* Which tools do you enjoy the most? This could be for any reason: convenience of install/availability, expressivity, raw speed, other niche features, etc.

* What do you feel is the biggest gap/missing feature in currently available tools?

sandGorgon4y ago

Store the json in postgresql as JSONB.

Not trying to flame - I do realise that duckdb is an alternative, but everyone in this space does this

Especially for data scientists, etc. The tooling flow around dbt, pandas, jupyter, postgresql and json is rock solid.

boredumb4y ago

I can second that using JSONB with postgres is fast and easy, the features and operators to query json structures are also about as extensive as you'll ever need.

1 more reply

orlp4y ago

> Not trying to flame

Please, flame away. I'm not here to put up a fight for one approach or another, simply interested in hearing what people think works well and what doesn't; what they'd expect/want.

1 more reply

bachmeier4y ago

Wouldn't that require setting up and running a postgres server? DuckDB is an sqlite replacement.

1 more reply

PaulHoule4y ago

arangodb for the win. I've done data analytics in it and OLTP.

You never hear hype about arangodb because arangodb users see it as a "secret weapon" and want their competitors to use mongo.

1 more reply

klabb34y ago

> * What sort of JSON "meta-formats" are the most important/common for you? E.g. in a file you could have object-per-line, [...]

Yes (also known as jsonlines or line-separated JSON - .jsonl is the file extension).

And all of these good properties from such a ridiculously simple tech that it can be explained in a short sentence.

Downsides:

- Compression at the file level removes resumability (unless your compression is also resumable).

- You may have to implement your own log compaction if you have many updates/deletes.

gavinray4y ago

  > * Which dialect for querying worked the best for you?

If this is SQL dialects, I am partial to both the Postgres JSON/JSONB operators, SQLite has a solid JSON implementation too.

  > * Which tools do you enjoy the most? This could be for any reason: convenience of install/availability, expressivity, raw speed, other niche features, etc.

"Datasette" (from Django co-creator) can take tabular data (SQLite, CSV, JSON, etc) and generate a REST/GraphQL API with visualization tools from it:

https://github.com/simonw/datasette

From the same author, "sqlite-utils" generate SQLite table definitions and rows from similar:

https://github.com/simonw/sqlite-utils

I find this useful outside of SQLite because the syntax is similar across SQL databases. Great way to bootstrap a DB definition from existing datasets.

"Pipe JSON (or CSV or TSV) directly into a new SQLite database file, automatically creating a table with the appropriate schema"

  > * What sort of JSON "meta-formats" are the most important/common for you? E.g. in a file you could have object-per-line, object-of-arrays, array-of-objects, or in an SQL context you could have object-per-row or object-of-arrays-as-table, etc). I'd love to hear about others that are important to you.

For file, either JSONL/NDJSON (JSON object per line) if it's large, or a single array of objects if it's small.

In a SQL context, object-per-row undoubtedly

  > * What do you feel is the biggest gap/missing feature in currently available tools?

Don't have anything useful on this one unfortunately, but I'm not an analytics/data science person =(

dimatura4y ago

ellimilial4y ago

If it fits on a single machine - jq, flat files, JSON lines / avro if relatively flat. Change to a tabular format if when nesting not required.

Postgres JSONB works, but it requires maintaining a heavy server process. So does Lucene/elasticsearch.

cryptonector4y ago

JSONB is incredibly awesome, and should be extracted from PG and made usable on its own.

For those who don't know, JSONB is a binary JSON encoding that is specifically optimized for data at rest and compression thereof.

1 more reply

galdosdi4y ago

I cannot describe how much I love jq. Best new (to me) tool I discovered in all of 2020.

Once you get the hang of it the syntax feels extremely powerful. The only other thing it reminded me of is the first time I learned enough SQL to be dangerous.

You really, really are going to want to check jq out at least a little if you want to improve the state of the art in this area. It has an excellent manual btw.

Edit: you ask about "metaformats" such as object per newline. jq handles this well too

carlineng4y ago

Snowflake’s handling of JSON data via the “variant” data type is quite nice to work with, both in terms of speed and convenience of the access methods.

geoduck144y ago

I'll add to this:

The variant column is useful for "let's dump this data into this column and separate it out in the nextstep"

Snowflake comes with a couple of methods of parsing semi structured data like JSON, XML, and even "custom" formats

We are playing around with Snowflake at work right now and I quite like it

iknownothow4y ago

I know this is a big ask, but if DuckDB can be in lock step with Postgres on JSONB handling :french-chef-kiss:

[1] https://jsonlines.org/

peterohler4y ago

I am biased though as I wrote OjG to handle what other tools were not able to do.

dmouraOP4y ago

Are you able to calculate aggregates, like an average?

jamal-kumar4y ago

jimsparkman4y ago

Since duckdb is OLAP, it would directly compare with other columnar database technologies like Redshift, Presto/Athena, etc.

Most of these systems strongly encourage or outright enforce JSONL, so that’s the defacto standard, and most tooling or pipelines are going to generate that nowadays.

I’m not sure I have a particular favorite, only that the database support such an operation since it is a common occurrence.

infinite8s4y ago

As to JSON metaformat (I assume you mean for ingestion), a format of an array of objects works best.

zX41ZdbW4y ago

That's great, but you don't have to finish your PhD. The task is already solved in ClickHouse: https://colab.research.google.com/drive/1pGmfpNdGyzE1KEXDnSS...

dorianmariefr4y ago

jq for quick lookup, jsonb for actual storage, indexing jsonb is a pain though

cryptonector4y ago

JSONB is optimized for compression, but also pessimized for writing.

What we should do is teach jq to read and write various binary JSON formats including JSONB.

dmouraOP4y ago· 7 in thread

For me the lesson was that in certain problems (e.g. I/O intensive) the architecture/design might have a higher impact than the choice of the programming language.

Here is a simple experiment with a 1GB file that shows that JSON decoding takes less than 40% of the processing time:

    !spyql "SELECT avg_agg(json->overall) FROM orjson" < books.json

    avg_agg_overall
    4.31181166791025 
    time: 11.7 s (started: 2022-04-13 23:37:07 +00:00)


    import orjson as json
    acc = 0 
    cnt = 0 
    with open('books.json') as f: 
      for line in f: 
        acc += json.loads(line)['overall'] 
        cnt += 1 
    print(acc/cnt)

    4.31181166791025
    time: 4.55 s (started: 2022-04-13 23:37:19 +00:00)

Olreich4y ago

11.7s puts you at one order of magnitude off, which could be a fair price to pay if you never need this for large datasets (100s of GB or TB of data you want to query).

iknownothow4y ago

[1] https://jsonlines.org/

[2] https://github.com/simdjson/simdjson/blob/master/doc/iterate...

1 more reply

blumomo4y ago

dmouraOP4y ago

Thank you!! :-D

jazzyjackson4y ago

> I guess it's in the nature of Python to leverage internal/external modules written in a statically-typed compiled language to deliver high performance on core functionalities.

Tarq0n4y ago

Are you familiar with foreign function interfaces? [0] Threading can be handled Either by the host program or the C / Fortran code.

[0] https://docs.python.org/3/extending/extending.html

IshKebab4y ago

Feels a little disingenuous to highlight that it's written in Python when actually Rust is doing the heavy lifting.

davidatbu4y ago· 4 in thread

I asked about this on the Github issue regarding these benchmarks as well.

[0] https://github.com/ultrajson/ultrajson

[1] https://github.com/ijl/orjson

dmouraOP4y ago

davidatbu4y ago

> resulting in large programs with lots of boilerplate

That was what I was trying to say when I said "the code required to implement the challenges is large enough that they are considered too inconvenient to use". This makes sense to me.

Thank you for this benchmark! I'll probably switch to spyql now from jq.

> So, orjson is part of the reason why a python-based tool outperforms tools written in C, Go, etc and deserves credit.

[0] https://github.com/google-research/bert

1 more reply

jammycrisp4y ago

> I should mention that spyql leverages orjson, which has a considerable impact on performance

    $ python bench_repodata_query.py 
    msgspec: 45.018014032393694 ms
    simdjson: 61.94157397840172 ms
    orjson: 105.34720402210951 ms
    ujson: 121.9699690118432 ms
    json: 113.79130696877837 ms

the_duke4y ago

When a Python tool is unexpectedly fast the answer is almost always: because the expensive part is implemented externally in a low level language.

flurly4y ago· 3 in thread

mritchie7124y ago

This is awesome. Assuming you saw that 22.3 has a JSON column type?

https://clickhouse.com/blog/clickhouse-22-3-lts-released/

ps - we're built on Clickhouse at Luabase (https://luabase.com/) and hiring across all role, send me a note (mike@luabase.com) if you're into web3+analytics.

qoega4y ago

Have you tried ClickHouse for dataset above? Interesting if it is faster or not

zX41ZdbW4y ago

It is appeared that clickhouse-local is faster than all of the rest, with a huge margin: https://colab.research.google.com/drive/1pGmfpNdGyzE1KEXDnSS...

1 more reply

wdroz4y ago· 3 in thread

Why they didn't compare with Polars [0]? According to the h2oai benchmark [1], Polars is fast.

[0] -- https://www.pola.rs/ [1] -- https://h2oai.github.io/db-benchmark/

dmouraOP4y ago

Tarq0n4y ago

Polars is for tabular data, a different kind of task.

wdroz4y ago

But you can read json from polars (or from pandas first then polars) then you can do the map/reduce/filter like in the benchmark.

cube22224y ago· 2 in thread

Hey, OctoSQL author here.

Great to see such benchmarks and another dataset to test with!

Just wanted to add a small explanation for the interested to why at least OctoSQL[0] falls behind in the first and last test.

It’s definitely a big priority for me for future development to fix this (and make LIMIT and the table output format usable with big outputs), so stay tuned! :)

And congrats to SPyQL for the first place here, its design is awesome.

[0]:https://github.com/cube2222/octosql

dmouraOP4y ago

Thanks @cube2222! Great! Sorry for overlooking that option. I will definitely add that option to octosql to make the comparison fair. Should I add it to the 3 queries?

cube22224y ago

1 more reply

procrastinatus4y ago· 2 in thread

Daniel Lemire’s simdjson probably belongs in this discussion and I would be surprised if it is not the fastest tool by some margin.

https://github.com/simdjson/simdjson

peterohler4y ago

dmouraOP4y ago

Wow, I will check it out, thanks!!

gabrielsroka4y ago· 2 in thread

Focusing on ease of use instead of performance (I'm usually looking at small files) but I found jq's syntax to be really difficult, so I wrote my own teeny tiny tool in Python:

https://replit.com/@gabrielsroka/Bash#pj.py

Obviously it only does a fraction of what jq does, but it does what I need and I can expand it in a fraction of the time it would take to learn jq.

Inspired by https://docs.python.org/3/library/json.html#module-json.tool

soheilpro4y ago

My main problem with jq was finding the paths to the nodes, so I wrote catj (https://github.com/soheilpro/catj) to help with that.

dmouraOP4y ago

Loved the simplicity of your solution :-) Congrats!

chewbacha4y ago· 1 in thread

I’m also not confident that the list was exhaustive.

dmouraOP4y ago

eatonphil4y ago· 1 in thread

Some like octosql and spyql IIRC have implemented their own sql parsing layer and engine. So if full sql support is necessary you may not be able to use them.

That said, despite performance not being the major I goal I think it will get better over time. There's already an OSS contributor who happens to be thinking about perf.

[0] https://github.com/multiprocessio/datastation

[1] https://github.com/multiprocessio/datastation/blob/main/GOOD...

dmouraOP4y ago

benob4y ago· 1 in thread

orjson, the fast underlying json parser, is written in rust.

shepardrtc4y ago

As soon as I saw Python, I knew they were using orjson.

irrational4y ago· 1 in thread

jamal-kumar4y ago

pmdulaney4y ago· 1 in thread

Cool! Let's port it to C and make it really fast!

dmouraOP4y ago

:-D out of curiosity, jq, one of the tools in the benchmark, is written in C

dmouraOP4y ago

Thank you all for your feedback. The benchmark was updated and the fastest tool is NOT written in Python. Here are the highlights:

* Added ClickHouse (written in C++) to the benchmark: I was unaware that the clickhouse-local tool would handle these tasks. ClickHouse is now the fastest (together with OctoSQL);

* I removed Pandas from the benchmark and focused on command-line tools. I am planning a separate benchmark on Python libs where Pandas, Polars and Modin (and eventually others) will be included.

This benchmark is a living document. If you are interested in receiving updates, please subscribe to the following issue: https://github.com/dcmoura/spyql/issues/72

Thank you!

PaulHoule4y ago

Tools like

https://www.alteryx.com/

are really limited in my view because they are based on a relational model that requires joins to do things that are easily expressed without joins in the object-relational (basically JSON) model.

If you were passing JSON documents between the nodes you can keep the stuff that would have been split out and then joined inside the document and the pipelines are a lot easier to maintain.

To do that though somebody has to ask the question of "what algebra does this thing work over?"

PaulHoule4y ago

That SpyQL is pretty cute... I like how it imports Python modules directly.

fbwaters4y ago

So for large files the bottleneck is encoding/decoding, which is done by orjson and not Python. If you wrap orjson in C++, the results would be even faster.

gourneau4y ago

TIL that time can also output max memory use, and IO use very useful!

Like they do here:

/usr/bin/time -o performance.log -a -f"%e\t%M\t%x\t%C"

%e (Not in tcsh.) Elapsed real time (in seconds).

%M Maximum resident set size of the process during its lifetime, in Kbytes.

%x (Not in tcsh.) Exit status of the command.

%C (Not in tcsh.) Name and command-line arguments of the command being timed.

bob10294y ago

Potentially relevant here...

https://lemire.me/blog/2018/05/03/how-fast-can-you-parse-jso...

http://www.vldb.org/pvldb/vol10/p1118-li.pdf

Dowwie4y ago

Actually, it's written in Rust and wrapped by python.

benibela4y ago

You could also try my Xidel. With the dev build 0.9.9, not the last release 0.9.8, because recently I have implemented my own hash map and JSON parser to make it faster than before.

Although it is not fast nor supports large files (it loads everything in memory)

1vuio0pswjnm74y ago

I think this title says more about JSON than it does about Python.

j / k navigate · click thread line to collapse