* Which dialect for querying worked the best for you?
* Which tools do you enjoy the most? This could be for any reason: convenience of install/availability, expressivity, raw speed, other niche features, etc.
* What sort of JSON "meta-formats" are the most important/common for you? E.g. in a file you could have object-per-line, object-of-arrays, array-of-objects, or in an SQL context you could have object-per-row or object-of-arrays-as-table, etc). I'd love to hear about others that are important to you.
* What do you feel is the biggest gap/missing feature in currently available tools?
Not trying to flame - I do realise that duckdb is an alternative, but everyone in this space does this
Especially for data scientists, etc. The tooling flow around dbt, pandas, jupyter, postgresql and json is rock solid.
Please, flame away. I'm not here to put up a fight for one approach or another, simply interested in hearing what people think works well and what doesn't; what they'd expect/want.
You never hear hype about arangodb because arangodb users see it as a "secret weapon" and want their competitors to use mongo.
Yes (also known as jsonlines or line-separated JSON - .jsonl is the file extension).
I think JSON should be thought of as a message format, and messages should be small. Messages can be sent over a stream, without having the "whole" data-set. You also get append-only semantics on files (saving disk-space and virtually creating a stupid write-ahead log become as cheap & hassle-free as it gets), and if you need it you get resumability (say binary search on a file with monotonically increasing ids/timestamps etc).
Additionally, this format will work with minimal hickups when interoperating with other systems. YOU may have an advanced, efficient parser, but others may not. Most environments have a line-parser and a json deserializer. Large JSON-files become hard-to-manage and prompts for more and more complex query-like tools.
And all of these good properties from such a ridiculously simple tech that it can be explained in a short sentence.
Downsides:
- Compression at the file level removes resumability (unless your compression is also resumable).
- You may have to implement your own log compaction if you have many updates/deletes.
> * Which dialect for querying worked the best for you?
If this is SQL dialects, I am partial to both the Postgres JSON/JSONB operators, SQLite has a solid JSON implementation too. > * Which tools do you enjoy the most? This could be for any reason: convenience of install/availability, expressivity, raw speed, other niche features, etc.
"Datasette" (from Django co-creator) can take tabular data (SQLite, CSV, JSON, etc) and generate a REST/GraphQL API with visualization tools from it:https://github.com/simonw/datasette
From the same author, "sqlite-utils" generate SQLite table definitions and rows from similar:
https://github.com/simonw/sqlite-utils
I find this useful outside of SQLite because the syntax is similar across SQL databases. Great way to bootstrap a DB definition from existing datasets.
"Pipe JSON (or CSV or TSV) directly into a new SQLite database file, automatically creating a table with the appropriate schema"
> * What sort of JSON "meta-formats" are the most important/common for you? E.g. in a file you could have object-per-line, object-of-arrays, array-of-objects, or in an SQL context you could have object-per-row or object-of-arrays-as-table, etc). I'd love to hear about others that are important to you.
For file, either JSONL/NDJSON (JSON object per line) if it's large, or a single array of objects if it's small.In a SQL context, object-per-row undoubtedly
> * What do you feel is the biggest gap/missing feature in currently available tools?
Don't have anything useful on this one unfortunately, but I'm not an analytics/data science person =(If I think I'll have to do at least a few more analyses in the future, perhaps with a growing dataset, I'll usually put the data into Sqlite. If possible I try to keep it simple, with a single table, even if it means a non-normalized schema. As for tooling I typically go with `dataset`, an sqlalchemy wrapper that's super simple to use, and makes it possible to also use raw sql if I need to. I haven't fully explored the JSON capabilities of sqlite itself, but have been meaning to. If duckdb gets similar features that would be certainly worth looking into.
In terms of "meta-format", I usually like object-per-line and array of objects. Easy to add more records, pretty self-explanatory. Maybe inefficient but if that becomes an issue then it's time to move away from just JSON.
In regards to querying language, I usually don't do anything too complicated so I don't think much about it. Having SQL (as when using dataset) is nice. I also have to use mongodb for some tasks, and I also find that query language good enough for most things I need to do, since I'm not usually dealing with highly relational data.
Postgres JSONB works, but it requires maintaining a heavy server process. So does Lucene/elasticsearch.
I have been yearning for embeddable store (in line with SQLite the support that both works and also keeps the data compressed like JSONB). I know there were some attempts, tried some of it those, mostly monstrosities).
For those who don't know, JSONB is a binary JSON encoding that is specifically optimized for data at rest and compression thereof.
The key feature in JSONB is that most internal pointers [from arrays and objects] to values are in the form of lengths, with every 32nd pointer being an offset. This comes from the observation that offsets will not repeat, therefore are difficult to compress w/ off the shelf compression algorithms, but length values will often be the same and thus be compressible. This means that iterating an array (say) requires 31 additions for every 32 elements to recover the offsets to those 31 elements' values.
The story of how they came to this optimization for compression is fascinating. IIRC they implemented an offsets-only JSONB and were very happy with it until they discovered that that form of JSONB did not compress anywhere near as well as expected, and since PG was close to shipping, a feverish hunt for the cause ensued that culminated in the fix of mostly-using-lengths-instead-of-offsets.
Once you get the hang of it the syntax feels extremely powerful. The only other thing it reminded me of is the first time I learned enough SQL to be dangerous.
You really, really are going to want to check jq out at least a little if you want to improve the state of the art in this area. It has an excellent manual btw.
Edit: you ask about "metaformats" such as object per newline. jq handles this well too
The variant column is useful for "let's dump this data into this column and separate it out in the nextstep"
Snowflake comes with a couple of methods of parsing semi structured data like JSON, XML, and even "custom" formats
We are playing around with Snowflake at work right now and I quite like it
I know this is a big ask, but if DuckDB can be in lock step with Postgres on JSONB handling :french-chef-kiss:
As far as I am aware (haven't researched), the standard way to get logs or metadata out of AWS Cloudwatch is in a JSON lines [1] format. Therefore, even if I were to write a custom JSON output format for my custom service, I would try my best to stick to the JSON lines format since I would have already written parsers and data connectors for AWS Cloudwatch. That's just me though.
I am biased though as I wrote OjG to handle what other tools were not able to do.
Most of these systems strongly encourage or outright enforce JSONL, so that’s the defacto standard, and most tooling or pipelines are going to generate that nowadays.
You can obviously still have a row of arrays, and different systems have slightly different approaches on how to deal with those. In Spark, this is referred to as “exploding”, in Presto you would cross join to unnest an array, in Redshift you can glob on the super type.
I’m not sure I have a particular favorite, only that the database support such an operation since it is a common occurrence.
As to JSON metaformat (I assume you mean for ingestion), a format of an array of objects works best.
For the last question, the main gap I find is trying to understand the 'structure' of json data that I didn't generate directly. There aren't great tools to show whether all the JSON records have the same nested structure with similar key names, etc.
What we should do is teach jq to read and write various binary JSON formats including JSONB.
For me the lesson was that in certain problems (e.g. I/O intensive) the architecture/design might have a higher impact than the choice of the programming language.
spyql can both leverage the python standard lib for parsing json (written in C) as well as orjson (written in Rust). In this benchmark we used the later, which shows considerable performance improvements. Still, query processing (expression evaluation, filtering, aggregations, etc) are implemented in Python. I guess it's in the nature of Python to leverage internal/external modules written in a statically-typed compiled language to deliver high perfomance on core functionalities.
Here is a simple experiment with a 1GB file that shows that JSON decoding takes less than 40% of the processing time:
!spyql "SELECT avg_agg(json->overall) FROM orjson" < books.json
avg_agg_overall
4.31181166791025
time: 11.7 s (started: 2022-04-13 23:37:07 +00:00)
import orjson as json
acc = 0
cnt = 0
with open('books.json') as f:
for line in f:
acc += json.loads(line)['overall']
cnt += 1
print(acc/cnt)
4.31181166791025
time: 4.55 s (started: 2022-04-13 23:37:19 +00:00)11.7s puts you at one order of magnitude off, which could be a fair price to pay if you never need this for large datasets (100s of GB or TB of data you want to query).
And the reason we use wrapped libraries in Python so often is because it’s abysmally slow to do anything in the interpreter. The average loop is 100x slower than it should be. The more math you do the worse it gets too. Most pure Python code is 1000x slower than it should be.
So my metrics of success in this scenario are based on that fact that I have to deal with 10-100 such queries in a project in my day job, so I would choose SpyQL to write and maintain a simple and readable 5 line query in under 5 minutes with decent performance to solve a trivial use case of computing an average.
P.S. I know the article is about performance and your response about Python being is slow is beyond accurate and yet I will always choose to use it because it is not ashamed to sit on the shoulders of the fast and ugly.
[2] https://github.com/simdjson/simdjson/blob/master/doc/iterate...
I'm aware of python leveraging compiled C or even FORTRAN to run certain functions, does it do message passing with a forked process or something else? I'm having a hard time googling for how it works since I don't know what it's called.
I'm curious as to why libraries like ultrajson[0] and orjson[1] weren't explored. They aren't command line tools, but neither is pandas (which is benchmarked) right? Is it perhaps because the code required to implement the challenges is large enough that they are considered too inconvenient to use through the same way pandas was used (ie, `python -c "..."`)?
I should mention that spyql leverages orjson, which has a considerable impact on performance. spyql supports both the json module from the standard library as well as orjson as json decoder/encoder. Performance wise, for 1GB of input data, orjson allows to decrease processing time by 20-30%. So, orjson is part of the reason why a python-based tool outperforms tools written in C, Go, etc and deserves credit.
That was what I was trying to say when I said "the code required to implement the challenges is large enough that they are considered too inconvenient to use". This makes sense to me.
Thank you for this benchmark! I'll probably switch to spyql now from jq.
> So, orjson is part of the reason why a python-based tool outperforms tools written in C, Go, etc and deserves credit.
Yes, I definitely think this is worth mentioning upfront in the future, since, IIUC, orison's core uses Rust (the serde library, specifically). The initial title gave me the impression that a pure-Python json parsing-and-querying solution was the fastest out there.
A parallel I think is helpful to think about is saying something like "the fastest BERT implementation is written Python[0]". While the linked implementation is written in Python, it offloads the performance critical parts to C/C++ through TensorFlow.
I'm not sure how such claims advance our understanding of the tradeoffs of programming languages. I initially thought that I was going to change my mind about my impression that "python is not a good tool to implement fast parsing/querying", but now I haven't, so I do think the title is a bit misleading.
Even with orjson, you're still paying the cost of creating a new PyObject for every node in the JSON blob. orjson is well engineered (as is the backing serde-json decoder), but any JSON decoder that isn't using naive algorithms is mostly bound by the cost of creating PyObjects. Allocating in Python is _slow_.
I wrote a quick benchmark (https://gist.github.com/jcrist/de29815389eaed4eaf5b24fbcfdab...) showing a handwritten query that accesses only a few fields in a 13 MiB JSON file. The same query is repeated with a number of different Python JSON libraries. Results:
$ python bench_repodata_query.py
msgspec: 45.018014032393694 ms
simdjson: 61.94157397840172 ms
orjson: 105.34720402210951 ms
ujson: 121.9699690118432 ms
json: 113.79130696877837 ms
While `orjson`, is faster than `ujson`/`json` here, it's only ~6% faster (in this benchmark). `simdjson` and `msgspec` (my library, see https://jcristharif.com/msgspec/) are much faster due to them avoiding creating PyObjects for fields that are never used.If spyql's query engine can determine the fields it will access statically before processing, you might find using `msgspec` for JSON gives a nice speedup (it'll also type check the JSON if you know the type of each field). If this information isn't known though, you may find using `pysimdjson` (https://pysimdjson.tkte.ch/) gives an easy speed boost, as it should be more of a drop-in for `orjson`.
You certainly pay a computation cost everytime you want to scan the data, but I think the flexibility more than makes up for it. You can take a read about how and why I built GraphJSON here if you're interested https://www.graphjson.com/guides/about
https://clickhouse.com/blog/clickhouse-22-3-lts-released/
ps - we're built on Clickhouse at Luabase (https://luabase.com/) and hiring across all role, send me a note (mike@luabase.com) if you're into web3+analytics.
[0] -- https://www.pola.rs/ [1] -- https://h2oai.github.io/db-benchmark/
Great to see such benchmarks and another dataset to test with!
Just wanted to add a small explanation for the interested to why at least OctoSQL[0] falls behind in the first and last test.
OctoSQL supports retractions. That means an input for OctoSQL can not only be producing records, but also saying "I'm taking this record back", and OctoSQL will calculate the correct output based on that with minimal processing overhead required. It's a dataflow system in that sense.
To achieve that, it has to buffer data at the output to be able to display the current output state (which might have rows removed from it). It’s the same case for LIMIT which is right now very unoptimal in OctoSQL. Because the top 100 rows can actually change if you have an ORDER BY clause specified and new rows/retractions arrive - it buffers the whole output, and then displays the top 100 rows.
Running with the stream_native output format should drastically reduce the memory overhead and possibly processing time (I’ll post an update when the data is downloaded and I can compare), at least in the first case.
Obviously, OctoSQL could be much smarter about detecting whether a source can even produce retractions, and optimize it’s behavior based on that. No point in buffering if retractions are not possible (which is the case for i.e. a JSON file). This would make LIMIT and the output table much faster and lighter.
It’s definitely a big priority for me for future development to fix this (and make LIMIT and the table output format usable with big outputs), so stay tuned! :)
And congrats to SPyQL for the first place here, its design is awesome.
EDIT: Running the first test with `--output stream_native` (which just sends the productions and retractions as they come to the output, instead of buffering and sorting them) does indeed run 2-3x faster and use 50 MB RAM - instead of GB's.
If you want to add it to the first one, I won't oppose, but don't feel pressured to do it. Especially the third query does highlight a very realistict limitation of OctoSQL as it works right now, so it's only fair as it is.
https://replit.com/@gabrielsroka/Bash#pj.py
Obviously it only does a fraction of what jq does, but it does what I need and I can expand it in a fraction of the time it would take to learn jq.
Inspired by https://docs.python.org/3/library/json.html#module-json.tool
I’m also not confident that the list was exhaustive.
Some like octosql and spyql IIRC have implemented their own sql parsing layer and engine. So if full sql support is necessary you may not be able to use them.
In contrast dsq supports excel, avro, parquet, json, csv/tsv, nginx logs, apache logs, json newline, concatenated json, etc. And this list keeps growing. Dsq stores data in sqlite under the hood so you have access to the full sqlite dialect. Textql and q also do it this way.
That said, despite performance not being the major I goal I think it will get better over time. There's already an OSS contributor who happens to be thinking about perf.
If you're interested in getting started with OSS contributions btw I keep a page of good first projects [0]. If you know Go you're in a good place to contribute. I've already had a number of people take me up on this and merged a number of PRs by first-time OSS contributors.
[0] https://github.com/multiprocessio/datastation
[1] https://github.com/multiprocessio/datastation/blob/main/GOOD...
The title of this post is incorrect. It should simply be the title of the Colab notebook, "The fastest tools for querying large JSON datasets". While SPyQL is the fastest tool, it's not written entirely in Python.
* Added ClickHouse (written in C++) to the benchmark: I was unaware that the clickhouse-local tool would handle these tasks. ClickHouse is now the fastest (together with OctoSQL);
* OctoSQL (written in Go) was updated as a response to the benchmark: updates included switching to fastjson, short-circuiting LIMIT, and eagerly printing when outputting JSON and CSV. Now, OctoSQL is one of the fastest and memory is stable;
* SPyQL (written in Python) is now third: SPyQL leverages orjson (Rust) to parse JSONs, while the query engine is written in Python. When processing 1GB of input data, SPyQL takes 4x-5x more time than the best, while still achieving up to 2x higher performance than jq (written in C);
* I removed Pandas from the benchmark and focused on command-line tools. I am planning a separate benchmark on Python libs where Pandas, Polars and Modin (and eventually others) will be included.
This benchmark is a living document. If you are interested in receiving updates, please subscribe to the following issue: https://github.com/dcmoura/spyql/issues/72
Thank you!
are really limited in my view because they are based on a relational model that requires joins to do things that are easily expressed without joins in the object-relational (basically JSON) model.
"Boxes and lines" data pipelines that involve joins are a bear to maintain because of the structural instability: what looks like a little change to management can turn a very simple pipeline that goes in one direction to a pipeline that splits into four streams that have to get reassembled at the end.
If you were passing JSON documents between the nodes you can keep the stuff that would have been split out and then joined inside the document and the pipelines are a lot easier to maintain.
The people who develop those tools, however, are in love with the relational model because it is SIMDable, not branchy, easy to implement at hyper-speed so there is going to be an opportunity for people to make one that gets much better customer satisfaction.
To do that though somebody has to ask the question of "what algebra does this thing work over?"
It's a lot like list comprehensions on steroids but it does lack the high end features of SQL like joins and it has a simple query processing strategy that doesn't support query optimization like SQL.
It is convenient to incorporate the Python ecosystem (functions) but in the end it is not that strong because the query language doesn't understand Python. It reminds me of the Java Streams in the API which can represent all the blocks of the processing pipeline before you collect it, yet it doesn't buy very much because the map, filter, group, etc. functions are black boxes that it can't think globally about.
The Expression Tree version of LINQ in C# can do better (like compile what looks like a C# function to SQL) but it's disappointing that very few people really did anything with expression trees. (Give programmers the power of LISP and they say... Meh.)
Like they do here:
/usr/bin/time -o performance.log -a -f"%e\t%M\t%x\t%C"
%e (Not in tcsh.) Elapsed real time (in seconds).
%M Maximum resident set size of the process during its lifetime, in Kbytes.
%x (Not in tcsh.) Exit status of the command.
%C (Not in tcsh.) Name and command-line arguments of the command being timed.
https://lemire.me/blog/2018/05/03/how-fast-can-you-parse-jso...
http://www.vldb.org/pvldb/vol10/p1118-li.pdf
"We experimentally evaluate Mison using representative real-world JSON datasets and the TPC-H benchmark, and show that Mison produces significant performance benefits over the best existing JSON parsers; in some cases, the performance improvement is over one order of magnitude."
Although it is not fast nor supports large files (it loads everything in memory)