I can't help but mention clickhouse-local tool: https://clickhouse.com/docs/en/operations/utilities/clickhou...
clickhouse-local is a single binary that lets you perform fast data processing using SQL - effectively database features without a database. It supports the full breadth of ClickHouse functions, many popular file formats, and (recently added) automatic schema inference. You can query not only local files but also remote ones (on S3/HDFS, or static files accessed by URL). Moreover, clickhouse-local has an interactive mode where you can create tables, play with data, and do almost everything you can do with an ordinary database. And let's not forget: it's written in C++, so it's incredibly fast.
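For a flavor of it (a rough sketch - file names and the URL are placeholders, and exact flags can vary between ClickHouse versions):

```shell
# Query a local CSV; the schema is inferred automatically
clickhouse-local --query "SELECT count(*) FROM file('hits.csv', CSVWithNames)"

# Query a remote file over HTTP via the url() table function
clickhouse-local --query "
    SELECT UserAgent, count(*) AS c
    FROM url('https://example.com/hits.csv', CSVWithNames)
    GROUP BY UserAgent
    ORDER BY c DESC
    LIMIT 10"

# Running it without a query drops you into the interactive mode
clickhouse-local
```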
Disclaimer: Work at ClickHouse
Disclaimer: I do NOT work for ClickHouse :)
As far as similar tools go, if you're interested, I recommend taking a look at DataFusion[0], dsq[1], and OctoSQL[2].
DataFusion is a very (very very) fast command-line SQL engine but with limited support for data formats.
dsq is based on SQLite, which means it has to load data into SQLite first, but in exchange gives you the whole breadth of SQLite. It also supports many data formats, though it's slower as a result.
OctoSQL is faster, extensible through plugins, and supports incremental query execution, so you can e.g. calculate and display a running group-by count while tailing a log file. It also supports normal databases, not just file formats, so you can e.g. join with a Postgres table.
[0]: https://github.com/apache/arrow-datafusion
[1]: https://github.com/multiprocessio/dsq
[2]: https://github.com/cube2222/octosql
Disclaimer: Author of OctoSQL
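Rough usage sketches for the latter two (file names are made up - check each project's README for the current syntax):

```shell
# dsq: the file is loaded into SQLite and exposed as the {} table
dsq access.json "SELECT status, COUNT(*) FROM {} GROUP BY status"

# OctoSQL: files (and configured databases) are referenced directly as tables
octosql "SELECT COUNT(*) FROM access.json"
```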
The command line client also has some nifty features like syntax highlighting, plus context-aware auto-completion coming in the next release.
[1] https://duckdb.org/docs/installation/
Disclaimer: working on DuckDB
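By way of illustration (the file name is made up; see the DuckDB docs linked above for details), the CLI can query files directly:

```shell
# One-shot query against a CSV file; DuckDB infers the schema
duckdb -c "SELECT referrer, count(*) FROM 'hits.csv' GROUP BY referrer"

# Or open the interactive shell against an in-memory database
duckdb
```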
The thing that worried me when looking into SQL tools for CSV files on the command line is the plethora of tools available, and how hard it is to find one that feels solid and well-supported enough to become a "default" tool for many daily tasks.
I want to avoid investing a lot of time learning the ins and outs of a tool that might stop being developed in a year from now. I wish for something that can become the "awk of tomorrow", but based on SQL or something similar.
Does anyone have any experiences related to that? Is my worry warranted? Are some projects more well supported than others?
https://colab.research.google.com/github/dcmoura/spyql/blob/...
Disclaimer: Work at ClickHouse, whose tool is part of the benchmarking efforts linked above.
The SPyQL CLI is more oriented toward working in harmony with the shell (piping), being very simple to use, and leveraging the Python ecosystem (you can import Python libs and use them in your queries). With SPyQL you can:
- use python code in your queries
- import python libs (just install them with pip/conda)
- write your own UDFs in Python
- run OS commands from within the query (using os.system)
- get guaranteed row order (as with grep, sed, etc.)
And there is more - please take a look at: https://spyql.readthedocs.io/en/latest/distinctive.html
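A quick sketch of what that looks like in practice (column and file names here are hypothetical; see the docs above for the exact syntax):

```shell
# Pipe a CSV through SPyQL, using plain Python expressions in the query
cat access.csv | spyql "SELECT ip, bytes / 1024 AS kb FROM csv TO json"

# Import a Python module and use it directly in the query
cat urls.csv | spyql "IMPORT urllib.parse AS up SELECT up.urlparse(url).netloc FROM csv TO csv"
```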