I can't help but mention clickhouse-local tool: https://clickhouse.com/docs/en/operations/utilities/clickhou...
clickhouse-local is a single binary that lets you perform fast data processing using SQL - effectively database features without a database. It supports the full breadth of ClickHouse functions, many popular file formats, and (recently added) automatic schema inference. You can query not only local files but also remote ones (on S3/HDFS, or static files accessed by URL). Moreover, clickhouse-local has an interactive mode where you can create tables, play with data, and do almost everything you can do with an ordinary database. And let's not forget: it's written in C++, so it's incredibly fast.
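For a flavor of it (a rough sketch - file names and the URL are placeholders, and exact flags can vary between ClickHouse versions):

```shell
# Query a local CSV; the schema is inferred automatically
clickhouse-local --query "SELECT count(*) FROM file('hits.csv', CSVWithNames)"

# Query a remote file over HTTP via the url() table function
clickhouse-local --query "
    SELECT UserAgent, count(*) AS c
    FROM url('https://example.com/hits.csv', CSVWithNames)
    GROUP BY UserAgent
    ORDER BY c DESC
    LIMIT 10"

# Running it without a query drops you into the interactive mode
clickhouse-local
```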
Disclaimer: Work at ClickHouse
Disclaimer: I do NOT work for ClickHouse :)
As far as similar tools go, if you're interested, I recommend taking a look at DataFusion[0], dsq[1], and OctoSQL[2].
DataFusion is a very (very very) fast command-line SQL engine but with limited support for data formats.
dsq is based on SQLite, which means it has to load data into SQLite first, but in exchange gives you the whole breadth of SQLite. It also supports many data formats, though it's slower as a result.
OctoSQL is faster, extensible through plugins, and supports incremental query execution, so you can e.g. calculate and display a running group-by count while tailing a log file. It also supports normal databases, not just file formats, so you can e.g. join with a Postgres table.
[0]: https://github.com/apache/arrow-datafusion
[1]: https://github.com/multiprocessio/dsq
[2]: https://github.com/cube2222/octosql
Disclaimer: Author of OctoSQL
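Rough usage sketches for the latter two (file names are made up - check each project's README for the current syntax):

```shell
# dsq: the file is loaded into SQLite and exposed as the {} table
dsq access.json "SELECT status, COUNT(*) FROM {} GROUP BY status"

# OctoSQL: files (and configured databases) are referenced directly as tables
octosql "SELECT COUNT(*) FROM access.json"
```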
The command line client also has some nifty features like syntax highlighting, plus context-aware auto-completion coming in the next release.
[1] https://duckdb.org/docs/installation/
Disclaimer: working on DuckDB
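By way of illustration (the file name is made up; see the DuckDB docs linked above for details), the CLI can query files directly:

```shell
# One-shot query against a CSV file; DuckDB infers the schema
duckdb -c "SELECT referrer, count(*) FROM 'hits.csv' GROUP BY referrer"

# Or open the interactive shell against an in-memory database
duckdb
```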
The thing that worried me when looking into SQL tools for CSV files on the command line is the plethora of tools available, and how hard it is to find one that feels solid and well-supported enough to become a "default" tool for many daily tasks.
I want to avoid investing a lot of time learning the ins and outs of a tool that might stop being developed in a year from now. I wish for something that can become the "awk of tomorrow", but based on SQL or something similar.
Does anyone have any experiences related to that? Is my worry warranted? Are some projects more well supported than others?
https://colab.research.google.com/github/dcmoura/spyql/blob/...
Disclaimer: Work at ClickHouse, whose tool is part of the benchmarking efforts linked above.
The SPyQL CLI is more oriented toward working in harmony with the shell (piping), being very simple to use, and leveraging the Python ecosystem (you can import Python libs and use them in your queries). With SPyQL you can:
- use python code in your queries
- import python libs (just install them with pip/conda)
- write your own UDFs in Python
- run OS commands from within the query (using os.system)
- get guaranteed row order (as with grep, sed, etc.)
And there is more - please take a look at: https://spyql.readthedocs.io/en/latest/distinctive.html
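A quick sketch of what that looks like in practice (column and file names here are hypothetical; see the docs above for the exact syntax):

```shell
# Pipe a CSV through SPyQL, using plain Python expressions in the query
cat access.csv | spyql "SELECT ip, bytes / 1024 AS kb FROM csv TO json"

# Import a Python module and use it directly in the query
cat urls.csv | spyql "IMPORT urllib.parse AS up SELECT up.urlparse(url).netloc FROM csv TO csv"
```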