First paragraph of the conclusion, and this very much fits with the mindset that's been growing in me in the data world over the past few years. Databases are much more powerful than we think, they're not going to go away, only get better, and having the data and the logic in the same space really removes tons of headaches. ML models, transformation of data, generating json for an API can all be done within the database rather than outside scripting language.
Are others seeing this? Are the current tradeoffs just that more people know python vs sql or database specific languages to where moving logic to postgres or snowflake is looked down on?
dbt solves a lot of it, but I'd love to learn more about good resources for building reliable and readily-readable SQL algorithms for complex logic.
That tooling you're describing is definitely not there. Bigquery has BQML but it's very much in its infancy. I tend to do all the modeling in Python/R on sampled data and then deploy to SQL.
Also traditional relational databases have a way to go before they can support parallelized machine learning workloads. You do not have control or the ability to spin up threads or processes to boost your performance. You rely on the query processor to make those decisions, and depending on your platform the results will be mixed.
We always had them in version control, unit tested etc. The tools are there if you want to use them.
Shell scripts with the SQL as Here-Docs, in stead of stored procedures. Ordinary text files, eminently gittable. Also lets you insert varying table/column names as environment variables, in stead of the regex rigmarole in TFA. (Or was that in another post on his blog? Just returned from reading several of them.) Best of both worlds.
The older I get the more I agree with this.
There is nothing you cannot build by combining SQL primitives. Side effects can even be introduced - on purpose - by way of UDFs that talk to the outside world.
I've seen more than one system where the database itself was directly responsible for things like rendering final HTML for use by the end clients. You might think this is horrible and indeed many manifestations are. But, there lurk an opportunity for incredible elegance down this path, assuming you have the patience to engage in it.
I did this for a side project a few months ago and even used postgrest to serve the page with correct headers for html. It felt simultaneously really cursed and obvious. Shit you could even use plv8 to run mustache or whatever in the db if you really wanted to piss people off.
> I've seen more than one system where the database itself was directly responsible for things like rendering final HTML for use by the end clients.
I built BI Web pages entirely in PL/SQL (using some Oracle modules I can't quite recall the nams of) over twenty years ago. Got a bit hairy to keep the meta-recursive stuff in your head when hard-coding JavaScript into it, calling the next such page for each value on the page, for drill-down to the next level... Not fun, but quite doable.
And with a bit more modern tooling -- as I said, this was turn-of-the-century tech -- I have no doubt it could be done much more effectively and efficiently nowadays.
Postgres is very powerful. While I sought a short detour in nosql Mongodb land now back to Mysql Postgresql sql territory and glad for it
Being able to generate views is and stored procedures is useful as well.having sql Take over more like ml, gradient descent does open up good possibility.
Also since sql is declarative it Makes it so it's rather easier than imperative scripting languages
Specifically they have: Type systems, compilers, debuggers, text editors, package managers, C FFI etc.
But I agree that having the data and the program in the same process has benefits.
Writing programs in SQL is one way.
Another way is to move your data to your general program with SQLite.
I like using SQL for ACID, and queries as a first filter to get the data into my program where I may do further processing with the general language.
SQL has types
> compilers
For what specifically do you need a compiler?
> debuggers
Some tasks - like the concurrency SQL enables - are just very difficult to debug with debuggers. It would be the same with any other language. What SQL does here though is to allow you to focus on the logic, not the actual concurrency,
> text editors, package managers
I feel like these two are just for filling up the space.
> C FFI
Many SQL engines have UDFs
Non sequitur. SQL is typed; SQL can be edited in any text editor; there are lots of SQL IDEs and, arguably, debuggers and package managers. Sure, the package managers are specific to each RDBMS, but so what? Npm is no use in COBOL either. And sure, the “debuggers”, to the extent they can be said to exist, are radically different from those of “conventional” – of other – programming languages. But again, so what? A Smalltalk debugger is no use to fix the output from Intel’s latest C compiler either, or vice versa.
IOW: There is no such “SQL vs programming languages” dichotomy. SQL is just another programming language, with its own strengths and weaknesses, just like all the rest of them. “The rest” are not distinguished from SQL by somehow magically all having the attributes you claim for them: Some have them, some don't; some have this but not that, some others, the other way around. Someone built all those IDEs and debuggers and package managers for (some / many / most of) those other languages; you can build them for SQL too.
The basic notion is you keep your data hot in RAM and manage it directly. You make every change an object (or a command), and write that out serially to a log before you execute it. That gets you the ACID guarantees but with no I/O but linear writes, so it can be extremely fast.
It only makes sense when your data fits conveniently in RAM, but that's a lot of things.
Take it into piecess, elegance and efficiency. These will correspond to a logical statement of what you're trying to do, and how quickly the database will actually do it in practice.
SQL can do some nice things in areas, making it elegant in those areas. Elsewhere it can be pretty wordy and ugly.
In efficiency, it comes down largely to how the database is implemented and that also includes the capability of the optimiser. Both of these are out of your control. In my experience trying to turn a database into a number cruncher is just not going to work.
I guess that's long way round of me saying that I don't think I agree with you!
Something like the Snowflake data storage model + DuckDB as an execution engine + a Pandas/Polars-like API.
There is no reason why we have to be stuck with "database == SQL" all the time. SQL is extremely powerful, but sometimes you need a bit more, and in that case we shouldn't be so constrained.
But in general yes, the world is gradually waking up to the idea that performance matters, and that data locality can be extremely important for performance, and that doing your computations and data processing on your warehouse data in-place is going to be a huge time and money saver in the longer term.
And a function of what people think is attitudes towards working at the DB level. I see this often with ORM's in the web dev sphere (rather than Dat Science). Yes, ORM's are great but many people rely on them to completely abstract away the database and are terrified by raw sql or even query building. You also see it with services that abstract away the backend like PocketBase, Fireship, etc. Writing a window function or even a sub select looks like black magic to many.
I say this after several experiences with codebases where joins and filtering were often done at the application layer and raw sql was seen as the devil.
Unfortunately it's an expensive commercial product or I'd recommend you look at kdb+ if you work with time series data. The big banks use it and essentially put all thier latest RT data into kdb+ and then can write extremely succinct queries with a SQL-like syntax, but the ability to approach it far more programmatically than what is typically doable with something like PL-SQL. You can even write your neural network or whatever code in less than a page of code as the language of kdb+ is extremely powerful, although also basically incomprehensible until someone puts some time into learning it. It's extremely lightweight though, so very easy to deal with in an interactive fashion.
All that to say I agree with you that it's nice to just have everything you want all in one spot rather than to deal with 4 different tools and pipelines and shared drives and so on.
Consider too that PostgreSQL databases support different languages, like Python.
Loads of for-profit companies have tried to cash in on this. SAP HANA is one of the ones I've had recent experience with. It is unfortunately a poor implementation. The right architecture tends to be: put your model behind an API interface, not internal on the system. Train your model separately from production systems, and so on.
You might also be interested in checking out MLOps platforms like Kubeflow, Flyte, and others.
Yes, mostly development, deployment, etc.. concerns. I haven't ever seen an org that versions their SQL queries, unless they are in a codebase. The environment is just unfriendly towards that type of management. Nevermind testing! Things that have solutions but we haven't matured enough here, because that type of development has been happening in application code.
Also, SQL is generally more complex than application logic, because they are designed to do different things. What is a simple exercise in iteration or recursion can more easily become something a little more of a headache.
Problems that can be resolved, but they are problems.
I wonder if near-data functions on small databases is the solution to the limit of statelessness that you have with functions as a service.
In more modern DBs being distributed horizontally, this approach may see a rebound. The big “but” is still costs, in my experience in AWS as an example, managed Postgres Aurora was surprisingly expensive in terms of monthly cost.
I played with SQLite and it’s json columns. Once you get the hang of the syntax for walking a json structure you can do all sorts of neat things. Doing the same thing in Python would be tedious.
And I also believe it ended up being way faster than what I did in python.
The success of the Apache Spark engine can at least partially be attributed to
* being able to have the same expressive power as SQL but with a real Scala API (including having reusable libraries based on it)
* being able to embed it into unit tests at a low price of additional ~20 seconds latency to spin up a local Spark master
I remember one place I worked where we had several old graybeard programmers who considered the dbms[1] the operating system, as a unix sysadmin we had some interesting discussions as I was always confused and confined by the dbms and they felt the same about unix.
1. unidata if curious, a weird multi value(not relational) database, very vertically integrated compared to most databases today.
1. it's really hard to debug SQL queries and stored procedures (at least it was in Postgres 11)
2. when you hit a performance bottleneck, you don't have much control over it - parallelizing is hard and you have to trick the query planner to do what you want (and it doesn't work sometimes)
1) No static typing 2) Updating the logic requires migrations 3) You put the logic into the place which is the hardest to scale and many times a single point of failure 4) Cannot compose the code effectively and general verbosity
It's one of those ideas that sounds great on paper and maybe works in some smaller problems but as you go up in complexity things get worse and worse
And using PostgREST [0] you can serve your postgreSQL database as REST-API. And if you throw foreign data wrappers / multicorn in the mix, you can map any other datasource into your postgreSQL-db as table.
and data has mass. One example of bringing the work to the data is https://madlib.apache.org/ (works on Postgres and Greenplum)
[Disclaimer - former employee of Pivotal]
So I think a lot of the reluctance is from practical concerns.
This is very simple and scales well for our purposes
Has anyone converted stuff like gradient descent to set theory?
https://support.unicomsi.com/manuals/soliddb/7/SQL_Guide/2_G...
https://www.sqlshack.com/mathematics-sql-server-fast-introdu...
https://www.sqlshack.com/learn-sql-set-theory/
Right now AI algorithms kind of look imperative and stateful to me, like state machines. But there should be a functional or relational representation, where pattern matching would be used to derive the current state instead.
It's trivial to go from functional to imperative representation, but often nearly impossible to go the other way. That's why monadic (sorry if I'm mincing terms) logic is so troublesome. Mutable variables, futures/promises, async, etc can't be statically analyzed, which is why most code today is difficult to parallelize and stuff like C++ optimizers don't even increase speed to the next order of magnitude. But functional languages have nearly infinite optimization potential through divide and conquer approaches like sharding and scatter-gather arrays that can run internally without side effects. In other words, we can throw hardware at SQL for linear speedup since it's embarrassingly parallel, but might have limited success optimizing something like Python.
I understand the sentiment but you can run ML on plain CPUs, in fact it is easier to do so, you just don't get the massive speed benefits of GPUs. Hugging face's public models are all run on CPUs to save costs.
> Right now AI algorithms kind of look imperative and stateful to me, like state machines.
The maths is not stateful or imperative, only our implementation. Much like how functional languages are executed on a stateful CPU.
Research in parallelization is ongoing.
That being said, current deep learning libraries such as JAX and Pytorch use automatic differentiation to efficiently compute partial derivatives used for optimization algorithms such as gradient descent and it's not clear to me what the level of effort would be to convert that to something that could run efficiently in SQL?
https://en.wikipedia.org/wiki/Automatic_differentiation#Impl...
Source code transformation (SCT): the compiler processes source code so that the derivatives are calculated alongside each instruction.
Operator overloading (OO): operators are overridden so that derivatives are calculated for numbers and vectors.
Based on the state of software these days, I'm guessing that OO (the "bare hands" method) is what's mainstream. It would be far better IMHO to use SCT, since it's a universal solution that doesn't require manually refactoring programs.
But stuff like SCT might be considered metaprogramming, which seems to have fallen out of fashion. I grew up with C++ macros and templates, so I feel that this is somewhat tragic, although readability and collaboration are much better today. A modern example might be something like aspect-oriented programming (AOP):
https://en.wikipedia.org/wiki/Aspect-oriented_programming
I once used the AOP library AspectJ to trace a Java app's execution, since Java made the (unfortunate) choice to focus on objects rather than functions, which makes it generally a poor fit for data processing, due to its high use of mutable state within objects (mutable state is what limits most object-oriented projects to around 1 million lines). Meaning that I couldn't remember the program's context as I was stepping through it, and had to analyze traces instead. AspectJ allows one to hook into the code without modifying it, sort of like a debugger, so that stuff like function calls and variable mutations can be watched:
https://en.wikipedia.org/wiki/AspectJ
https://www.eclipse.org/aspectj/doc/released/progguide/index...
Looks like this might still be an open problem in Python:
https://stackoverflow.com/questions/12356713/aspect-oriented...
https://docs.spring.io/spring-python/1.2.x/sphinx/html/aop.h...
But it seems like SQL would be a good candidate for AOP:
https://stackoverflow.com/questions/12271588/aspect-oriented...
https://technology.amis.nl/it/aspect-oriented-programming-ao...
If we had that, maybe we could automatically generate derivatives for the set operations. Then either access them as variables in stored procedures, or possibly as something like views or via metadata stored somewhere like MySQL's INFORMATION_SCHEMA.
I don't really know, but maybe these breadcrumbs could be helpful.
SQL:
- does not allow for easy and clean importing of modules/libraries
- is not easily to write tests for
- has limited support for a debugger
- lacks a consistent style for such large queries (plus most textbook cover fairly simple stuff) which means it's hard for a developer to start reading someone else's code (more than in other languages)
- clearly indicates in its name that it is a Query language.
Save yourself the trouble and all your collaborators the pain of working with this code in the future, of trying to add new features, of trying to reuse it in another project.
If you want to operate near the data, use PL/Python for PostgreSQL.
EDIT: Fixed formatting.
-pgTAP exists for testing.
-A large query in SQL is not made smaller but translating it into an ORM DSL.
-If "Query" in "SQL" means it's for querying data, then evidently "Query" not being in say Java or Python means those languages are NOT meant for querying data. If that's true, then why would you use them for querying data?
If X then Y does not imply if not X then not Y. Java and Python do not indicate a purpose in their name because they are general-purpose.
About your last point, I don't think that was my line of reasoning, but, yes, for the love of what is precious, don't open SQL files as python/java file objects and then parse and rummage through them to find the data you are looking for. Not impossible, just hard to maintain.
Thanks for pointing out pgTAP, didn't know about this.
For some reason, data-science folks haven't yet caught up with ORMs.. I don't know if this is good or bad, but (as the OP shows) they are more used to rows and columns (or graphs) than objects. Maybe that will change one day.
If I understand correctly: you'd run the recursive query, it produces results for every step, effectively showing you the progression of output over time, and then once it hits "present day", it completes and stops?
How would you generate results going forward? I.E. A minute elapses after the results return, do you have to re-run the whole query for all time?
Here's I would really love to know: why is it that SQL is, to first order, the only language used to interact with databases, and SQL has about the same features as it did in the 70s? It seems analogous to if the general-purpose programming world stopped with C.
Mercifully, finally, the cold war against SQL in particular and against relational databases in general seems to be thawing. Articles like this one exemplify that trend
SQL draws a very hard line between the expression of the query and the AST that is used in the actual implementation. Database vendors like this aspect because they are free to implement whichever optimizations they want, but application developers want to build queries programmatically and optimize them themselves before passing them to the db engine, hence the tension in threads like these.
Because SQL is effectively a domain-specific language. If you added 100+ additional keywords/functions/etc., do you think it would be easier or more difficult for the average developer to build something clean with it?
I look at SQL like a pile of bricks. You don't want complicated, unicorn bricks or you won't be able to fit anything meaningful together. Experienced masons almost certainly prefer their materials to be as dumb and consistent as possible.
And now we know what's gone wrong with Lego for the last two-three (close to four?) decades.
Funny – and sad! – how popular this shit has been on sites like this, though. Dare one hope that that popularity might begin to wane? Idunno
It's specifically not stochastic. From the article:
Online gradient descent
Finally, we have enough experience to implement online gradient descent. To keep things simple, we will use a very vanilla version:
- Constant learning rate, as opposed to a schedule.
- Single epoch, we only do one pass on the data.
- Not stochastic: the rows are not shuffled. ⇠ ⇠ ⇠ ⇠
- Squared loss, which is the standard loss for regression.
- No gradient clipping.
- No weight regularisation.
- No intercept term.
Gradient descent: https://en.wikipedia.org/wiki/Gradient_descent
Stochastic gradient descent: https://en.wikipedia.org/wiki/Stochastic_gradient_descent
Online machine learning: https://en.wikipedia.org/wiki/Online_machine_learning
adversarial gradient descent site:github.com inurl:awesome : https://www.google.com/search?q=awesome+adversarial+gradient...
https://github.com/EthicalML/awesome-production-machine-lear...
Robust machine learning: https://en.wikipedia.org/wiki/Robustness_(computer_science)#...
Robust gradient descent
- stochasticLinearRegression (https://clickhouse.com/docs/en/sql-reference/aggregate-funct...)
- stochasticLogisticRegression (https://clickhouse.com/docs/en/sql-reference/aggregate-funct...)
There are countless meaningful correlations in financial data that would have been just as easy to play around with. One truly valuable example would be to look at trading multiples of comparable companies. Sticking to P/E would be easier as P is easily observable and forward-looking EPS estimates are generally always available. This would limit the exercise to more mature companies than the ones commonly discussed on HN but would make it actually meaningful
It's hard to find this information out there, so here's ~all you need to know.
Data is usually behind paywalls, unfortunately. Industry standards are Bloomberg terminal (ridiculously expensive, 5 digits $), FactSet (very expensive, 4 digits), Capital IQ (expensive, not sure)... but there are a number of up-and-coming startups trying to disrupt the space so you may be able to grab data from them. I think https://atom.finance has a 7-day free trial you could use to play around with.
P/E simply means the company's _P_rice per share divided by _E_arnings per share. Cancel out the "per share" terms and you get total market capitalization (which is the value of the total equity) divided by net income (since "earnings per share" really means "net income per share")
So the "P" is easy to get. It's your Adj Close.
The "E" is trickier as it can mean a lot of things. Diluted EPS from financial statements? Last year's EPS? Management's guidance for EPS? None of those are actually correct even if they are all "EPS"
Importantly--and contrary to 99% of the info you will find online--the most relevant EPS number are forward estimates of EPS, usually for the next twelve months ("NTM"). That is based on an average or median of analyst estimates which is called "consensus". These are analysts from financial institutions who build their own little models based on their own views of where the business is going to go, informed by recent earnings, management's color in earnings calls and filings, etc.
Believe it or not, as hairy as that sounds, EPS is fairly easy to get as it's a metric that has less room for interpretation than, say, EBITDA.
So you're not going to go out there, read all these (paid) analyst reports, find their EPS, calculate the median, etc. Bloomberg, Capital IQ, FactSet do this for you and it's easily observable for the end user (that's their business).
The thing is, as you may have guessed, "next twelve months" are a moving target across time. Analysts usually provide estimates for the current fiscal year (i.e. FY 2023, ending 12/31/2023 for most companies) and the following year, ending 12/31/2024. Let's call these FY0_EPS and FY1_EPS, for simplicity
You might be tempted to just take a moving average of these two estimates, so that on 1/1/2023 it is 100% of FY0_EPS + 0% of FY1_EPS, on 1/2/2023 it is 99.9% + 0.1% and gradually "move forward in time" as the days pass. That sort of works (and definitely checks the box for a proof-of-concept like in your post) but for the sake of completeness, I'll just say that the right-er approach is to only "move forward in time" when new earnings are released. So it doesn't matter if we're in 1/1/2023 or 2/1/2023--what matter is what is the latest reported quarter. Take Coca-Cola for instance (https://www.bamsec.com/companies/21344/coca-cola-co). Let's roll the tape backward one year. They reported FY 2021 earnings on 2/22/2022, at which point analysts published new estimates in revised models, so on from that day forward until the next quarterly earnings we take 100% FY0_EPS + 0% FY1_EPS, in which these correspond to estimates for FY 2022 and FY 2023, respectively.
On 4/1/2022, Coca-Cola reported Q1 2022 results, analysts published new estimates, and we now take 75% FY0_EPS + 25% FY1_EPS. On 7/1/2022, we move forward another quarter so 50% + 50%, then 25% + 75% starting on 10/26 and then back to square one with 100% + 0% except FY0_EPS now means FY 2023 vs FY 2022 previously, and FY1_EPS means FY 2024
So your table is something like (I'm making up numbers)
+------------+--------+-----------+---------+---------+-------------+-------------+-------------+
| Date | Ticker | Adj_Close | FY0_EPS | FY1_EPS | Period | NTM0_Weight | NTM1_Weight |
| 01/01/2022 | KO | 90.10 | 20.00 | 24.00 | Q1 2022 | 1.00 | 0.00 |
| 01/02/2022 | KO | 91.14 | 20.00 | 24.00 | Q1 2022 | 1.00 | 0.00 |
| 01/03/2022 | KO | 89.30 | 20.00 | 24.00 | Q1 2022 | 1.00 | 0.00 |
| 01/04/2022 | KO | 91.09 | 20.00 | 24.00 | Q1 2022 | 1.00 | 0.00 |
| 01/05/2022 | KO | 92.01 | 20.00 | 24.00 | Q1 2022 | 1.00 | 0.00 |
| 01/06/2022 | KO | 89.05 | 20.00 | 24.00 | Q1 2022 | 1.00 | 0.00 |
...
| 07/02/2022 | KO | 89.05 | 19.50 | 23.20 | Q2 2022 | 0.75 | 0.25 |
With that you can take NTM0_Weight and NTM1_Weight to calculate NTM_EPS by multiplying those weights by FY0_EPS and FY1_EPS. And then can take AdjClose / NTM_EPS to calculate P/EWhy is this useful? Because in theory you can take the average P/E of companies X, Y and Z in one industry and compare it to a fourth company W. Is W's P/E multiple above or below the industry average? You now know if they are over or undervalued, respectively, which means you know if you should buy or sell that stock (if you believe you picked the right "comparable" companies in that industry)
This is just one example... there are all sorts of similar analyses done daily in the financial services industry. I'm not saying it's easy to extract alpha from trading on these, but that's the framework
https://lite.datasette.io/?json=https://gist.github.com/simo...
(I replaced "figures" with "raw" due to the way Datasette Lite assigns a default table name to the imported JSON)
But the more complex recursive queries gave me this error and I'm not sure how to work around it:
recursive reference in a subquery: state
E.g. https://lite.datasette.io/?json=https://gist.github.com/simo...However, it did make me wonder what this might look like on a gpu-accelerated database engine that is designed to leverage the SIMD parallelism of GPGPU architectures.
Beyond using SQL/NoSQL databases for CRUD apps I am not a "database guy", so I'm not sure about the feasibility, but it would be interesting to see it implemented.
On stream inference with something like "continuous" data in the same structure I think is the final material form for "AI" so this is a great step towards that
Thanks for the writeup
> Also, I can't justify why, but my gut feeling is that the database should be an order of magnitude faster than Python, provided the implementation were to be improved.
Would be curious how that could end up being the case. Perhaps if NumPy wasn't used at all? That would mean no vectorization or optimized storage.
Would be interesting to see how it scaled with length and dimensionality
One of the main benefits of doing differentiable programming over a relational database is that you can use joins to traverse the normalized data schema, and this acts as a very effective prior for your model. For example, you can learn parameters associated to shirt colors by having a parameter in the color table, and joining it with the shirt sales table (through the shirts table). And knowledge of the table structure also lets you improve convergence, for example by adding a factor to compensate for the fact that not all colors have the same number of sales (and therefore have been updated fewer times during each epoch). Here's a paper from last year: https://arxiv.org/abs/2209.03771v1
Automatic differentiation is an absolute must-have. Once you start having a few complex joins and aggregations, it's too hard to derive the gradient by hand. Doing automatic differentiation on relational queries requires some adjustments (in particular, because you cannot use a tape), so you end up having to define a subset of relational algebra that is closed by automatic differentiation. The general ideas were presented in https://ceur-ws.org/Vol-2971/paper07.pdf
On the other hand, while it's possible to do SQL-to-SQL automatic differentiation, the resulting queries have poor performance, so it's better to go one step lower (at the level of the executor for an already-planned query) and perform the automatic differentiation there. And we had an excellent intern work with us on dedicated parallelization for gradient descent queries: https://blog.lokad.com/pdf/reproducible-parallel-sgd-ziyad-b...
Finally, in practice, it usually doesn't matter if the gradient descent is stochastic, so long as you run several passes over the data.
I'm very interested in differentiable programming on relational languages, as the mainstream research concentrates on unstructured, low-information-density data (images/sound/text) when there are many domains where the data is more dense and structured, and (because of that structure) is stored in a database.