Stochastic gradient descent written in SQL (opens in new tab)

(maxhalford.github.io)

295 pointsLemaxoxo3y ago182 comments

182 comments

108 comments · 30 top-level

jackschultz3y ago· 45 in thread

>A machine learning algorithm which can be trained using SQL opens a world of possibilities. The model and the data live in the same space. This is as simple as it gets in terms of architecture. Basically, you only need a database which runs SQL.

First paragraph of the conclusion, and this very much fits with the mindset that's been growing in me in the data world over the past few years. Databases are much more powerful than we think, they're not going to go away, only get better, and having the data and the logic in the same space really removes tons of headaches. ML models, transformation of data, generating json for an API can all be done within the database rather than outside scripting language.

Are others seeing this? Are the current tradeoffs just that more people know python vs sql or database specific languages to where moving logic to postgres or snowflake is looked down on?

ISL3y ago

The challenge I, with limited knowledge, see with developing detailed algorithms in SQL is a lack of good testing, abstraction, and review tooling. Similarly for a lot of the user-defined-functions for the larger data warehouses (redshift, bigQuery, etc.)

dbt solves a lot of it, but I'd love to learn more about good resources for building reliable and readily-readable SQL algorithms for complex logic.

adeelk933y ago

I've supported 3 different models over the years with inference implemented in SQL. First one I inherited, loved it so much that I implemented it twice again. Amazingly fast for TBs of data and no waiting on other tech teams.

That tooling you're describing is definitely not there. Bigquery has BQML but it's very much in its infancy. I tend to do all the modeling in Python/R on sampled data and then deploy to SQL.

1 more reply

AndrewKemendo3y ago

I mean this is an opportunity right? Build those things

1 more reply

spprashant3y ago

The problem with throwing everything in a database is you end up with brittle stored procedures all over the place, which are painful to debug. There is no good support for version control or testing, which means you end up creating a dozen copies of each function named (sp_v1, sp_v2,.., etc.). It much more harder to practice iterative development which the rest of software development seems implements effectively.

Also traditional relational databases have a way to go before they can support parallelized machine learning workloads. You do not have control or the ability to spin up threads or processes to boost your performance. You rely on the query processor to make those decisions, and depending on your platform the results will be mixed.

rukuu0013y ago

No comment re rDBs supporting parallelized ML but re stored procedures - if your workflow evolves to treat them as ‘1st class’ code assets they’ll be just the same as the rest of your code.

We always had them in version control, unit tested etc. The tools are there if you want to use them.

2 more replies

CRConrad3y ago

> The problem with throwing everything in a database is you end up with brittle stored procedures all over the place, which are painful to debug. There is no good support for version control or testing...

Shell scripts with the SQL as Here-Docs, in stead of stored procedures. Ordinary text files, eminently gittable. Also lets you insert varying table/column names as environment variables, in stead of the regex rigmarole in TFA. (Or was that in another post on his blog? Just returned from reading several of them.) Best of both worlds.

bob10293y ago

> Databases are much more powerful than we think

The older I get the more I agree with this.

There is nothing you cannot build by combining SQL primitives. Side effects can even be introduced - on purpose - by way of UDFs that talk to the outside world.

I've seen more than one system where the database itself was directly responsible for things like rendering final HTML for use by the end clients. You might think this is horrible and indeed many manifestations are. But, there lurk an opportunity for incredible elegance down this path, assuming you have the patience to engage in it.

giraffe_lady3y ago

> I've seen more than one system where the database itself was directly responsible for things like rendering final HTML for use by the end clients.

I did this for a side project a few months ago and even used postgrest to serve the page with correct headers for html. It felt simultaneously really cursed and obvious. Shit you could even use plv8 to run mustache or whatever in the db if you really wanted to piss people off.

1 more reply

CRConrad3y ago

> Side effects can even be introduced - on purpose - by way of UDFs that talk to the outside world.

> I've seen more than one system where the database itself was directly responsible for things like rendering final HTML for use by the end clients.

I built BI Web pages entirely in PL/SQL (using some Oracle modules I can't quite recall the nams of) over twenty years ago. Got a bit hairy to keep the meta-recursive stuff in your head when hard-coding JavaScript into it, calling the next such page for each value on the page, for drill-down to the next level... Not fun, but quite doable.

And with a bit more modern tooling -- as I said, this was turn-of-the-century tech -- I have no doubt it could be done much more effectively and efficiently nowadays.

Foobar85683y ago

This PL/SQL Overlord is running banks https://www.avaloq.com/solutions/products/avaloq-core

noloblo3y ago

+1 sql is extremely elegant composable and is under rated

Postgres is very powerful. While I sought a short detour in nosql Mongodb land now back to Mysql Postgresql sql territory and glad for it

Being able to generate views is and stored procedures is useful as well.having sql Take over more like ml, gradient descent does open up good possibility.

Also since sql is declarative it Makes it so it's rather easier than imperative scripting languages

naasking3y ago

SQL has some positives but it is not composable. At all. This is because relations are not first-class values in SQL.

2 more replies

noloblo3y ago

Postgres has read write parallism that can scale across millions of read writes ; if ml model is inherent in the Postgres db it is indeed very elegant reminds me of the glory days of tsearch2 to do text search in Postgres for searching our internal document repository using nothing but Postgres and boy was it faster than Java based search systems

justsomeuser3y ago

I think general programming languages are better for general programs than SQL.

Specifically they have: Type systems, compilers, debuggers, text editors, package managers, C FFI etc.

But I agree that having the data and the program in the same process has benefits.

Writing programs in SQL is one way.

Another way is to move your data to your general program with SQLite.

I like using SQL for ACID, and queries as a first filter to get the data into my program where I may do further processing with the general language.

Serow2253y ago

Another is MS SQL Server, which lets you run .NET on the database server :D "you can author stored procedures, triggers, user-defined functions, user-defined types, and user-defined aggregates in managed code"

1 more reply

fifilura3y ago

> Type systems

SQL has types

> compilers

For what specifically do you need a compiler?

> debuggers

Some tasks - like the concurrency SQL enables - are just very difficult to debug with debuggers. It would be the same with any other language. What SQL does here though is to allow you to focus on the logic, not the actual concurrency,

> text editors, package managers

I feel like these two are just for filling up the space.

> C FFI

Many SQL engines have UDFs

1 more reply

CRConrad3y ago

> I think general programming languages are better for general programs than SQL. Specifically they have: Type systems, compilers, debuggers, text editors, package managers, C FFI etc.

Non sequitur. SQL is typed; SQL can be edited in any text editor; there are lots of SQL IDEs and, arguably, debuggers and package managers. Sure, the package managers are specific to each RDBMS, but so what? Npm is no use in COBOL either. And sure, the “debuggers”, to the extent they can be said to exist, are radically different from those of “conventional” – of other – programming languages. But again, so what? A Smalltalk debugger is no use to fix the output from Intel’s latest C compiler either, or vice versa.

IOW: There is no such “SQL vs programming languages” dichotomy. SQL is just another programming language, with its own strengths and weaknesses, just like all the rest of them. “The rest” are not distinguished from SQL by somehow magically all having the attributes you claim for them: Some have them, some don't; some have this but not that, some others, the other way around. Someone built all those IDEs and debuggers and package managers for (some / many / most of) those other languages; you can build them for SQL too.

wpietri3y ago

Another option is to use an approach like Prevayler: https://prevayler.org/

The basic notion is you keep your data hot in RAM and manage it directly. You make every change an object (or a command), and write that out serially to a log before you execute it. That gets you the ACID guarantees but with no I/O but linear writes, so it can be extremely fast.

It only makes sense when your data fits conveniently in RAM, but that's a lot of things.

dventimi3y ago

I don't consider querying a relational database, transforming its data, or validating its state transitions to be general programs. I consider those to be special purpose programs, which benefit from special purpose tools tailored for writing them. SQL is one such tool.

CreRecombinase3y ago

I spend most of my time in the parallel universe that is scientific computing/HPC. In this alternate reality SQL (not to mention databases) never really took off. Instead of scalable, performant databases, we have only the parallel filesystem. I'm convinced the reason contemporary scientific computing don't involve much SQL is sociological/path-dependency, but there are also very good technical reasons. Optimizing software in scientific computing involves two steps: 1) Think hard about your problem until you can frame it as one or more matrix multiplications 2) Plug that into a numerical linear algebra library The SQL abstraction (in my experience) takes you very much in the opposite direction.

wpietri3y ago

For sure. Anything done in SQL is running on top of a million lines or more of extremely complicated non-SQL code. If that works for a given use case, great, but if not, optimizing can get very challenging. I'd much rather deal with something closer to the metal.

_a_a_a_3y ago

Self-proclaimed database expert here. What a database is good at depends on what you're trying to get that database to do, at least in part.

Take it into piecess, elegance and efficiency. These will correspond to a logical statement of what you're trying to do, and how quickly the database will actually do it in practice.

SQL can do some nice things in areas, making it elegant in those areas. Elsewhere it can be pretty wordy and ugly.

In efficiency, it comes down largely to how the database is implemented and that also includes the capability of the optimiser. Both of these are out of your control. In my experience trying to turn a database into a number cruncher is just not going to work.

I guess that's long way round of me saying that I don't think I agree with you!

nerdponx3y ago

In the data warehouse / OLAP space, I think we are heading towards a world where the underlying data storage consists of partitioned Arrow arrays, and you can run arbitrary UDFs in arbitrary languages on the (read-only) raw data with zero copy and essentially no overhead other than the language runtime itself and marshalling of the data that is emitted from the UDF.

Something like the Snowflake data storage model + DuckDB as an execution engine + a Pandas/Polars-like API.

There is no reason why we have to be stuck with "database == SQL" all the time. SQL is extremely powerful, but sometimes you need a bit more, and in that case we shouldn't be so constrained.

But in general yes, the world is gradually waking up to the idea that performance matters, and that data locality can be extremely important for performance, and that doing your computations and data processing on your warehouse data in-place is going to be a huge time and money saver in the longer term.

nonethewiser3y ago

> Databases are much more powerful than we think

And a function of what people think is attitudes towards working at the DB level. I see this often with ORM's in the web dev sphere (rather than Dat Science). Yes, ORM's are great but many people rely on them to completely abstract away the database and are terrified by raw sql or even query building. You also see it with services that abstract away the backend like PocketBase, Fireship, etc. Writing a window function or even a sub select looks like black magic to many.

I say this after several experiences with codebases where joins and filtering were often done at the application layer and raw sql was seen as the devil.

boringg3y ago

Opposite here - dont like ORMs. Too much overhead - though i get their value.

7thaccount3y ago

Database first designs make a lot of sense in a lot of ways. I've worked for a company with an Oracle database that has SQL scripts that do all the validation and create text files for downstream usage. I think it makes more sense than a ton of Java, but there are pros and cons. One is that SQL is relational and the advanced stuff can be extra hard to troubleshoot if you don't have enough experience. Even those that can't code can usually understand a for loop and can think imperatively.

Unfortunately it's an expensive commercial product or I'd recommend you look at kdb+ if you work with time series data. The big banks use it and essentially put all thier latest RT data into kdb+ and then can write extremely succinct queries with a SQL-like syntax, but the ability to approach it far more programmatically than what is typically doable with something like PL-SQL. You can even write your neural network or whatever code in less than a page of code as the language of kdb+ is extremely powerful, although also basically incomprehensible until someone puts some time into learning it. It's extremely lightweight though, so very easy to deal with in an interactive fashion.

All that to say I agree with you that it's nice to just have everything you want all in one spot rather than to deal with 4 different tools and pipelines and shared drives and so on.

tomrod3y ago

Absolutely.

Consider too that PostgreSQL databases support different languages, like Python.

Loads of for-profit companies have tried to cash in on this. SAP HANA is one of the ones I've had recent experience with. It is unfortunately a poor implementation. The right architecture tends to be: put your model behind an API interface, not internal on the system. Train your model separately from production systems, and so on.

You might also be interested in checking out MLOps platforms like Kubeflow, Flyte, and others.

dchftcs3y ago

There is a recent trend in database research, ML in databases. Not sure how much an impact it can make though, the sweet spot is doing relatively linear stuff, arguably just a step up from analytical functions in queries, while cutting edge ML needs dedicated GPUs for compute load and often uses unstructured and even binary data.

roflyear3y ago

> Are the current tradeoffs just that more people know python vs sql or database specific languages to where moving logic to postgres or snowflake is looked down on?

Yes, mostly development, deployment, etc.. concerns. I haven't ever seen an org that versions their SQL queries, unless they are in a codebase. The environment is just unfriendly towards that type of management. Nevermind testing! Things that have solutions but we haven't matured enough here, because that type of development has been happening in application code.

Also, SQL is generally more complex than application logic, because they are designed to do different things. What is a simple exercise in iteration or recursion can more easily become something a little more of a headache.

Problems that can be resolved, but they are problems.

daveguy3y ago

I would like for this to be the case. I was DBA of a Postgres database for LIMS almost 2 decades ago. Back then you could code functions for the database to execute on data and it was very powerful, but also very clunky. The tools to do software development in the database was not mature at all. A lot has changed in the past 20 years and SQL has evolved. Do you think SQL will expand that much or there will be APIs built into the database? Near-data functions are powerful and useful, but I would want my development environment to be more like version controlled code than "built-in".

I wonder if near-data functions on small databases is the solution to the limit of statelessness that you have with functions as a service.

Scubabear683y ago

With “traditional” RDBMS systems, putting a lot of code in the DB lead to a lot of scaling issues where you’d need gigantic machines with a lot of RAM and CPU: because the DB was doing so much work. It was expensive and clunky to get HA right.

In more modern DBs being distributed horizontally, this approach may see a rebound. The big “but” is still costs, in my experience in AWS as an example, managed Postgres Aurora was surprisingly expensive in terms of monthly cost.

dventimi3y ago

Why would running code in a database process be intrinsically more expensive than running it in some other process?

2 more replies

maCDzP3y ago

I agree. I took a course in databases and SQL and was blown away by its power. With CTE’s and PLSQL you can do a lot of stuff inside the database.

I played with SQLite and it’s json columns. Once you get the hang of the syntax for walking a json structure you can do all sorts of neat things. Doing the same thing in Python would be tedious.

And I also believe it ended up being way faster than what I did in python.

swyx3y ago

had a very good chat with https://postgresml.org/ last week which is focusing on bringing ML to postgres: https://youtu.be/j8hE8-jZJGU

LemaxoxoOP3y ago

I'm watching it, it's really good. Montana makes a great point: you can move data to the models, or move the models to the data. Data is typically larger than models, so it makes sense to go with the latter.

1 more reply

629514133y ago

Junior developers like me were uncomfortable with SQL twenty years ago. Java ORM frameworks became popular because of the Object-Relational impedance. I kind of see the same kind of sentiment nowadays among newer generations but in Python&Co.

The success of the Apache Spark engine can at least partially be attributed to

* being able to have the same expressive power as SQL but with a real Scala API (including having reusable libraries based on it)

* being able to embed it into unit tests at a low price of additional ~20 seconds latency to spin up a local Spark master

somat3y ago

A dbms is really it's own operating system, usually this is hosted on another operating system, one that understands the hardware.

I remember one place I worked where we had several old graybeard programmers who considered the dbms[1] the operating system, as a unix sysadmin we had some interesting discussions as I was always confused and confined by the dbms and they felt the same about unix.

1. unidata if curious, a weird multi value(not relational) database, very vertically integrated compared to most databases today.

haolez3y ago

I've created a big system with only SQL in the past like you just said. I wouldn't do it again because of these two pain points that I had to deal with:

1. it's really hard to debug SQL queries and stored procedures (at least it was in Postgres 11)

2. when you hit a performance bottleneck, you don't have much control over it - parallelizing is hard and you have to trick the query planner to do what you want (and it doesn't work sometimes)

dventimi3y ago

Not challenging your experience but genuinely trying to learn from it, in broad strokes what kinds of things were you doing in these stored procedures?

margorczynski3y ago

The problem with this approach is that:

1) No static typing 2) Updating the logic requires migrations 3) You put the logic into the place which is the hardest to scale and many times a single point of failure 4) Cannot compose the code effectively and general verbosity

It's one of those ideas that sounds great on paper and maybe works in some smaller problems but as you go up in complexity things get worse and worse

FpUser3y ago

It is tempting to combine web server, database and some imperative language with built in data oriented / SQL features in a single executable and call it an application server that would communicate with the outside world using for example JSON based RPC. I think there were / are some products in the area even with the built in IDE (like Wakanda).

fzeindl3y ago

Fully agree.

And using PostgREST [0] you can serve your postgreSQL database as REST-API. And if you throw foreign data wrappers / multicorn in the mix, you can map any other datasource into your postgreSQL-db as table.

[0] https://postgrest.org/en/stable/

1 more reply

jhd33y ago

> Databases are much more powerful than we think

and data has mass. One example of bringing the work to the data is https://madlib.apache.org/ (works on Postgres and Greenplum)

[Disclaimer - former employee of Pivotal]

actionfromafar3y ago

I have yet to see a decent IDE or system which allows great version control, unit testing and collaboration with SQL source code.

So I think a lot of the reluctance is from practical concerns.

noloblo3y ago

We store individual sql files in github and keep them in separate folders

This is very simple and scales well for our purposes

zackmorris3y ago· 5 in thread

This is great! Moving away from the proprietary nature of GPUs and complex math gatekeeping should help democratize AI.

Has anyone converted stuff like gradient descent to set theory?

https://support.unicomsi.com/manuals/soliddb/7/SQL_Guide/2_G...

https://www.sqlshack.com/mathematics-sql-server-fast-introdu...

https://www.sqlshack.com/learn-sql-set-theory/

Right now AI algorithms kind of look imperative and stateful to me, like state machines. But there should be a functional or relational representation, where pattern matching would be used to derive the current state instead.

It's trivial to go from functional to imperative representation, but often nearly impossible to go the other way. That's why monadic (sorry if I'm mincing terms) logic is so troublesome. Mutable variables, futures/promises, async, etc can't be statically analyzed, which is why most code today is difficult to parallelize and stuff like C++ optimizers don't even increase speed to the next order of magnitude. But functional languages have nearly infinite optimization potential through divide and conquer approaches like sharding and scatter-gather arrays that can run internally without side effects. In other words, we can throw hardware at SQL for linear speedup since it's embarrassingly parallel, but might have limited success optimizing something like Python.

teruakohatu3y ago

> This is great! Moving away from the proprietary nature of GPUs and complex math gatekeeping should help democratize AI.

I understand the sentiment but you can run ML on plain CPUs, in fact it is easier to do so, you just don't get the massive speed benefits of GPUs. Hugging face's public models are all run on CPUs to save costs.

> Right now AI algorithms kind of look imperative and stateful to me, like state machines.

The maths is not stateful or imperative, only our implementation. Much like how functional languages are executed on a stateful CPU.

Research in parallelization is ongoing.

hdespiritu3y ago

Your question "Has anyone converted stuff like gradient descent to set theory?" doesn't make sense from the perspective that gradient descent uses differential calculus to find min/max points of an objective function and differential calculus requires a lot of additional assumptions on top of set theory.

That being said, current deep learning libraries such as JAX and Pytorch use automatic differentiation to efficiently compute partial derivatives used for optimization algorithms such as gradient descent and it's not clear to me what the level of effort would be to convert that to something that could run efficiently in SQL?

zackmorris3y ago

Thank you, I knew that derivatives were used in gradient descent, but automatic differentiation is new to me. The Wikipedia article is fairly opaque compared to what I learned in school, but this tidbit stood out:

https://en.wikipedia.org/wiki/Automatic_differentiation#Impl...

Source code transformation (SCT): the compiler processes source code so that the derivatives are calculated alongside each instruction.

Operator overloading (OO): operators are overridden so that derivatives are calculated for numbers and vectors.

Based on the state of software these days, I'm guessing that OO (the "bare hands" method) is what's mainstream. It would be far better IMHO to use SCT, since it's a universal solution that doesn't require manually refactoring programs.

But stuff like SCT might be considered metaprogramming, which seems to have fallen out of fashion. I grew up with C++ macros and templates, so I feel that this is somewhat tragic, although readability and collaboration are much better today. A modern example might be something like aspect-oriented programming (AOP):

https://en.wikipedia.org/wiki/Aspect-oriented_programming

I once used the AOP library AspectJ to trace a Java app's execution, since Java made the (unfortunate) choice to focus on objects rather than functions, which makes it generally a poor fit for data processing, due to its high use of mutable state within objects (mutable state is what limits most object-oriented projects to around 1 million lines). Meaning that I couldn't remember the program's context as I was stepping through it, and had to analyze traces instead. AspectJ allows one to hook into the code without modifying it, sort of like a debugger, so that stuff like function calls and variable mutations can be watched:

https://en.wikipedia.org/wiki/AspectJ

https://www.eclipse.org/aspectj/doc/released/progguide/index...

Looks like this might still be an open problem in Python:

https://stackoverflow.com/questions/12356713/aspect-oriented...

https://docs.spring.io/spring-python/1.2.x/sphinx/html/aop.h...

But it seems like SQL would be a good candidate for AOP:

https://stackoverflow.com/questions/12271588/aspect-oriented...

https://technology.amis.nl/it/aspect-oriented-programming-ao...

If we had that, maybe we could automatically generate derivatives for the set operations. Then either access them as variables in stored procedures, or possibly as something like views or via metadata stored somewhere like MySQL's INFORMATION_SCHEMA.

I don't really know, but maybe these breadcrumbs could be helpful.

mr_toad3y ago

Most ML algorithms are numerical approximations, so yes, they look procedural. You can try and find analytical solutions, but except for the simplest models they’re intractable.

dogcomplex3y ago

Replying to this to remind myself to research later, as this is a very important question

syats3y ago· 4 in thread

Just don't.

SQL:

- does not allow for easy and clean importing of modules/libraries

- is not easily to write tests for

- has limited support for a debugger

- lacks a consistent style for such large queries (plus most textbook cover fairly simple stuff) which means it's hard for a developer to start reading someone else's code (more than in other languages)

- clearly indicates in its name that it is a Query language.

Save yourself the trouble and all your collaborators the pain of working with this code in the future, of trying to add new features, of trying to reuse it in another project.

If you want to operate near the data, use PL/Python for PostgreSQL.

EDIT: Fixed formatting.

dventimi3y ago

-PostgreSQL extensions are easy to include and use.

-pgTAP exists for testing.

-A large query in SQL is not made smaller but translating it into an ORM DSL.

-If "Query" in "SQL" means it's for querying data, then evidently "Query" not being in say Java or Python means those languages are NOT meant for querying data. If that's true, then why would you use them for querying data?

nfw23y ago

> If "Query" in "SQL" means it's for querying data, then evidently "Query" not being in say Java or Python means those languages are NOT meant for querying data

If X then Y does not imply if not X then not Y. Java and Python do not indicate a purpose in their name because they are general-purpose.

1 more reply

syats3y ago

Re modules/libraries: I meant it is not easy to write a piece of SQL code, and then import it into several queries to reuse it, or lend it to someone else for use on their on schema. It is possible, yes, but seldom done, because it is hell. PostgreSQL extensions could be used for this purpose, but developing an extension requires a different set of SQL statements (or luckily, python or c) than those used by the user of the extension, which makes compounding them a bit hard. Not impossible, just hard to maintain,

About your last point, I don't think that was my line of reasoning, but, yes, for the love of what is precious, don't open SQL files as python/java file objects and then parse and rummage through them to find the data you are looking for. Not impossible, just hard to maintain.

Thanks for pointing out pgTAP, didn't know about this.

For some reason, data-science folks haven't yet caught up with ORMs.. I don't know if this is good or bad, but (as the OP shows) they are more used to rows and columns (or graphs) than objects. Maybe that will change one day.

2 more replies

freilanzer3y ago

Also, there is MindsDB: https://mindsdb.com/

andrenotgiant3y ago· 4 in thread

This is really interesting, but a basic part I don't understand: What would it actually look like to run this on a live dataset?

If I understand correctly: you'd run the recursive query, it produces results for every step, effectively showing you the progression of output over time, and then once it hits "present day", it completes and stops?

How would you generate results going forward? I.E. A minute elapses after the results return, do you have to re-run the whole query for all time?

LemaxoxoOP3y ago

Good question. I touched upon this in the conclusion. Basically, if you run this in a streaming SQL database, such as Materialize, then you would get a true online system which doesn't restart from scratch.

hobs3y ago

In a non-streaming db What would prevent you from storing the result set and just using the last iteration to calculate the next?

2 more replies

sergius3y ago

Did you try running it using DuckDB?

1 more reply

noloblo3y ago

Does this work with Postgres?

1 more reply

civilized3y ago· 4 in thread

In the comments here so far, we see a pattern we've seen before. When someone suggests doing something in SQL, there's a lot of concern about SQL being a very limited programming language where it's hard to do proper engineering.

Here's I would really love to know: why is it that SQL is, to first order, the only language used to interact with databases, and SQL has about the same features as it did in the 70s? It seems analogous to if the general-purpose programming world stopped with C.

dventimi3y ago

I believe the PC revolution gave a few generations of programmers the mistaken belief that imperative languages with familiar features (Basic, Pascal, C, Perl, Python, Java, etc.) are the only or the best way to interact with computers. Forms of interaction developed earlier (Forth, Prolog, SQL, etc.) were learned later, grudgingly, if at all. SQL was viewed with disdain, a necessary evil to be held at arm's length. That stalled evolution of SQL, the success of other query languages, or really anything that didn't fit the Algol-style "general purpose programming language" blinders.

Mercifully, finally, the cold war against SQL in particular and against relational databases in general seems to be thawing. Articles like this one exemplify that trend

maxfurman3y ago

Plenty of SQL features have been added since the 70s, notably window functions (which TFA relies on heavily). Most of the major databases are from the 80s, and even new kid on the block MySQL has been around since the mid 90s.

SQL draws a very hard line between the expression of the query and the AST that is used in the actual implementation. Database vendors like this aspect because they are free to implement whichever optimizations they want, but application developers want to build queries programmatically and optimize them themselves before passing them to the db engine, hence the tension in threads like these.

bob10293y ago

> why is it that SQL is, to first order, the only language used to interact with databases, and SQL has about the same features as it did in the 70s?

Because SQL is effectively a domain-specific language. If you added 100+ additional keywords/functions/etc., do you think it would be easier or more difficult for the average developer to build something clean with it?

I look at SQL like a pile of bricks. You don't want complicated, unicorn bricks or you won't be able to fit anything meaningful together. Experienced masons almost certainly prefer their materials to be as dumb and consistent as possible.

CRConrad3y ago

> You don't want complicated, unicorn bricks or you won't be able to fit anything meaningful together. Experienced masons almost certainly prefer their materials to be as dumb and consistent as possible.

And now we know what's gone wrong with Lego for the last two-three (close to four?) decades.

Funny – and sad! – how popular this shit has been on sites like this, though. Dare one hope that that popularity might begin to wane? Idunno

Terretta3y ago· 3 in thread

Title here is wrong. Title in article and headings in article are right: ONLINE gradient descent

It's specifically not stochastic. From the article:

Online gradient descent

Finally, we have enough experience to implement online gradient descent. To keep things simple, we will use a very vanilla version:

- Constant learning rate, as opposed to a schedule.

- Single epoch, we only do one pass on the data.

- Not stochastic: the rows are not shuffled. ⇠ ⇠ ⇠ ⇠

- Squared loss, which is the standard loss for regression.

- No gradient clipping.

- No weight regularisation.

- No intercept term.

LemaxoxoOP3y ago

Hehe I was wondering if someone would catch that. Rest assured, I know the difference between online and stochastic gradient descent. I admit I used stochastic on Hacker News because I thought it would generate more engagement.

airstrike3y ago

Then just call it Non-stochastic Gradient Descent? You can't editorialize titles per HN guidelines

https://news.ycombinator.com/newsguidelines.html

1 more reply

westurner3y ago

What are some adversarial cases for gradient descent, and/or what sort of e.g. DVC.org or W3C PROV provenance information should be tracked for a production ML workflow?

Gradient descent: https://en.wikipedia.org/wiki/Gradient_descent

Stochastic gradient descent: https://en.wikipedia.org/wiki/Stochastic_gradient_descent

Online machine learning: https://en.wikipedia.org/wiki/Online_machine_learning

adversarial gradient descent site:github.com inurl:awesome : https://www.google.com/search?q=awesome+adversarial+gradient...

https://github.com/EthicalML/awesome-production-machine-lear...

Robust machine learning: https://en.wikipedia.org/wiki/Robustness_(computer_science)#...

Robust gradient descent

1 more reply

pradeepchhetri3y ago· 3 in thread

You might want to consider checking out ClickHouse which supports many ML functions natively:

- stochasticLinearRegression (https://clickhouse.com/docs/en/sql-reference/aggregate-funct...)

- stochasticLogisticRegression (https://clickhouse.com/docs/en/sql-reference/aggregate-funct...)

LemaxoxoOP3y ago

Thanks for the links, I wasn't aware of them. The Russians often seem to have a step ahead in the ML world.

UnCommonLisp3y ago

That is an outdated view, it is a global and culturally diverse company.

glogla3y ago

I always love DuckDB articles because I can learn what is new in ClickHouse.

krick3y ago· 2 in thread

From the start I assumed this is a nice playful "Hexing the technical interview" kinda joke. But given the tone of the article, and some of the comments here… Uh, this cannot be serious, right?

guhidalg3y ago

Why couldn't it be serious? Your database already has your data so you don't have to stream it to another machine for training. If you have a developed model that needs retraining on new data, save yourself the network bandwidth (and time!) and just make the DB retrain. This doesn't replace the exploration phase of model-building but if you already know gradient descent works for you then what's wrong with this approach?

sgu9993y ago

I'm very confused as well. Are we facing a wall of GPT-generated comments?

airstrike3y ago· 2 in thread

This is great. The only thing I dislike from this is using these variables to try to predict Adj Close when they are not at all correlated.

There are countless meaningful correlations in financial data that would have been just as easy to play around with. One truly valuable example would be to look at trading multiples of comparable companies. Sticking to P/E would be easier as P is easily observable and forward-looking EPS estimates are generally always available. This would limit the exercise to more mature companies than the ones commonly discussed on HN but would make it actually meaningful

LemaxoxoOP3y ago

Do you have some data/resources on this? I'm a total snowflake at this, but I'm willing to learn.

airstrike3y ago

In hindsight this is harder than it seems if you don't already have access to data, so I'll try to be as informative as possible in my response

It's hard to find this information out there, so here's ~all you need to know.

Data is usually behind paywalls, unfortunately. Industry standards are Bloomberg terminal (ridiculously expensive, 5 digits $), FactSet (very expensive, 4 digits), Capital IQ (expensive, not sure)... but there are a number of up-and-coming startups trying to disrupt the space so you may be able to grab data from them. I think https://atom.finance has a 7-day free trial you could use to play around with.

P/E simply means the company's _P_rice per share divided by _E_arnings per share. Cancel out the "per share" terms and you get total market capitalization (which is the value of the total equity) divided by net income (since "earnings per share" really means "net income per share")

So the "P" is easy to get. It's your Adj Close.

The "E" is trickier as it can mean a lot of things. Diluted EPS from financial statements? Last year's EPS? Management's guidance for EPS? None of those are actually correct even if they are all "EPS"

Importantly--and contrary to 99% of the info you will find online--the most relevant EPS number are forward estimates of EPS, usually for the next twelve months ("NTM"). That is based on an average or median of analyst estimates which is called "consensus". These are analysts from financial institutions who build their own little models based on their own views of where the business is going to go, informed by recent earnings, management's color in earnings calls and filings, etc.

Believe it or not, as hairy as that sounds, EPS is fairly easy to get as it's a metric that has less room for interpretation than, say, EBITDA.

So you're not going to go out there, read all these (paid) analyst reports, find their EPS, calculate the median, etc. Bloomberg, Capital IQ, FactSet do this for you and it's easily observable for the end user (that's their business).

The thing is, as you may have guessed, "next twelve months" are a moving target across time. Analysts usually provide estimates for the current fiscal year (i.e. FY 2023, ending 12/31/2023 for most companies) and the following year, ending 12/31/2024. Let's call these FY0_EPS and FY1_EPS, for simplicity

You might be tempted to just take a moving average of these two estimates, so that on 1/1/2023 it is 100% of FY0_EPS + 0% of FY1_EPS, on 1/2/2023 it is 99.9% + 0.1% and gradually "move forward in time" as the days pass. That sort of works (and definitely checks the box for a proof-of-concept like in your post) but for the sake of completeness, I'll just say that the right-er approach is to only "move forward in time" when new earnings are released. So it doesn't matter if we're in 1/1/2023 or 2/1/2023--what matter is what is the latest reported quarter. Take Coca-Cola for instance (https://www.bamsec.com/companies/21344/coca-cola-co). Let's roll the tape backward one year. They reported FY 2021 earnings on 2/22/2022, at which point analysts published new estimates in revised models, so on from that day forward until the next quarterly earnings we take 100% FY0_EPS + 0% FY1_EPS, in which these correspond to estimates for FY 2022 and FY 2023, respectively.

On 4/1/2022, Coca-Cola reported Q1 2022 results, analysts published new estimates, and we now take 75% FY0_EPS + 25% FY1_EPS. On 7/1/2022, we move forward another quarter so 50% + 50%, then 25% + 75% starting on 10/26 and then back to square one with 100% + 0% except FY0_EPS now means FY 2023 vs FY 2022 previously, and FY1_EPS means FY 2024

So your table is something like (I'm making up numbers)

    +------------+--------+-----------+---------+---------+-------------+-------------+-------------+
    | Date       | Ticker | Adj_Close | FY0_EPS | FY1_EPS | Period      | NTM0_Weight | NTM1_Weight |
    | 01/01/2022 |     KO |     90.10 |   20.00 |   24.00 | Q1 2022     |        1.00 |        0.00 |  
    | 01/02/2022 |     KO |     91.14 |   20.00 |   24.00 | Q1 2022     |        1.00 |        0.00 |  
    | 01/03/2022 |     KO |     89.30 |   20.00 |   24.00 | Q1 2022     |        1.00 |        0.00 |  
    | 01/04/2022 |     KO |     91.09 |   20.00 |   24.00 | Q1 2022     |        1.00 |        0.00 |  
    | 01/05/2022 |     KO |     92.01 |   20.00 |   24.00 | Q1 2022     |        1.00 |        0.00 |  
    | 01/06/2022 |     KO |     89.05 |   20.00 |   24.00 | Q1 2022     |        1.00 |        0.00 |  
    ...
    | 07/02/2022 |     KO |     89.05 |   19.50 |   23.20 | Q2 2022     |        0.75 |        0.25 |

With that you can take NTM0_Weight and NTM1_Weight to calculate NTM_EPS by multiplying those weights by FY0_EPS and FY1_EPS. And then can take AdjClose / NTM_EPS to calculate P/E

Why is this useful? Because in theory you can take the average P/E of companies X, Y and Z in one industry and compare it to a fourth company W. Is W's P/E multiple above or below the industry average? You now know if they are over or undervalued, respectively, which means you know if you should buy or sell that stock (if you believe you picked the right "comparable" companies in that industry)

This is just one example... there are all sorts of similar analyses done daily in the financial services industry. I'm not saying it's easy to extract alpha from trading on these, but that's the framework

1 more reply

simonw3y ago· 2 in thread

I tried to replicate this in SQLite. The first few steps worked OK, e.g.

https://lite.datasette.io/?json=https://gist.github.com/simo...

(I replaced "figures" with "raw" due to the way Datasette Lite assigns a default table name to the imported JSON)

But the more complex recursive queries gave me this error and I'm not sure how to work around it:

    recursive reference in a subquery: state

E.g. https://lite.datasette.io/?json=https://gist.github.com/simo...

LemaxoxoOP3y ago

That's too bad, I would have expected it to work out of the box. Other than rewriting the query in a different way, I'm not sure I see an easy workaround. Are you still working on this?

simonw3y ago

No I managed to drag myself away from it and get back to what I'd been intending to work on!

glomgril3y ago· 1 in thread

Incredible post. I laughed when I saw the title, snickered at the first paragraph, and then proceeded to be blown away by the rest of it. Thought I was in for a joke and instead I'm thinking about the nature of ML Ops and what it's become.

hdespiritu3y ago

My sentiment exactly. The premise of the article comes across a little naive because there are so many fine-tuned libraries for specialized hardware architectures that already do this computation very efficiently.

However, it did make me wonder what this might look like on a gpu-accelerated database engine that is designed to leverage the SIMD parallelism of GPGPU architectures.

Beyond using SQL/NoSQL databases for CRUD apps I am not a "database guy", so I'm not sure about the feasibility, but it would be interesting to see it implemented.

Rickasaurus3y ago· 1 in thread

Please don't do this in prod

dventimi3y ago

Many ML models experience offline training, and some online training is fine-tuning a model that was pretrained offline. Hence, I see no particular reason to worry about doing this "in prod."

hubja3y ago· 1 in thread

Super intresting! Love the scrappy mindset ;)

LemaxoxoOP3y ago

Cheers, I learnt with the best :)

snookerdooker3y ago· 1 in thread

I challenge you to add more emojis to your bio ;)

LemaxoxoOP3y ago

Yes I know some people look down on that. I hope it doesn't take away the merits of the article for you hehe.

AndrewKemendo3y ago

This is genius I love it.

On stream inference with something like "continuous" data in the same structure I think is the final material form for "AI" so this is a great step towards that

Thanks for the writeup

yttribium3y ago

At least one F500 company had iteratively reweighted least squares implemented in MS SQL as of ~20y ago. When all you have is a hammer...

agnosticmantis3y ago

Just to disambiguate, here machine learning = linear regression. So no auto-grad needed as gradients are explicit and simple. Still interesting though.

episode0x013y ago

From a comment by the author:

> Also, I can't justify why, but my gut feeling is that the database should be an order of magnitude faster than Python, provided the implementation were to be improved.

Would be curious how that could end up being the case. Perhaps if NumPy wasn't used at all? That would mean no vectorization or optimized storage.

Would be interesting to see how it scaled with length and dimensionality

victorNicollet3y ago

We've been doing this type of gradient descent in production for a couple of years now, and we've been very happy with it. A few insights from my experience:

One of the main benefits of doing differentiable programming over a relational database is that you can use joins to traverse the normalized data schema, and this acts as a very effective prior for your model. For example, you can learn parameters associated to shirt colors by having a parameter in the color table, and joining it with the shirt sales table (through the shirts table). And knowledge of the table structure also lets you improve convergence, for example by adding a factor to compensate for the fact that not all colors have the same number of sales (and therefore have been updated fewer times during each epoch). Here's a paper from last year: https://arxiv.org/abs/2209.03771v1

Automatic differentiation is an absolute must-have. Once you start having a few complex joins and aggregations, it's too hard to derive the gradient by hand. Doing automatic differentiation on relational queries requires some adjustments (in particular, because you cannot use a tape), so you end up having to define a subset of relational algebra that is closed by automatic differentiation. The general ideas were presented in https://ceur-ws.org/Vol-2971/paper07.pdf

On the other hand, while it's possible to do SQL-to-SQL automatic differentiation, the resulting queries have poor performance, so it's better to go one step lower (at the level of the executor for an already-planned query) and perform the automatic differentiation there. And we had an excellent intern work with us on dedicated parallelization for gradient descent queries: https://blog.lokad.com/pdf/reproducible-parallel-sgd-ziyad-b...

Finally, in practice, it usually doesn't matter if the gradient descent is stochastic, so long as you run several passes over the data.

I'm very interested in differentiable programming on relational languages, as the mainstream research concentrates on unstructured, low-information-density data (images/sound/text) when there are many domains where the data is more dense and structured, and (because of that structure) is stored in a database.

college_physics3y ago

Is there a risk that postgres becomes sentient in our lifetime?

gigatexal3y ago

I thought I was decent at SQL until I saw this. Nice article!

saydaark3y ago

This is interesting, but is not using the GPU.

freilanzer3y ago

Or use MindsDB: https://mindsdb.com/

Narew3y ago

There was also this neural network module in redis if you want to do training and inference in a redid DB. (quite old btw) https://github.com/antirez/neural-redis

Foobar85683y ago

R has been natively supported in sql server for a long while (At least since 2017), same for Java or python, e.g. Writing SP in these languages.

PaulHoule3y ago

Reminds me of the time I coded up an optimized nearest neighbor search in SQL.

yza3y ago

Next step would be to implement autodiff. If only PostgreSQL had CUDA support.

rlewkov3y ago

Uhhhh, ya know that Python has a library for that :-)

kunalgupta3y ago

god tier

kilgnad3y ago

An even MORE useful idea is to do this in CSS.

j / k navigate · click thread line to collapse

182 comments

108 comments · 30 top-level

jackschultz3y ago· 45 in thread

Are others seeing this? Are the current tradeoffs just that more people know python vs sql or database specific languages to where moving logic to postgres or snowflake is looked down on?

ISL3y ago

dbt solves a lot of it, but I'd love to learn more about good resources for building reliable and readily-readable SQL algorithms for complex logic.

adeelk933y ago

That tooling you're describing is definitely not there. Bigquery has BQML but it's very much in its infancy. I tend to do all the modeling in Python/R on sampled data and then deploy to SQL.

1 more reply

AndrewKemendo3y ago

I mean this is an opportunity right? Build those things

1 more reply

spprashant3y ago

rukuu0013y ago

No comment re rDBs supporting parallelized ML but re stored procedures - if your workflow evolves to treat them as ‘1st class’ code assets they’ll be just the same as the rest of your code.

We always had them in version control, unit tested etc. The tools are there if you want to use them.

2 more replies

CRConrad3y ago

bob10293y ago

> Databases are much more powerful than we think

The older I get the more I agree with this.

There is nothing you cannot build by combining SQL primitives. Side effects can even be introduced - on purpose - by way of UDFs that talk to the outside world.

giraffe_lady3y ago

> I've seen more than one system where the database itself was directly responsible for things like rendering final HTML for use by the end clients.

1 more reply

CRConrad3y ago

> Side effects can even be introduced - on purpose - by way of UDFs that talk to the outside world.

> I've seen more than one system where the database itself was directly responsible for things like rendering final HTML for use by the end clients.

And with a bit more modern tooling -- as I said, this was turn-of-the-century tech -- I have no doubt it could be done much more effectively and efficiently nowadays.

Foobar85683y ago

This PL/SQL Overlord is running banks https://www.avaloq.com/solutions/products/avaloq-core

noloblo3y ago

+1 sql is extremely elegant composable and is under rated

Postgres is very powerful. While I sought a short detour in nosql Mongodb land now back to Mysql Postgresql sql territory and glad for it

Being able to generate views is and stored procedures is useful as well.having sql Take over more like ml, gradient descent does open up good possibility.

Also since sql is declarative it Makes it so it's rather easier than imperative scripting languages

naasking3y ago

SQL has some positives but it is not composable. At all. This is because relations are not first-class values in SQL.

2 more replies

noloblo3y ago

justsomeuser3y ago

I think general programming languages are better for general programs than SQL.

Specifically they have: Type systems, compilers, debuggers, text editors, package managers, C FFI etc.

But I agree that having the data and the program in the same process has benefits.

Writing programs in SQL is one way.

Another way is to move your data to your general program with SQLite.

I like using SQL for ACID, and queries as a first filter to get the data into my program where I may do further processing with the general language.

Serow2253y ago

1 more reply

fifilura3y ago

> Type systems

SQL has types

> compilers

For what specifically do you need a compiler?

> debuggers

> text editors, package managers

I feel like these two are just for filling up the space.

> C FFI

Many SQL engines have UDFs

1 more reply

CRConrad3y ago

> I think general programming languages are better for general programs than SQL. Specifically they have: Type systems, compilers, debuggers, text editors, package managers, C FFI etc.

wpietri3y ago

Another option is to use an approach like Prevayler: https://prevayler.org/

It only makes sense when your data fits conveniently in RAM, but that's a lot of things.

dventimi3y ago

CreRecombinase3y ago

wpietri3y ago

_a_a_a_3y ago

Self-proclaimed database expert here. What a database is good at depends on what you're trying to get that database to do, at least in part.

Take it into piecess, elegance and efficiency. These will correspond to a logical statement of what you're trying to do, and how quickly the database will actually do it in practice.

SQL can do some nice things in areas, making it elegant in those areas. Elsewhere it can be pretty wordy and ugly.

I guess that's long way round of me saying that I don't think I agree with you!

nerdponx3y ago

Something like the Snowflake data storage model + DuckDB as an execution engine + a Pandas/Polars-like API.

There is no reason why we have to be stuck with "database == SQL" all the time. SQL is extremely powerful, but sometimes you need a bit more, and in that case we shouldn't be so constrained.

nonethewiser3y ago

> Databases are much more powerful than we think

I say this after several experiences with codebases where joins and filtering were often done at the application layer and raw sql was seen as the devil.

boringg3y ago

Opposite here - dont like ORMs. Too much overhead - though i get their value.

7thaccount3y ago

All that to say I agree with you that it's nice to just have everything you want all in one spot rather than to deal with 4 different tools and pipelines and shared drives and so on.

tomrod3y ago

Absolutely.

Consider too that PostgreSQL databases support different languages, like Python.

You might also be interested in checking out MLOps platforms like Kubeflow, Flyte, and others.

dchftcs3y ago

roflyear3y ago

> Are the current tradeoffs just that more people know python vs sql or database specific languages to where moving logic to postgres or snowflake is looked down on?

Problems that can be resolved, but they are problems.

daveguy3y ago

I wonder if near-data functions on small databases is the solution to the limit of statelessness that you have with functions as a service.

Scubabear683y ago

dventimi3y ago

Why would running code in a database process be intrinsically more expensive than running it in some other process?

2 more replies

maCDzP3y ago

I agree. I took a course in databases and SQL and was blown away by its power. With CTE’s and PLSQL you can do a lot of stuff inside the database.

I played with SQLite and it’s json columns. Once you get the hang of the syntax for walking a json structure you can do all sorts of neat things. Doing the same thing in Python would be tedious.

And I also believe it ended up being way faster than what I did in python.

swyx3y ago

had a very good chat with https://postgresml.org/ last week which is focusing on bringing ML to postgres: https://youtu.be/j8hE8-jZJGU

LemaxoxoOP3y ago

1 more reply

629514133y ago

The success of the Apache Spark engine can at least partially be attributed to

* being able to have the same expressive power as SQL but with a real Scala API (including having reusable libraries based on it)

* being able to embed it into unit tests at a low price of additional ~20 seconds latency to spin up a local Spark master

somat3y ago

A dbms is really it's own operating system, usually this is hosted on another operating system, one that understands the hardware.

1. unidata if curious, a weird multi value(not relational) database, very vertically integrated compared to most databases today.

haolez3y ago

I've created a big system with only SQL in the past like you just said. I wouldn't do it again because of these two pain points that I had to deal with:

1. it's really hard to debug SQL queries and stored procedures (at least it was in Postgres 11)

2. when you hit a performance bottleneck, you don't have much control over it - parallelizing is hard and you have to trick the query planner to do what you want (and it doesn't work sometimes)

dventimi3y ago

Not challenging your experience but genuinely trying to learn from it, in broad strokes what kinds of things were you doing in these stored procedures?

margorczynski3y ago

The problem with this approach is that:

It's one of those ideas that sounds great on paper and maybe works in some smaller problems but as you go up in complexity things get worse and worse

FpUser3y ago

fzeindl3y ago

Fully agree.

[0] https://postgrest.org/en/stable/

1 more reply

jhd33y ago

> Databases are much more powerful than we think

and data has mass. One example of bringing the work to the data is https://madlib.apache.org/ (works on Postgres and Greenplum)

[Disclaimer - former employee of Pivotal]

actionfromafar3y ago

I have yet to see a decent IDE or system which allows great version control, unit testing and collaboration with SQL source code.

So I think a lot of the reluctance is from practical concerns.

noloblo3y ago

We store individual sql files in github and keep them in separate folders

This is very simple and scales well for our purposes

zackmorris3y ago· 5 in thread

This is great! Moving away from the proprietary nature of GPUs and complex math gatekeeping should help democratize AI.

Has anyone converted stuff like gradient descent to set theory?

https://support.unicomsi.com/manuals/soliddb/7/SQL_Guide/2_G...

https://www.sqlshack.com/mathematics-sql-server-fast-introdu...

https://www.sqlshack.com/learn-sql-set-theory/

teruakohatu3y ago

> This is great! Moving away from the proprietary nature of GPUs and complex math gatekeeping should help democratize AI.

> Right now AI algorithms kind of look imperative and stateful to me, like state machines.

The maths is not stateful or imperative, only our implementation. Much like how functional languages are executed on a stateful CPU.

Research in parallelization is ongoing.

hdespiritu3y ago

zackmorris3y ago

https://en.wikipedia.org/wiki/Automatic_differentiation#Impl...

Source code transformation (SCT): the compiler processes source code so that the derivatives are calculated alongside each instruction.

Operator overloading (OO): operators are overridden so that derivatives are calculated for numbers and vectors.

https://en.wikipedia.org/wiki/Aspect-oriented_programming

https://en.wikipedia.org/wiki/AspectJ

https://www.eclipse.org/aspectj/doc/released/progguide/index...

Looks like this might still be an open problem in Python:

https://stackoverflow.com/questions/12356713/aspect-oriented...

https://docs.spring.io/spring-python/1.2.x/sphinx/html/aop.h...

But it seems like SQL would be a good candidate for AOP:

https://stackoverflow.com/questions/12271588/aspect-oriented...

https://technology.amis.nl/it/aspect-oriented-programming-ao...

I don't really know, but maybe these breadcrumbs could be helpful.

mr_toad3y ago

Most ML algorithms are numerical approximations, so yes, they look procedural. You can try and find analytical solutions, but except for the simplest models they’re intractable.

dogcomplex3y ago

Replying to this to remind myself to research later, as this is a very important question

syats3y ago· 4 in thread

Just don't.

SQL:

- does not allow for easy and clean importing of modules/libraries

- is not easily to write tests for

- has limited support for a debugger

- clearly indicates in its name that it is a Query language.

Save yourself the trouble and all your collaborators the pain of working with this code in the future, of trying to add new features, of trying to reuse it in another project.

If you want to operate near the data, use PL/Python for PostgreSQL.

EDIT: Fixed formatting.

dventimi3y ago

-PostgreSQL extensions are easy to include and use.

-pgTAP exists for testing.

-A large query in SQL is not made smaller but translating it into an ORM DSL.

nfw23y ago

> If "Query" in "SQL" means it's for querying data, then evidently "Query" not being in say Java or Python means those languages are NOT meant for querying data

If X then Y does not imply if not X then not Y. Java and Python do not indicate a purpose in their name because they are general-purpose.

1 more reply

syats3y ago

Thanks for pointing out pgTAP, didn't know about this.

2 more replies

freilanzer3y ago

Also, there is MindsDB: https://mindsdb.com/

andrenotgiant3y ago· 4 in thread

This is really interesting, but a basic part I don't understand: What would it actually look like to run this on a live dataset?

How would you generate results going forward? I.E. A minute elapses after the results return, do you have to re-run the whole query for all time?

LemaxoxoOP3y ago

hobs3y ago

In a non-streaming db What would prevent you from storing the result set and just using the last iteration to calculate the next?

2 more replies

sergius3y ago

Did you try running it using DuckDB?

1 more reply

noloblo3y ago

Does this work with Postgres?

1 more reply

civilized3y ago· 4 in thread

dventimi3y ago

Mercifully, finally, the cold war against SQL in particular and against relational databases in general seems to be thawing. Articles like this one exemplify that trend

maxfurman3y ago

bob10293y ago

> why is it that SQL is, to first order, the only language used to interact with databases, and SQL has about the same features as it did in the 70s?

CRConrad3y ago

And now we know what's gone wrong with Lego for the last two-three (close to four?) decades.

Funny – and sad! – how popular this shit has been on sites like this, though. Dare one hope that that popularity might begin to wane? Idunno

Terretta3y ago· 3 in thread

Title here is wrong. Title in article and headings in article are right: ONLINE gradient descent

It's specifically not stochastic. From the article:

Online gradient descent

Finally, we have enough experience to implement online gradient descent. To keep things simple, we will use a very vanilla version:

- Constant learning rate, as opposed to a schedule.

- Single epoch, we only do one pass on the data.

- Not stochastic: the rows are not shuffled. ⇠ ⇠ ⇠ ⇠

- Squared loss, which is the standard loss for regression.

- No gradient clipping.

- No weight regularisation.

- No intercept term.

LemaxoxoOP3y ago

airstrike3y ago

Then just call it Non-stochastic Gradient Descent? You can't editorialize titles per HN guidelines

https://news.ycombinator.com/newsguidelines.html

1 more reply

westurner3y ago

What are some adversarial cases for gradient descent, and/or what sort of e.g. DVC.org or W3C PROV provenance information should be tracked for a production ML workflow?

Gradient descent: https://en.wikipedia.org/wiki/Gradient_descent

Stochastic gradient descent: https://en.wikipedia.org/wiki/Stochastic_gradient_descent

Online machine learning: https://en.wikipedia.org/wiki/Online_machine_learning

adversarial gradient descent site:github.com inurl:awesome : https://www.google.com/search?q=awesome+adversarial+gradient...

https://github.com/EthicalML/awesome-production-machine-lear...

Robust machine learning: https://en.wikipedia.org/wiki/Robustness_(computer_science)#...

Robust gradient descent

1 more reply

pradeepchhetri3y ago· 3 in thread

You might want to consider checking out ClickHouse which supports many ML functions natively:

- stochasticLinearRegression (https://clickhouse.com/docs/en/sql-reference/aggregate-funct...)

- stochasticLogisticRegression (https://clickhouse.com/docs/en/sql-reference/aggregate-funct...)

LemaxoxoOP3y ago

Thanks for the links, I wasn't aware of them. The Russians often seem to have a step ahead in the ML world.

UnCommonLisp3y ago

That is an outdated view, it is a global and culturally diverse company.

glogla3y ago

I always love DuckDB articles because I can learn what is new in ClickHouse.

krick3y ago· 2 in thread

From the start I assumed this is a nice playful "Hexing the technical interview" kinda joke. But given the tone of the article, and some of the comments here… Uh, this cannot be serious, right?

guhidalg3y ago

sgu9993y ago

I'm very confused as well. Are we facing a wall of GPT-generated comments?

airstrike3y ago· 2 in thread

This is great. The only thing I dislike from this is using these variables to try to predict Adj Close when they are not at all correlated.

LemaxoxoOP3y ago

Do you have some data/resources on this? I'm a total snowflake at this, but I'm willing to learn.

airstrike3y ago

In hindsight this is harder than it seems if you don't already have access to data, so I'll try to be as informative as possible in my response

It's hard to find this information out there, so here's ~all you need to know.

So the "P" is easy to get. It's your Adj Close.

Believe it or not, as hairy as that sounds, EPS is fairly easy to get as it's a metric that has less room for interpretation than, say, EBITDA.

So your table is something like (I'm making up numbers)

    +------------+--------+-----------+---------+---------+-------------+-------------+-------------+
    | Date       | Ticker | Adj_Close | FY0_EPS | FY1_EPS | Period      | NTM0_Weight | NTM1_Weight |
    | 01/01/2022 |     KO |     90.10 |   20.00 |   24.00 | Q1 2022     |        1.00 |        0.00 |  
    | 01/02/2022 |     KO |     91.14 |   20.00 |   24.00 | Q1 2022     |        1.00 |        0.00 |  
    | 01/03/2022 |     KO |     89.30 |   20.00 |   24.00 | Q1 2022     |        1.00 |        0.00 |  
    | 01/04/2022 |     KO |     91.09 |   20.00 |   24.00 | Q1 2022     |        1.00 |        0.00 |  
    | 01/05/2022 |     KO |     92.01 |   20.00 |   24.00 | Q1 2022     |        1.00 |        0.00 |  
    | 01/06/2022 |     KO |     89.05 |   20.00 |   24.00 | Q1 2022     |        1.00 |        0.00 |  
    ...
    | 07/02/2022 |     KO |     89.05 |   19.50 |   23.20 | Q2 2022     |        0.75 |        0.25 |

With that you can take NTM0_Weight and NTM1_Weight to calculate NTM_EPS by multiplying those weights by FY0_EPS and FY1_EPS. And then can take AdjClose / NTM_EPS to calculate P/E

1 more reply

simonw3y ago· 2 in thread

I tried to replicate this in SQLite. The first few steps worked OK, e.g.

https://lite.datasette.io/?json=https://gist.github.com/simo...

(I replaced "figures" with "raw" due to the way Datasette Lite assigns a default table name to the imported JSON)

But the more complex recursive queries gave me this error and I'm not sure how to work around it:

    recursive reference in a subquery: state

E.g. https://lite.datasette.io/?json=https://gist.github.com/simo...

LemaxoxoOP3y ago

That's too bad, I would have expected it to work out of the box. Other than rewriting the query in a different way, I'm not sure I see an easy workaround. Are you still working on this?

simonw3y ago

No I managed to drag myself away from it and get back to what I'd been intending to work on!

glomgril3y ago· 1 in thread

hdespiritu3y ago

However, it did make me wonder what this might look like on a gpu-accelerated database engine that is designed to leverage the SIMD parallelism of GPGPU architectures.

Beyond using SQL/NoSQL databases for CRUD apps I am not a "database guy", so I'm not sure about the feasibility, but it would be interesting to see it implemented.

Rickasaurus3y ago· 1 in thread

Please don't do this in prod

dventimi3y ago

Many ML models experience offline training, and some online training is fine-tuning a model that was pretrained offline. Hence, I see no particular reason to worry about doing this "in prod."

hubja3y ago· 1 in thread

Super intresting! Love the scrappy mindset ;)

LemaxoxoOP3y ago

Cheers, I learnt with the best :)

snookerdooker3y ago· 1 in thread

I challenge you to add more emojis to your bio ;)

LemaxoxoOP3y ago

Yes I know some people look down on that. I hope it doesn't take away the merits of the article for you hehe.

AndrewKemendo3y ago

This is genius I love it.

On stream inference with something like "continuous" data in the same structure I think is the final material form for "AI" so this is a great step towards that

Thanks for the writeup

yttribium3y ago

At least one F500 company had iteratively reweighted least squares implemented in MS SQL as of ~20y ago. When all you have is a hammer...

agnosticmantis3y ago

Just to disambiguate, here machine learning = linear regression. So no auto-grad needed as gradients are explicit and simple. Still interesting though.

episode0x013y ago

From a comment by the author:

> Also, I can't justify why, but my gut feeling is that the database should be an order of magnitude faster than Python, provided the implementation were to be improved.

Would be curious how that could end up being the case. Perhaps if NumPy wasn't used at all? That would mean no vectorization or optimized storage.

Would be interesting to see how it scaled with length and dimensionality

victorNicollet3y ago

We've been doing this type of gradient descent in production for a couple of years now, and we've been very happy with it. A few insights from my experience:

Finally, in practice, it usually doesn't matter if the gradient descent is stochastic, so long as you run several passes over the data.

college_physics3y ago

Is there a risk that postgres becomes sentient in our lifetime?

gigatexal3y ago

I thought I was decent at SQL until I saw this. Nice article!

saydaark3y ago

This is interesting, but is not using the GPU.

freilanzer3y ago

Or use MindsDB: https://mindsdb.com/

Narew3y ago

There was also this neural network module in redis if you want to do training and inference in a redid DB. (quite old btw) https://github.com/antirez/neural-redis

Foobar85683y ago

R has been natively supported in sql server for a long while (At least since 2017), same for Java or python, e.g. Writing SP in these languages.

PaulHoule3y ago

Reminds me of the time I coded up an optimized nearest neighbor search in SQL.

yza3y ago

Next step would be to implement autodiff. If only PostgreSQL had CUDA support.

rlewkov3y ago

Uhhhh, ya know that Python has a library for that :-)

kunalgupta3y ago

god tier

kilgnad3y ago

An even MORE useful idea is to do this in CSS.

j / k navigate · click thread line to collapse