How linear regression works intuitively and how it leads to gradient descent (opens in new tab)

(briefer.cloud)

334 pointslucasfcosta1y ago101 comments

101 comments

64 comments · 19 top-level

jampekka1y ago· 10 in thread

The main practical reason why square error is minimized in ordinary linear regression is that it has an analytical solution. Makes it a bit weird example for gradient descent.

There are plenty of error formulations that give a smooth loss function, and many even a convex one, but most don't have analytical solutions so they are solved via numerical optimization like GD.

The main message is IMHO correct though: square error (and its implicit gaussian noise assumption) is all too often used just per convenience and tradition.

jbjbjbjb1y ago

I’ve always felt that ML introductions completely butcher OLS. When I was taught it in stats we had to consider the Gauss-Markov conditions and interpret the coefficients, we would study the residuals. ML introductions just focus getting good predictions.

soVeryTired1y ago

IMO that's the fundamental difference between statistics and ML. The culture of stats is about fitting a model and interpreting the fit, while the culture of ML is to treat the model as a black box.

That's one of the reasons that multicollinearity is seen as a big deal by statisticians, but ML practitioners couldn't give a hoot.

3 more replies

orlp1y ago

This isn't true. In practice people don't use the analytical solution for efficient linear regression, they use stochastic methods.

Square error is used because it is the maximum likelihood estimator under the assumption that observation noise is normally distributed, not because it is analytical.

em5001y ago

AFAIK using the analytic solution for linear regression (via lm in R, statsmodels in python or any other classical statistical package) is still the norm in traditional disciplines such as social (economics, psychology, sociology) and physical (bio/chemistry) sciences.

I think that as a field, Machine Learning is the exception rather than the norm, where people people start off or proceed rapidly to non-linear models, huge datasets and (stochastic) gradient based solvers.

Gaussianity of errors is more of a post-hoc justification (which is often not even tested) for fitting with OLS.

jampekka1y ago

If by stochastic methods you mean something like MCMC, they are increasing in popularity, but still used a lot less than analytical or numerical methods. And almost exclusively only for more complicated models than basic linear regression. Sampling methods have major downsides, and approximation methods like ADVI are becoming more popular. Though sampling vs approximations is a bit off topic, as neither usually have closed form solutions.

Even the most popular more complicted models like multilevel (linear) regression make use of the mathematical convenience of the square error, even though the solutions aren't fully analytical.

Square error indeed gives estimates for normally distributed noise, but as I said, this assumption is quite often implicit, and not even really well understood by many practitioners.

Analytical solutions for squared errors have a long history for more or less all fields using regression and related models, and there's a lot of inertia for them. E.g. ANOVA is still the default method (although being replaced by multilevel regression) for many fields. This history is mainly due to the analytical convenience as they were computed on paper. That doesn't mean the normality assumption is not often justifiable. And when not directly, the traditional solution is to transform the variables to get (approximately) normally distributed ones for analytical solutions.

1 more reply

esafak1y ago

...because stochastic methods are implicit regularizers, leading to solutions that generalize better. Let's spell it out for those that don't know.

https://www.inference.vc/notes-on-the-origin-of-implicit-reg...

1 more reply

xadhominemx1y ago

That is incorrect. Least squares follows directly from the central limit theorem.

jampekka1y ago

Central limit theorem tells in practice that gaussian distributions is can be expected to be quite common. And it makes the gaussian distribution a good first guess. Least squares gives the ML estimate for gaussian residuals. I don't find this very direct, and there being a rationale doesn't mean that rationale is what in reality drives the usage.

I mention the relation to the gaussian distribution. Which part of the comment is incorrect?

1 more reply

easygenes1y ago

OLS is a straightforward way to introduce GD, and although an analytic solution exists it becomes memory and IO bound at sufficient scale, so GD is still a practical option.

jampekka1y ago

Computationally OLS is taking the pseudoinverse of the system matrix, which for dense systems has a complexity of O(samples * parameters^2). For some GD implementations the complexity of a single step is probably O(samples * parameters), so there could be a asymptotic benefit, but it's hard to imagine a case where the benefit is even realized, let alone makes a practical difference.

And in any case nobody uses GD for regressions for statistical analysis purposes. In practice Newton-Raphson or other more complicated schemes (with a lot higher computation, memory and IO demands) with a lot nicer convergence properties are used.

1 more reply

c7b1y ago· 7 in thread

One interesting property of least squares regression is that the predictions are the conditional expectation (mean) of the target variable given the right-hand-side variables. So in the OP example, we're predicting the average price of houses of a given size.

The notion of predicting the mean can be extended to other properties of the conditional distribution of the target variable, such as the median or other quantiles [0]. This comes with interesting implications, such as the well-known properties of the median being more robust to outliers than the mean. In fact, the absolute loss function mentioned in the article can be shown to give a conditional median prediction (using the mid-point in case of non-uniqueness). So in the OP example, if the data set is known to contain outliers like properties that have extremely high or low value due to idiosyncratic reasons (e.g. former celebrity homes or contaminated land) then the absolute loss could be a wiser choice than least squares (of course, there are other ways to deal with this as well).

Worth mentioning here I think because the OP seems to be holding a particular grudge against the absolute loss function. It's not perfect, but it has its virtues and some advantages over least squares. It's a trade-off, like so many things.

[0] https://en.wikipedia.org/wiki/Quantile_regression

easygenes1y ago

Yeah. Squared error is optimal when the noise is Gaussian because it estimates the conditional mean; absolute error is optimal under Laplace noise because it estimates the conditional median. If your housing data have a few eight-figure outliers, the heavy tails break the Gaussian assumption, so a full quantile regression for, say, the 90th percentile—will predict prices more robustly than plain least squares.

c7b1y ago

True. But it's worth mentioning that normality is only required for asymptotic inference. A lot of things that make least squares stand out, like being a conditional mean forecast, or that it's the best linear unbiased estimator, hold true regardless of the error distribution.

My impression is that many tend to overestimate the importance of normality. In practice, I'd worry more about other things. The example in the OP, eg, if it were an actual analysis, would raise concerns about omitted variables. Clearly, house prices depend on more factors than size, eg location. Non-normality here could be just an artifact of an underspecified model.

lupire1y ago

How does an upcoming college student, or worse an already graduate, learn statistics like this, with depth of understanding of the meaning of the math, vs just plug an chugging cookbook formulas and "proving" theorems mechanically without the deep semantics?

ayhanfuat1y ago

Statistical Rethinking is quite good in explaining this stuff. https://xcelab.net/rm/

1 more reply

monkeyelite1y ago

Dont take the “for engineers” version.

> and "proving" theorems mechanically

I think you’ve have a bad experience because writing a proof is explaining deep understanding.

1 more reply

c7b1y ago

I'd say reading about statistics and being curious is a great start :)

levocardia1y ago

Quantile regression is great, especially when you need more than just the average. A quantile model for, say, the 10th and 90th percentiles of something are really useful for decision-making. There is a great R package called qgam that lets you fit very powerful nonlinear quantile models -- one of R's "killer apps" that keeps me from using Python full-time.

easygenes1y ago· 6 in thread

This is very light and approachable but stops short of building the statistical intuition you want here. They fixate on the smoothness of squared errors without connecting that to the gaussian noise model and establishing how that relates to the predictive power against natural sorts of data.

akst1y ago

It isn't too hard to find resources on this for anyone genuinely looking to get a deeper understanding of a topic. I think a blog post (likely written for SEO purposes, which is in no way a knock against the content) is probably the wrong place that kind of enlightenment, but I also think there are limits to the level of detail you can reasonable expect from a high level blog post.

And for introductory content there's always that risk if you provide to much information you overwhelm the reader, make them feel like maybe this is too hard for them.

Personally I find the process of building a model is a great way of learning all this.

I think a course is probably helpful, but the problem with things like data camp is they are overly repetitive and they don't do a great job of helping you look up earlier content unless you want to scroll through a bunch of videos, where the formula goes on screen for 5 seconds.

Would definitely just recommend getting a book for that stuff, I found "All of statistics" good, I just wouldn't recommend trying to read it from cover to cover, but I have found it good as a manual where I could just look up the bits I needed when I needed it. Tho the book may be a bit intimidating if you're unfamiliar with integration and derivatives (as they often express the PDF/CDF of random variables in those terms).

jovial_cavalier1y ago

>I think a blog post... is probably the wrong place that kind of enlightenment

There's this site full of cool knowledgeable people called Hacker News which usually curates good articles with deep intuition about stuff like that. I haven't been there in years, though.

jfjfjtur1y ago

Yes, and it seems like it could’ve been written in-part by an LLM. But, the LLM could take your criticism, improve upon the original, and iterate that way until you feel that it has produced something close to an optimal textbook. The one thing missing is soul. I noticeably don’t feel like there was anyone behind this writing.

easygenes1y ago

Ah, we’re resorting to ad machinum today. :)

BlueUmarell1y ago

Any resource/link you know of that further develops your point?

easygenes1y ago

CMU lecture notes [0] I think approach it in an intuitive way, starting from the Gaussian noise linear model, deriving log-likelihood, and presenting the analytic approach. Misses the bridge to gradient methods though.

For gradients, Stanford CS229 [1] jumps right into it.

[0] https://www.stat.cmu.edu/~cshalizi/mreg/15/lectures/06/lectu...

[1] https://cs229.stanford.edu/lectures-spring2022/main_notes.pd...

1 more reply

stared1y ago· 5 in thread

I really recommend this explorable explanation: https://setosa.io/ev/ordinary-least-squares-regression/

And for actual gradient descent code, here is an older example of mine in PyTorch: https://github.com/stared/thinking-in-tensors-writing-in-pyt...

revskill1y ago

Google search is evil by not giving me those resources.

stared1y ago

Yeah - I wanted to post it here, but after searching for "linear regression explorable explanation" I got some other random links. Thankfully, I saved the PyTorch materials + https://pinboard.in/u:pmigdal/t:explorable-explanation.

sorcerer-mar1y ago

This is an all-time great blog post for this line alone: "That's why we have statistics: to make us unsure about things."

The interactive visualizations are a great bonus though!

Nifty39291y ago

Google does however provide this very nice course that explains these things in more detail: https://developers.google.com/machine-learning/crash-course

mhb1y ago

Kagi FTW?

1 more reply

tibbar1y ago· 4 in thread

Some important context missing from this post (IMO) is that the data set presented is probably not a very good fit for linear regression, or really most classical models: You can see that there's way more variance at one end of the dataset. So even if we find the best model for the data that looks great in our gradient-descent-like visualization, it might not have that much predictive power. One common trick to deal with data sets like this is to map the data to another space where the distribution is more even and then build a model in that space. Then you can make predictions for the original data set by taking the inverse mapping on the outputs of the model.

levocardia1y ago

Non-constant variance does not actually bias the coefficients of a linear regression model -- thus, its predictions will be just fine. What it does is underestimate the standard errors; your p-values will typically be too small. Sometimes a log-transform or similar can help, but otherwise you can use weighted least-squares.

This kind of problem is actually a good intro to iterative refitting methods for regression models: How do you know what the weights should be? Well, you fit the initial model with no weights, get its residuals, use those to fit another model, rinse and repeat until convergence. A good learning experience and easy to hand-code.

SubiculumCode1y ago

In my work, I hardly ever use linear regression, but do use multiple linear regression. Multiple linear regression allows multiple linear predictors, where the method parses shared and independent variances associated with each predictor. These discussions on linear regression hardly ever touches on the very useful multiple linear regression method. In the case of bad variance inflation in models with multi-collinear predictors, robust regression techniques are advised like ridge, LASSO, or elastic net regression.

In relation to gradient descent, I do not know enough if multiple regression is at all relevant, or why not.

And yeah, for non-normal error distributions, we should be looking at generalized linear models, which allows one to specify other distributions that might better fit the data.

LPisGood1y ago

What you’re describing is the technique known as the “kernel trick”, correct?

levocardia1y ago

No, the kernel trick is something else: basically a nonlinear basis representation of the model. For example, fitting a polynomial model, or using splines, would effectively be using the "kernel trick" (though only ML people use that term, not statisticians, and usually they talk about it in the context of SVMs but it's fine for linear regression too). Transforming the data is just transforming the Y-outcome, most commonly with log(y) for things that tend to be distributed with a right-skew: house prices being a classic example, along with things like income, various blood biomarkers, or really anything that cannot go below zero but can (in principle) be arbitrarily large.

In a few rare cases I have found situations where sqrt(y) or 1/y is a clever and useful transform but they're very situational, often occurring when there's some physical law behind the data generation process with that sort of mathematical form.

2 more replies

jascha_eng1y ago· 4 in thread

The amount of em dashes in this make this look very AI written. Which doesn't make it a bad piece but makes me more carefully check every sentence for errors.

liamwire1y ago

I know this is repeated ad nauseam by now, but as an ardent user of em dashes for many years pre-LLM, I think this a bad heuristic.

lucasfcostaOP1y ago

Co-author and founder of Briefer here.

I used to use em dashes before they were cool. I actually learned about them when I emailed a guy who's a software engineer at Genius and also writes for The New Yorker and The Atlantic.

I asked him for tips on how to write well and he recommended that I read Steven Pinker's "The Sense of Style", which uses em dashes exhaustively, and explains when and why one should use them.

It also pains me that I can't use them anymore or else people will think an AI did the writing.

1 more reply

nabeelahmed131y ago

As another ardent user I actually think it is a good but unfortunate heuristic.

Previously I rarely saw it used in my English-as-second-language peer group, even by otherwise decent writers. Now I see it everywhere in personal/professional updates in my feed by. The simpler assumption is that people over-rely on LLMs for crafting these posts, and LLMs disproportionately use em dashes.

jwilber1y ago

This tired take is in every thread now. The sort of behavior better served by a Reddit bot, and just as annoying.

reify1y ago· 3 in thread

All thats wrong with the modern world

https://www.ibm.com/think/topics/linear-regression

A proven way to scientifically and reliably predict the future

Business and organizational leaders can make better decisions by using linear regression techniques. Organizations collect masses of data, and linear regression helps them use that data to better manage reality, instead of relying on experience and intuition. You can take large amounts of raw data and transform it into actionable information.

You can also use linear regression to provide better insights by uncovering patterns and relationships that your business colleagues might have previously seen and thought they already understood.

For example, performing an analysis of sales and purchase data can help you uncover specific purchasing patterns on particular days or at certain times. Insights gathered from regression analysis can help business leaders anticipate times when their company’s products will be in high demand.

uniqueuid1y ago

While I get your point, it doesn't carry too much weight, because you can (and we often read this) claim the opposite:

Linear regression, for all its faults, forces you to be very selective about parameters that you believe to be meaningful, and offers trivial tools to validate the fit (i.e. even residuals, or posterior predictive simulations if you want to be fancy).

ML and beyond, on the other hand, throws you in a whirl of hyperparameters that you no longer understand and which traps even clever people in overfitting that they don't understand.

Obligatory xkcd: https://xkcd.com/1838/

So a better critique, in my view, would be something that the JW Tukey wrote in his famous 1962 paper: (paraphrasing because I'm lazy):

"better to have an approximate answer to a precise question rather than an answer to an approximate question, which can always be made arbitrarily precise".

So our problem is not the tools, it's that we fool ourselves by applying the tools to the wrong problems because they are easier.

lupire1y ago

My maxim of statistics is that applied statistics is the art of making decisions under uncertainty, but people treat it like the science of making certainty out of data.

1 more reply

alexey-salmin1y ago

That particular xkcd was funny until the LLMs came around

2 more replies

sakras1y ago· 1 in thread

I intuitively think about linear regression as attaching a spring between every point and your regression line (and constraining the spring to be vertical). When the line settles, that's your regression! Also gives a physical intuition about what happens to the line when you add a point. Adding a point at the very end will "tilt" the line, while adding a point towards the middle of your distribution will shift it up or down.

A while ago I think I even proved to myself that this hypothetical mechanical system is mathematically equivalent to doing a linear regression, since the system naturally tries to minimize the potential energy.

cloud-oak1y ago

Perfect analogy! The cool part is that your model also gives good intuition about the gradient descent part. The springs' forces are the gradients, and the act of the line "snapping" into place is the gradient descent process.

Technically, physical springs will also have momentum and overshoot/oscillate. But even this is something that is used in practice, gradient descent with momentumg.

dalmo31y ago· 1 in thread

I don't have anything useful to say, but, how the hell is that a "12 min read"?

I always find those counters to greatly overestimate reading speed, but for a technical article like this it's outright insulting, to be honest.

Workaccount21y ago

It's the common trap of trying to teach, and why teaching is so much more difficult than it appears.

When you intimately understand a topic, you have an intuition that naturally paves over gaps and bumps. This is excellent for getting work done, but terrible for teaching. Your road from start to finish is 12 minutes, and without that knack for teaching, you are unable to see what that road looks like to a beginner.

rogue71y ago· 1 in thread

I built a small static web app [0] (with svelte and tensorflow js) that shows gradient descent. It has two kind of problems: wave (the default) and linear. In the first case, the algorithm learns y = ax + b ; in the second, y = cos(ax + b). The training data is generated from these functions with some noise.

I spent some time making it work with interpolation so that the transitions are smooth.

Then I expanded to another version, including a small neural network (nn) [1].

And finally, for the two functions that have a 2d parameter space, I included a viz of the loss [2]. You can click on the 2d space and get a new initial point for the descent, and see the trajectory.

Never really finished it, though I wrote a blog post about it [3]

[0] https://gradfront.pages.dev/

[1] https://f36dfeb7.gradfront.pages.dev/

[2] https://deploy-preview-1--gradient-descent.netlify.app/

[3] https://blog.horaceg.xyz/posts/need-for-speed/

JadeNB1y ago

> It has two kind of problems: wave (the default) and linear. In the first case, the algorithm learns y = ax + b ; in the second, y = cos(ax + b).

Are "first" and "second" switched here?

wodenokoto1y ago· 1 in thread

Speaking of linear regression, can any of you recommend an online course or book that deep dives into fitting linear models?

lmpdev1y ago

Most intro to stats courses will do

I did the Stats I -> II -> II pipeline at uni but you should be fitting basic linear models by the end of Stats I

quercusa1y ago· 1 in thread

This (housing prices) example seems really familiar. Was it used in Andrew Ng's original Coursera ML class?

jwilber1y ago

Housing price examples in regression are much, much older than Ng’s ML class.

itissid1y ago· 1 in thread

Another way to approach the explanation is understanding the data generating process i.e. the statistical assumptions of the process that generates the data. That can go a long way to understanding _analytically_ if linear regression model is a good fit(or what to change in it to make it work). And — arguably more importantly — also a reason why we frame linear regression as a statistical problem instead of an optimization one(or an analytical OLS) in the first place. I would argue understanding it from a statistical standpoint provides much better intuition to a practitioner.

The reason to look at statistical assumptions, is because we want to make probabilistic/statistical statements about the response variable, like how much is its central tendency and how much it varies as values of X change. The response variable is not easy to measure.

Now, one can easily determine, for example using OLS(or gradient descent), the point estimates for parameters of a line that needs to be fit to two variables X and Y, without using any probability or statistical theory. OLS is, in point of fact, just an analytical result and has nothing to do with theory of statistics or inference. The assumptions of simple linear regression are statistical assumptions which can be right or wrong but if they hold, help us in making inferences, like:

  - Is the response variable varying uniformly over values of another r.v., X(predictors)?

  - Assuming an r.v. Y what model can we make if its expectation is a linear function.

So why do we make statistical assumptions instead of just point estimates? Because all points of measurements can’t be certain and making those assumptions it is one way of quantifying uncertainty.. Indeed, going through history one finds that Regression's use outside experimental data(Galton 1885) was discovered much after least squares(Newton 1795-1809). The fundamental reasons to understand natural variations in data was the original motivation. In Galton's case he wanted to study hereditary traits like wealth over generations as well as others like height, status, intelligence( coincidentally its also what makes the assumptions of linear regression a good tool for studying this: I think it's the idea of Regression to the mean; Very Wealthy or very pool families don't remain so over a families generations, they regress towards the mean. So is the case with Societal Class, Intelligence over generations)

When you follow this arc of reasoning, you come to the following _statistical_ conditions the data must satisfy for linear assumptions to work(ish):

Linear mean function of the response variable conditioned on a value of X

E[Y|X=x] = \beta_0+\beta_1*x

Constant Variance of the response variable conditioned on a value of X

Var[Y|X=x] = \sigma^2 (OR ACTUALLY JUST FINITE ALSO WORKS WELL)

itissid1y ago

When you frame it as an optimization problem, like by optimizing the squares loss or cross entropy, you have decided that your data generating process(DGP), i.e. Y is:

- A Binomial/Multinomial random variable, which gives you the the cross entropy like loss function.

- Is a Normal random variable, which gives you the squared loss.

This point is where many ML text books skip to directly. Its not wrong to do this, but this is a much more narrow intuition of how regression works!

But there is no reason Y needs to follow those two DGPs (The process could be a poisson or a mean reverting process)! There is no reason to believe prima-facie and apriori that the Y|X is following those assumptions. This also gives motivation for using other kinds of models.

Its why you test weather those statistical assumptions carefully first using a bit of EDA and from it comes some appreciation and understanding of how linear regression actually works.

brrrrrm1y ago

> When using least squares, a zero derivative always marks a minimum. But that's not true in general ... To tell the difference between a minimum and a maximum, you'd need to look at the second derivative.

It's interesting to continue the analysis into higher dimensions, which have interesting stationary points that require looking at the matrix properties of a specific type of second order derivative (the Hessian) https://en.wikipedia.org/wiki/Saddle_point

In general it's super powerful to convert data problems like linear regression into geometric considerations.

geye12341y ago

Mathematical ignoramus writing here, but I have a long-term project to correct my ignorance of statistics so this seems a good place to start.

He isn't talking about how to calculate the linear regression, correct? He's talking about why using squared distances between data points and our line is a preferred technique over using absolute distances. Also, he doesn't explain why absolute distances produce multiple results I think? These aren't criticisms, I am just trying to make sure I understand.

ISTM that you have no idea how good your regression formula (y = ax + c) is without further info. You may have random data all over the place, and yet you will still come out with one linear regression to rule them all. His house price example is a good example of this: square footage is, obviously, only one of many factors that influence price -- and also the most easily quantified factor by far. Wouldn't a standard deviation be essential info to include?

Also, couldn't the fact that squared distance gives us only one result actually be a negative, since it can so easily oversimplify and therefore cut out a whole chunk of meaningful information?

throwaway77831y ago

In the same vein, Karpathy's video series "Neural Networks from zero to hero"[0] touches upon a lot of this and intuitions as well. One of the best introductory series (even if you ignore the neural net part of it) and brushes on gradients, differentiation and what it means intuitively.

[0] https://youtu.be/VMj-3S1tku0?si=jq1cCSn5si17KK1o

jwilber1y ago

See another interactive article explaining linear regression and gradient descent: https://mlu-explain.github.io/linear-regression/

setgree1y ago

Nice, thanks for sharing! I shared this with my HS calculus teacher :) (My model is that his students should be motivated to get machine learning engineering jobs, so they should be motivated to learn calculus, but who knows.)

buss_jan1y ago

Very neat, didn't know about Deming Regression. Hard to imagine a case now where it wouldn't be the more appropriate method.

j / k navigate · click thread line to collapse

101 comments

64 comments · 19 top-level

jampekka1y ago· 10 in thread

The main practical reason why square error is minimized in ordinary linear regression is that it has an analytical solution. Makes it a bit weird example for gradient descent.

There are plenty of error formulations that give a smooth loss function, and many even a convex one, but most don't have analytical solutions so they are solved via numerical optimization like GD.

The main message is IMHO correct though: square error (and its implicit gaussian noise assumption) is all too often used just per convenience and tradition.

jbjbjbjb1y ago

soVeryTired1y ago

IMO that's the fundamental difference between statistics and ML. The culture of stats is about fitting a model and interpreting the fit, while the culture of ML is to treat the model as a black box.

That's one of the reasons that multicollinearity is seen as a big deal by statisticians, but ML practitioners couldn't give a hoot.

3 more replies

orlp1y ago

This isn't true. In practice people don't use the analytical solution for efficient linear regression, they use stochastic methods.

Square error is used because it is the maximum likelihood estimator under the assumption that observation noise is normally distributed, not because it is analytical.

em5001y ago

Gaussianity of errors is more of a post-hoc justification (which is often not even tested) for fitting with OLS.

jampekka1y ago

Even the most popular more complicted models like multilevel (linear) regression make use of the mathematical convenience of the square error, even though the solutions aren't fully analytical.

Square error indeed gives estimates for normally distributed noise, but as I said, this assumption is quite often implicit, and not even really well understood by many practitioners.

1 more reply

esafak1y ago

...because stochastic methods are implicit regularizers, leading to solutions that generalize better. Let's spell it out for those that don't know.

https://www.inference.vc/notes-on-the-origin-of-implicit-reg...

1 more reply

xadhominemx1y ago

That is incorrect. Least squares follows directly from the central limit theorem.

jampekka1y ago

I mention the relation to the gaussian distribution. Which part of the comment is incorrect?

1 more reply

easygenes1y ago

OLS is a straightforward way to introduce GD, and although an analytic solution exists it becomes memory and IO bound at sufficient scale, so GD is still a practical option.

jampekka1y ago

1 more reply

c7b1y ago· 7 in thread

[0] https://en.wikipedia.org/wiki/Quantile_regression

easygenes1y ago

c7b1y ago

lupire1y ago

ayhanfuat1y ago

Statistical Rethinking is quite good in explaining this stuff. https://xcelab.net/rm/

1 more reply

monkeyelite1y ago

Dont take the “for engineers” version.

> and "proving" theorems mechanically

I think you’ve have a bad experience because writing a proof is explaining deep understanding.

1 more reply

c7b1y ago

I'd say reading about statistics and being curious is a great start :)

levocardia1y ago

easygenes1y ago· 6 in thread

akst1y ago

And for introductory content there's always that risk if you provide to much information you overwhelm the reader, make them feel like maybe this is too hard for them.

Personally I find the process of building a model is a great way of learning all this.

jovial_cavalier1y ago

>I think a blog post... is probably the wrong place that kind of enlightenment

There's this site full of cool knowledgeable people called Hacker News which usually curates good articles with deep intuition about stuff like that. I haven't been there in years, though.

jfjfjtur1y ago

easygenes1y ago

Ah, we’re resorting to ad machinum today. :)

BlueUmarell1y ago

Any resource/link you know of that further develops your point?

easygenes1y ago

For gradients, Stanford CS229 [1] jumps right into it.

[0] https://www.stat.cmu.edu/~cshalizi/mreg/15/lectures/06/lectu...

[1] https://cs229.stanford.edu/lectures-spring2022/main_notes.pd...

1 more reply

stared1y ago· 5 in thread

I really recommend this explorable explanation: https://setosa.io/ev/ordinary-least-squares-regression/

And for actual gradient descent code, here is an older example of mine in PyTorch: https://github.com/stared/thinking-in-tensors-writing-in-pyt...

revskill1y ago

Google search is evil by not giving me those resources.

stared1y ago

sorcerer-mar1y ago

This is an all-time great blog post for this line alone: "That's why we have statistics: to make us unsure about things."

The interactive visualizations are a great bonus though!

Nifty39291y ago

Google does however provide this very nice course that explains these things in more detail: https://developers.google.com/machine-learning/crash-course

mhb1y ago

Kagi FTW?

1 more reply

tibbar1y ago· 4 in thread

levocardia1y ago

SubiculumCode1y ago

In relation to gradient descent, I do not know enough if multiple regression is at all relevant, or why not.

And yeah, for non-normal error distributions, we should be looking at generalized linear models, which allows one to specify other distributions that might better fit the data.

LPisGood1y ago

What you’re describing is the technique known as the “kernel trick”, correct?

levocardia1y ago

2 more replies

jascha_eng1y ago· 4 in thread

The amount of em dashes in this make this look very AI written. Which doesn't make it a bad piece but makes me more carefully check every sentence for errors.

liamwire1y ago

I know this is repeated ad nauseam by now, but as an ardent user of em dashes for many years pre-LLM, I think this a bad heuristic.

lucasfcostaOP1y ago

Co-author and founder of Briefer here.

I used to use em dashes before they were cool. I actually learned about them when I emailed a guy who's a software engineer at Genius and also writes for The New Yorker and The Atlantic.

I asked him for tips on how to write well and he recommended that I read Steven Pinker's "The Sense of Style", which uses em dashes exhaustively, and explains when and why one should use them.

It also pains me that I can't use them anymore or else people will think an AI did the writing.

1 more reply

nabeelahmed131y ago

As another ardent user I actually think it is a good but unfortunate heuristic.

jwilber1y ago

This tired take is in every thread now. The sort of behavior better served by a Reddit bot, and just as annoying.

reify1y ago· 3 in thread

All thats wrong with the modern world

https://www.ibm.com/think/topics/linear-regression

A proven way to scientifically and reliably predict the future

You can also use linear regression to provide better insights by uncovering patterns and relationships that your business colleagues might have previously seen and thought they already understood.

uniqueuid1y ago

While I get your point, it doesn't carry too much weight, because you can (and we often read this) claim the opposite:

ML and beyond, on the other hand, throws you in a whirl of hyperparameters that you no longer understand and which traps even clever people in overfitting that they don't understand.

Obligatory xkcd: https://xkcd.com/1838/

So a better critique, in my view, would be something that the JW Tukey wrote in his famous 1962 paper: (paraphrasing because I'm lazy):

"better to have an approximate answer to a precise question rather than an answer to an approximate question, which can always be made arbitrarily precise".

So our problem is not the tools, it's that we fool ourselves by applying the tools to the wrong problems because they are easier.

lupire1y ago

My maxim of statistics is that applied statistics is the art of making decisions under uncertainty, but people treat it like the science of making certainty out of data.

1 more reply

alexey-salmin1y ago

That particular xkcd was funny until the LLMs came around

2 more replies

sakras1y ago· 1 in thread

cloud-oak1y ago

Technically, physical springs will also have momentum and overshoot/oscillate. But even this is something that is used in practice, gradient descent with momentumg.

dalmo31y ago· 1 in thread

I don't have anything useful to say, but, how the hell is that a "12 min read"?

I always find those counters to greatly overestimate reading speed, but for a technical article like this it's outright insulting, to be honest.

Workaccount21y ago

It's the common trap of trying to teach, and why teaching is so much more difficult than it appears.

rogue71y ago· 1 in thread

I spent some time making it work with interpolation so that the transitions are smooth.

Then I expanded to another version, including a small neural network (nn) [1].

And finally, for the two functions that have a 2d parameter space, I included a viz of the loss [2]. You can click on the 2d space and get a new initial point for the descent, and see the trajectory.

Never really finished it, though I wrote a blog post about it [3]

[0] https://gradfront.pages.dev/

[1] https://f36dfeb7.gradfront.pages.dev/

[2] https://deploy-preview-1--gradient-descent.netlify.app/

[3] https://blog.horaceg.xyz/posts/need-for-speed/

JadeNB1y ago

> It has two kind of problems: wave (the default) and linear. In the first case, the algorithm learns y = ax + b ; in the second, y = cos(ax + b).

Are "first" and "second" switched here?

wodenokoto1y ago· 1 in thread

Speaking of linear regression, can any of you recommend an online course or book that deep dives into fitting linear models?

lmpdev1y ago

Most intro to stats courses will do

I did the Stats I -> II -> II pipeline at uni but you should be fitting basic linear models by the end of Stats I

quercusa1y ago· 1 in thread

This (housing prices) example seems really familiar. Was it used in Andrew Ng's original Coursera ML class?

jwilber1y ago

Housing price examples in regression are much, much older than Ng’s ML class.

itissid1y ago· 1 in thread

  - Is the response variable varying uniformly over values of another r.v., X(predictors)?

  - Assuming an r.v. Y what model can we make if its expectation is a linear function.

When you follow this arc of reasoning, you come to the following _statistical_ conditions the data must satisfy for linear assumptions to work(ish):

Linear mean function of the response variable conditioned on a value of X

E[Y|X=x] = \beta_0+\beta_1*x

Constant Variance of the response variable conditioned on a value of X

Var[Y|X=x] = \sigma^2 (OR ACTUALLY JUST FINITE ALSO WORKS WELL)

itissid1y ago

When you frame it as an optimization problem, like by optimizing the squares loss or cross entropy, you have decided that your data generating process(DGP), i.e. Y is:

- A Binomial/Multinomial random variable, which gives you the the cross entropy like loss function.

- Is a Normal random variable, which gives you the squared loss.

This point is where many ML text books skip to directly. Its not wrong to do this, but this is a much more narrow intuition of how regression works!

Its why you test weather those statistical assumptions carefully first using a bit of EDA and from it comes some appreciation and understanding of how linear regression actually works.

brrrrrm1y ago

In general it's super powerful to convert data problems like linear regression into geometric considerations.

geye12341y ago

Mathematical ignoramus writing here, but I have a long-term project to correct my ignorance of statistics so this seems a good place to start.

Also, couldn't the fact that squared distance gives us only one result actually be a negative, since it can so easily oversimplify and therefore cut out a whole chunk of meaningful information?

throwaway77831y ago

[0] https://youtu.be/VMj-3S1tku0?si=jq1cCSn5si17KK1o

jwilber1y ago

See another interactive article explaining linear regression and gradient descent: https://mlu-explain.github.io/linear-regression/

setgree1y ago

buss_jan1y ago

Very neat, didn't know about Deming Regression. Hard to imagine a case now where it wouldn't be the more appropriate method.

j / k navigate · click thread line to collapse