Why squared error? (2014)

(benkuhn.net)

256 points · rpbertp13 · 9y ago · 99 comments

73 comments · 23 top-level

bitL · 9y ago · 9 in thread

An honest question - do we even need statistics when we have machine learning? Statistics to me appears as a hack/aggregation of data we couldn't process at once in the past; these days ML + Big Data can achieve that, and we can do computational inference instead of statistics. To me this looks like looking back to "old ways" for a reference point instead of looking forward to the unknown but more exciting.

highd · 9y ago

Sorry you're getting down-voted, I don't think it's an unreasonable question.

In the sense I think you're using it, "statistics" are really methods for dimensionality reduction - we take means, and medians and standard deviations with the hopes that they will capture the parts of the data we care about. This is important for two reasons - for one, for anything even moderately high dimension we'll never have enough data to be able to forego some means of aggregation due to the "curse of dimensionality". Secondly, the human-machine interaction information bandwidth is annoyingly low, so we need some way to compress any information for human consumption. "Statistics" are one way we do so.

"Statistics" is also a field of study based around understanding how multiple data points relate to each other - that is of course critical to machine learning, and I think the terminology collision is why you're getting downvoted.

klodolph · 9y ago

Machine learning is often considered a statistical technique. The main difference seems to be that in traditional statistics, people derive practice from theory, whereas in ML people will try out techniques and figure out the theory later. That's really just a cultural difference. The techniques for analyzing ML models are all statistical to begin with.

Statistics, as a field, already used general-purpose optimization algorithms before modern ML techniques came about, so in that sense, ML just fits into an existing position in the statistical toolbox (like replacing a chisel with a 3D printer). In the other direction, statistical techniques like cross-validation are necessary for you to get your ML correct.

bitL · 9y ago

There is much more in ML than just statistics. I was basically asking why the "statistics filter" is so often on in ML. Neural networks don't seem to be a statistical technique, even if somebody uses them for regression. Yes, there is an overlap, but no, ML != statistics. As you mentioned, non-linear optimization is used in statistics on a meta-level; however, nobody claims statistics is operations research or vice versa.

2 more replies

_Wintermute · 9y ago

Yes we still need statistics. There is a huge overlap between machine learning methods and applied statistics, so much so that often there is not a clear distinction between the two.

j1vms · 9y ago

> between machine learning methods and applied statistics (...) often there is not a clear distinction between the two.

I would say applied statistics draws a line just prior to implementation concerns (say, real-world resource usage measured in time, space and energy) whereas these would be fully within scope and of interest in machine learning.

As an example, applied statistics could provide a useful approach to a vision/image recognition problem, and this approach might be provably unrealizable in practice using real-world execution units (e.g. CUDA cores). Nonetheless, it might still be a very worthwhile theoretical result in applied statistics, although of no immediate interest within ML except to hint at a potential new area of research.

joeyo · 9y ago

It may not be true for all branches of machine learning (fuzzy logic, for example?), but the vast majority of modern ML techniques are equivalent to or can be viewed as types of statistical machine learning.

Normal_gaussian · 9y ago

Good point about the fuzzy logic, often the boundaries between it and statistics are... fuzzy

Normal_gaussian · 9y ago

ML + Big Data are a specific application of statistics

To do anything beyond using tools other people have made (and never being sure whether results are meaningful or not), statistics are required

Of course, to make money from the ML boom you can probably get away with coincidence and correlation

bitL · 9y ago

Statistics means aggregate stuff and uses simplified characteristics out of semi-structured data. ML + Big Data allows you to ask precise questions like Where? How? Which ones?

1 more reply

throw_away_777 · 9y ago · 6 in thread

There is a Kaggle competition right now that uses mean absolute error (MAE), and this makes the problem substantially harder. For a practical discussion of techniques used to solve machine learning problems that use MAE, see the forums at: https://www.kaggle.com/c/allstate-claims-severity/forums

As touched upon in the article, the objective not being differentiable is a big deal for modern machine learning methods.

haeffin · 9y ago

Mean absolute error is differentiable almost everywhere. Having objectives that are not differentiable, but are differentiable almost everywhere is very common - in a deep net, if you have rectified linear activations (very common) or L1 regularisation (not unheard of), you have an objective that is not differentiable everywhere ... but the methods still work.

thanatropism · 9y ago

No it isn't.

Differentiability is important if you want to have a closed-form formula and derive it in front of undergraduates.

throw_away_777 · 9y ago

This is the difference between practice and theory. In theory differentiable objectives don't matter; in practice, for medium to large datasets, they make machine learning a lot faster. Speed is critical, as you need to be able to iterate quickly. The solution most commonly used on Kaggle is to transform the target feature and then minimize mean squared error, but there is some systematic uncertainty introduced by this.

hyperbovine · 9y ago

You can just use subgradient descent. Nonconvex loss would pose a bigger problem.
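
A minimal sketch of the subgradient idea, assuming the simplest possible model: a single constant c fitted to the data under mean absolute error. The function name `mae_subgradient_descent`, the step-size schedule, and the toy data are illustrative choices, not anything from the thread. Note that `np.sign` returns 0 at 0, which is a valid subgradient of |x| at that point.

```python
import numpy as np

def mae_subgradient_descent(y, c0, eta0=5.0, n_iter=10_000):
    """Minimise mean |y - c| over a constant c by subgradient descent."""
    y = np.asarray(y, dtype=float)
    c = float(c0)
    for t in range(1, n_iter + 1):
        # A subgradient of mean|y - c| with respect to c.
        g = -np.mean(np.sign(y - c))
        # Diminishing step size, a standard choice for subgradient methods.
        c -= (eta0 / np.sqrt(t)) * g
    return c

y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
c_hat = mae_subgradient_descent(y, c0=y.mean())
print(c_hat, np.median(y))  # the iterate settles near the median
```

The iterates oscillate around the minimiser with an amplitude that shrinks with the step size, which is why a diminishing schedule is used here rather than a fixed step.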

thomasahle · 9y ago

> As touched upon in the article, the objective not being differentiable is a big deal for modern machine learning methods.

I'm not sure the absolute value is a big problem here. You still get a convex optimization problem. In neural networks a lot of people use ReLU or step activation functions, which are no more differentiable than the absolute value.

nightcracker · 9y ago

What exactly would go wrong if you assume that the derivative is zero at x = 0?

And aren't exact zeroes an error scenario for most machine learning models anyway?

dnautics · 9y ago · 5 in thread

"inner products/gaussians" - the absolute value (and also cuberoot of absolute cubes, fourth root of fourth powers) also define inner products. Likewise, there are "gaussian-like formulas" which take these powers instead of squared.

However: if you look at the shape of the squareroot of sum squares, it's a circle, so you can rotate it. If you take the absolute, it's a square, so that cannot be rotated; the cuberoot of cubes and fourthroot of fourths, etc. look like rounded edge squares, and that cannot be rotated either, so if you have a change of vector basis, you're out of luck.

With the gaussian forms of other powers, none of them have the central limit property.

grodeni · 9y ago

What kind of inner products are defined by the absolute value, cuberoot of absolute cubes, fourth root of fourth powers? I never heard of that and would be glad to learn about it.

ska · 9y ago

You may find it interesting to read about Lp norms, and their relationship to inner products on vector spaces. I think the OP is mixing up norm and inner product terminology. This happens often because you can derive a norm from any inner product, but a given norm need not come from an inner product.

If you plot on the plane the distance = 1 line, then L_1 gives you a diamond, L_2 a circle, L_inf a square. [More precisely, the unit circle under the related metric (distance function) looks like those euclidean shapes]

Chinjut · 9y ago

They don't give inner products, but they do give norms. But inner products are, in some ways, more convenient than general norms, hence squared error as opposed to other things. It's not that squared error is necessarily what you fundamentally care about; it just happens to be so conveniently analyzed, because the mathematics of inner products is convenient.

lordnacho · 9y ago

It's possible he means Lp norms?

https://en.wikipedia.org/wiki/Norm_(mathematics)

dnautics · 9y ago

hah whoops! I did confuse inner products with norms. But it is true that the L2 norm is the only one that survives transformations to arbitrary unit basis vectors.
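
A small numerical check of the rotation point, as a sketch only: the helper `rotate` and the choice of a 45-degree rotation of the vector (1, 0) are assumptions made for illustration. It compares the L1, L2, and L-infinity norms before and after the rotation.

```python
import numpy as np

def rotate(v, theta):
    """Rotate a 2-D vector counter-clockwise by theta radians."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]]) @ v

v = np.array([1.0, 0.0])
w = rotate(v, np.pi / 4)  # the same vector after a 45-degree rotation

for name, p in [("L1", 1), ("L2", 2), ("Linf", np.inf)]:
    # Only the L2 value is unchanged by the rotation.
    print(name, np.linalg.norm(v, ord=p), np.linalg.norm(w, ord=p))
```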

eanzenberg · 9y ago · 4 in thread

Why squared error? Because you can solve the equation to minimize squared error using linear algebra in closed form.

Why L2 regularization? Same reason. A closed form solution exists from linear algebra.

But at the end of the day, you are most interested in the expectation value of the coefficient and minimizing the squared error gives you E[coeffs] which is the mean of the coefficients.
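
The closed form referred to here can be sketched with the normal equations. This is a minimal illustration, not a recommended production solver: the function name `ridge_closed_form` and the toy data are hypothetical, and for simplicity the penalty is applied to the intercept as well.

```python
import numpy as np

def ridge_closed_form(X, y, lam=0.0):
    """Solve min_w ||X w - y||^2 + lam * ||w||^2 via the normal equations."""
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

# Toy data: y = 1 + 2 x exactly; the first column of X is the intercept.
x = np.arange(5, dtype=float)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x

w_ols = ridge_closed_form(X, y)             # lam = 0: ordinary least squares
w_ridge = ridge_closed_form(X, y, lam=1.0)  # lam > 0: coefficients shrink
print(w_ols, w_ridge)
```

With `lam=0` this is ordinary least squares; with `lam>0` it is L2-regularized (ridge) regression, and the same one-line linear solve covers both.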

bo1024 · 9y ago

I don't think this is any more convincing than the article's reasons. There are closed forms to lots of things that aren't interesting.

srean · 9y ago

I cannot speak for eanzenberg but I think his comment was less about his personal justification and more about the rationalizations that have been used in the history of stats.

Gauss quite openly admitted that the choice was borne out of convenience. The justification using Normal or Gaussian distribution came later and the Gauss Markov result on conditional distribution came even later.

Even at the time when Gauss proposed the loss, it was noted by many of Gauss's peers (and perhaps by Gauss himself) that other loss functions seem more appropriate if one goes by empirical performance, in particular the L1 distance.

Now that we have the compute power to deal with L1 it has come back with a vengeance and people have been researching its properties with renewed earnestness. In fact there is a veritable revolution going on right now in the ML and stats world around it.

Just as optimizing the squared loss gives you the conditional expectation, minimizing the L1 error gives you the conditional median. The latter is to be preferred when the distribution has a fat tail, or is corrupted by outliers. This knowledge is nowhere close to being new. Gauss's peers knew this.
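
The mean-versus-median statement can be checked numerically in the unconditional case. A minimal sketch, assuming a brute-force grid search over constant predictions; the data and the grid are made up for illustration.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])     # one large outlier
candidates = np.linspace(0.0, 100.0, 10_001)  # grid with spacing 0.01

# Total loss of predicting the constant c for every data point.
sq_loss = ((y[None, :] - candidates[:, None]) ** 2).sum(axis=1)
abs_loss = np.abs(y[None, :] - candidates[:, None]).sum(axis=1)

best_sq = candidates[np.argmin(sq_loss)]
best_abs = candidates[np.argmin(abs_loss)]
print(best_sq, y.mean())       # squared loss is minimised at the mean
print(best_abs, np.median(y))  # absolute loss is minimised at the median
```

The single outlier drags the squared-loss minimiser far from the bulk of the data, while the absolute-loss minimiser stays with the bulk.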

3 more replies

eanzenberg · 9y ago

I think just historically it's interesting. Every statistician was using OLS before computers because they could solve it with pen and paper, so when computers came out it was ported over. But with computers you can minimize any loss function.

However it is useful to have a closed form solution because it guarantees you actually minimized it. Other strategies to minimize functions don't guarantee that but they're still extremely useful.

lottin · 9y ago

> Because you can solve the equation to minimize squared error using linear algebra in closed form.

Exactly right. It has nothing to do with probability distributions.

tvural · 9y ago · 4 in thread

The best explanation is probably that squared error gives you the best fit when you assume your errors are normally distributed.

Things like the fact that squared error is differentiable are actually irrelevant - if the best model is not differentiable, you should still use it.

highd · 9y ago

"if the best model is not differentiable, you should still use it."

I'm not sure I would say that - neural nets are "near everywhere differentiable", for example. Without differentiability we're stuck with, for example, discrete GAs for optimization, and you can throw all your intuition out the window (not to mention training/learning efficiency).

gabrielgoh · 9y ago

A few misconceptions I should correct in this comment.

- There is plenty of existing technology for handling non-differentiable functions. Functions like the absolute value, 2-norm, and so on have a generalization of the gradient (the subgradient) which can be used in lieu of the gradient.

- The idea that functions being "almost everywhere differentiable" (i.e. the non-differentiability lies in a manifold of zero measure) makes them behave pretty much like smooth ones is often not true, as optima often conspire to lie exactly on these nonsmooth manifolds.

2 more replies

throw_away_777 · 9y ago

The fact that squared error is differentiable is not irrelevant. You can solve some machine learning models faster with differentiable objectives (most notably xgboost). Speed is important: you need to optimize your models, and the longer it takes to run a model the fewer things you can try.

eanzenberg · 9y ago

Regardless of how distributed the errors are, the squared error fit will provide the expectation value of the variable, which is the mean. It will say nothing of the error of the mean it calculates.

shawnz · 9y ago · 4 in thread

I am no math expert, but I have always thought about it like this. The squared error is like weighting the error by the error. This causes one big error to be more significant than many small errors, which is usually what you want. Am I on the right track?

robotresearcher · 9y ago

> This causes one big error to be more significant than many small errors,

That's correct.

> which is usually what you want

Unless you have outliers, in which case it's what you don't want. So you add e.g. a Huber loss function to reach a compromise.
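
A minimal sketch of the Huber loss in its usual textbook form: quadratic for small residuals, linear for large ones. The function name `huber_loss`, the threshold `delta=1.0`, and the sample residuals are illustrative assumptions.

```python
import numpy as np

def huber_loss(residual, delta=1.0):
    """Quadratic for |r| <= delta, linear beyond it."""
    r = np.abs(np.asarray(residual, dtype=float))
    quadratic = 0.5 * r ** 2
    linear = delta * (r - 0.5 * delta)
    return np.where(r <= delta, quadratic, linear)

residuals = np.array([0.5, 3.0, -10.0])
print(huber_loss(residuals))  # grows linearly for the large residuals
print(0.5 * residuals ** 2)   # squared error grows quadratically
```

The two branches agree in value and slope at |r| = delta, so the loss stays differentiable while limiting the influence of large residuals.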

dajohnson89 · 9y ago

I just thought it was to give positive and negative error values the same treatment. Moreover I think that it's debatable that one big error is more important than many small errors. That is conceivably a bad strategy, in some cases -- if most points have low error, do you really want to penalize your candidate function for having a very few bad outliers? To me that is no better than giving extra favor to a few points that happen to have low error.

tomp · 9y ago

No, that's exactly why absolute error is better. "Big errors" are called outliers, they're (relatively) rare, often caused by bad data (measurement errors, typos, etc.) and substantially influence the outcome of your calculation. In other words, squared error is less robust.

But squared error is easier to compute. So, in practice, what you do is you remove outliers (e.g. cap the data at +-3sigma) then use squared error.

amelius · 9y ago

> So, in practice, what you do is you remove outliers (e.g. cap the data at +-3sigma) then use squared error.

But if you are say fitting a function to the data, you can't tell beforehand which data-points are the outliers. So in that case perhaps you need an iterative approach of removing them (?)
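
One way such an iterative approach could look is sigma clipping: fit, drop points whose residual exceeds a multiple of the residual standard deviation, and refit until nothing changes. This is a sketch under that assumption, not a method anyone in the thread specified; the function name `sigma_clipped_line_fit` and the toy data are hypothetical.

```python
import numpy as np

def sigma_clipped_line_fit(x, y, n_sigma=3.0, max_iter=10):
    """Fit y ~ slope * x + intercept, repeatedly dropping large residuals."""
    keep = np.ones(len(x), dtype=bool)
    for _ in range(max_iter):
        slope, intercept = np.polyfit(x[keep], y[keep], deg=1)
        resid = y - (slope * x + intercept)
        sigma = resid[keep].std()
        if sigma < 1e-9:  # fit is already essentially exact
            break
        new_keep = np.abs(resid) <= n_sigma * sigma
        if np.array_equal(new_keep, keep):  # no change: converged
            break
        keep = new_keep
    return slope, intercept, keep

x = np.arange(20, dtype=float)
y = 2.0 * x + 1.0
y[5] += 100.0  # a single gross outlier

slope, intercept, keep = sigma_clipped_line_fit(x, y)
print(slope, intercept, np.flatnonzero(~keep))
```

The first fit is still an ordinary squared-error fit, so a cluster of outliers can inflate sigma enough to hide itself; robust losses such as Huber or absolute error avoid that dependence on the initial fit.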

gpsx · 9y ago · 3 in thread

I think the good reason for minimizing the square of the errors is that, assuming your data has a gaussian probability distribution, minimizing the squared error corresponds to maximizing the likelihood of the measurement, as you and others have said.

Why do we assume gaussian errors? There is seldom a gaussian distribution in the real world, usually because the probability of large error values does not decay that fast. We use it because the math is easy and we can actually solve the problem assuming that.
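
The correspondence between Gaussian likelihood and squared error can be written out in a few lines. This is a sketch assuming independent errors with a common, known variance; the symbols f_theta and sigma are notation introduced here for illustration.

```latex
\begin{align*}
y_i &= f_\theta(x_i) + \varepsilon_i, \qquad
  \varepsilon_i \sim \mathcal{N}(0, \sigma^2) \ \text{i.i.d.} \\
\log L(\theta)
  &= \sum_{i=1}^{n} \log\left[ \frac{1}{\sqrt{2\pi\sigma^2}}
     \exp\!\left( -\frac{(y_i - f_\theta(x_i))^2}{2\sigma^2} \right) \right] \\
  &= -\frac{n}{2} \log(2\pi\sigma^2)
     - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - f_\theta(x_i))^2 \\
\arg\max_\theta \log L(\theta)
  &= \arg\min_\theta \sum_{i=1}^{n} (y_i - f_\theta(x_i))^2
\end{align*}
```

The first term does not depend on theta, so maximizing the likelihood and minimizing the sum of squared errors pick out the same parameters.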

klodolph · 9y ago

That's a summary of the article.

gpsx · 9y ago

Yes, sort of. But I think he says a lot of unnecessary things not getting at the root of the issue.

I left out some detail I should have said, like what is so special about a gaussian that makes the math easy. So I will say it.

A measurement can infer a probability distribution for what the measured quantity is. A second measurement, on its own, also infers some probability distribution for what the measured quantity is. If we consider both measurements together, we get yet another probability distribution for what the measured quantity is. The magic is that if we had a gaussian distribution for the measurements, then the distribution for the combined measurements is also a gaussian. This is not true in general. As long as we have gaussian distributions we can do all the operations we want and the probability distributions are gaussian and can be fully described by a center point and a width. (Forgive me for the liberties I am taking here.) The basic alternative to exactly solving the problem is to actually try to carry around the probability distribution functions, which is not practical even with very powerful computers.
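
A minimal sketch of what "center point and width" bookkeeping looks like when two independent Gaussian measurements of the same quantity are combined, assuming no prior information beyond the measurements. The function name `combine_gaussian_measurements` is hypothetical.

```python
def combine_gaussian_measurements(mean1, var1, mean2, var2):
    """Fuse two independent Gaussian measurements of the same quantity."""
    # Precisions (inverse variances) add.
    precision = 1.0 / var1 + 1.0 / var2
    var = 1.0 / precision
    # The combined centre is the precision-weighted average.
    mean = var * (mean1 / var1 + mean2 / var2)
    return mean, var

print(combine_gaussian_measurements(0.0, 1.0, 2.0, 1.0))  # (1.0, 0.5)
```

The result is again described by just two numbers, and the combined variance is smaller than either input variance.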

tpeo · 9y ago

I'm sorry, but what do you mean by "decay"?

You're talking about fat tails?

graycat · 9y ago · 3 in thread

I asked that early in my career.

We want a metric essentially because if we converge or have a good approximation in the metric then we are close in some important respects.

Squared error, then, gives one such metric.

But for some given data, usually there are several metrics we might use, e.g., absolute error (L^1), worst case error (L^infinity), L^p for positive integer p, etc.

From 50,000 feet up, the reason for using squared error is that we get to have the Pythagorean theorem, and, more generally, get to work in a Hilbert space, a relatively nice place to be, e.g., we also get to work with angles from inner products, correlations, and covariances -- we get cosines and a version of the law of cosines. E.g., we get to do orthogonal projections which give us minimum squared error.

With Hilbert space, commonly we can write the total error as a sum of contributions from orthogonal components, that is, decompose the error into contributions from those components -- nice.

The Hilbert space we get from squared error gives us the nicest version of Fourier theory, that is, orthogonal representation and decomposition, best squared error approximation.

We also like Fourier theory with squared error because of how it gives us the Heisenberg uncertainty principle.

Under meager assumptions, for real valued random variables X and Y, E[Y|X], a function of X, is the best squared error approximation of Y by a function of X.

Squared error gives us variance, and in statistics sample mean and variance are sufficient statistics for the Gaussian; that is, for statistics, for Gaussian data, we can take the sample mean and sample variance, throw away the rest of the data, and do just as well.

For more, convergence in squared error can imply convergence almost surely at least for a subsequence.

Then there is the Hilbert space result, every nonempty, closed, convex subset has a unique element of minimum norm (from squared error) -- nice.

srean · 9y ago

Ah but square error is not a metric, its square root is a metric.

Many nice properties of the square loss (in fact un-fucking-believably nice properties) stem not from the fact that its square root is a metric but from the fact that it is a Bregman divergence. Another oft used 'divergence' in this class is KL divergence or cross-entropy.

Bregman introduced this class purely as a machinery to solve convex optimization problems. His motivation was to generalize the method of alternating projection to spaces other than a Hilbert space. But it so turned out that Bregman divergences are intimately connected with the exponential family class of distributions, also called the Pitman, Darmois, Koopman class of distributions. If one is caught unprepared, it takes some wracking of the brain to come up with a parametric family that does not belong in this class; almost all parametric families used in stats (barring a few) belong to it.

One may again ask why this class is so popular in probability and statistics. The answer is again convenience: they are almost as easy as Gaussians to work with, they have well behaved sufficient statistics, and their stochastic completion gives you the entire space of 'regular' enough distributions with finite dimensional parameterizations.

You mentioned conditional expectation. So one may ask what are the loss functions that are minimized by conditional expectation. Bregman divergences are that entire class. Of course square loss satisfies it too (more importantly L2 metric on its own does not, it is the act of squaring it which does this).
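
For reference, the standard definition of a Bregman divergence and the two cases mentioned in this comment can be written as follows. This is a sketch in notation chosen here: phi is a differentiable, strictly convex function, and p, q are probability vectors.

```latex
\begin{align*}
D_\varphi(x, y) &= \varphi(x) - \varphi(y) - \langle \nabla\varphi(y),\, x - y \rangle \\[4pt]
\varphi(x) = \|x\|^2:\quad
D_\varphi(x, y) &= \|x\|^2 - \|y\|^2 - 2\langle y,\, x - y \rangle
                 = \|x - y\|^2 \\[4pt]
\varphi(p) = \sum_i p_i \log p_i:\quad
D_\varphi(p, q) &= \sum_i p_i \log\frac{p_i}{q_i} - \sum_i (p_i - q_i)
                 = \mathrm{KL}(p \,\|\, q)
                 \quad \text{when } \sum_i p_i = \sum_i q_i = 1
\end{align*}
```

Squared error and KL divergence are thus two members of the same family, generated by different convex functions.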

Very interesting stuff (at least to me)

graycat · 9y ago

> Ah but square error is not a metric, its square root is a metric.

Yes, I was using "squared error" because the OP was. What I wrote was modulo a square root missing here and there!

neutralid · 9y ago

This is very interesting. Thank you.

What book would you recommend for this discussion?

2 more replies

fiatjaf · 9y ago · 3 in thread

Why geometric mean?, I would ask.

eanzenberg · 9y ago

Because it's very useful for symmetric distributions. If the distribution is highly non-symmetric, then the mean != maximum likelihood, which is probabilistic.

thomasahle · 9y ago

"Why addition?", I would ask.

Different problems, different tools. You can't ask "why geometric mean?" without referring to a specific problem you're trying to solve.

fiatjaf · 9y ago

What is a problem the geometric mean solves? That was my question the entire time.

When people ask "why machine learning?" the answers are "machine learning can do these things blablabla", not "you must specify the problem you're trying to solve".

2 more replies

dschiptsov · 9y ago · 3 in thread

To make it positive and to amplify it (as a side-effect).

BTW, "error" is a misleading term - it communicates some fault, at least in the common sense. Distance would be a much better term.

So, "squared distance" makes much more sense, because negative distance is nonsense.

tonyedgecombe · 9y ago

Well it will only amplify values > 1.

esrauch · 9y ago

That's not correct. Even though the magnitude of each value in isolation shrinks, the relative magnitudes are still amplified, which is what matters.

Consider the values 1/2 and 1/4: in the original space one is double the other, but in the squared space they become 1/4 and 1/16, so the ratio is 4x. Also relevant: if you compare e.g. 0.9 and 1, the gap between them is amplified after squaring (0.1 becomes 0.19).

1 more reply

klodolph · 9y ago

That's a compelling case for why we should not use "distance", because distance cannot be negative, but the error term can.

Just look at bog-standard linear regressions, say Y_i = m X_i + b + ε_i. It makes no sense to call the ε_i terms "distance".

j7ake · 9y ago · 2 in thread

The Bayesian formulation for the likelihood function would make this squared error explicitly clear.

stared · 9y ago

For Gaussian uncertainty. Which still makes it a much more natural assumption than any other I know.

klodolph · 9y ago

Bayesian formulations are not necessary: the Gaussian is the maximum entropy distribution for a known mean and variance (expected squared error).

jostmey · 9y ago · 2 in thread

Why not KL-Divergence, which measures the error between a target distribution and the current distribution? From the perspective of Information Theory, it is the best error measurement.

Oh, and let's not forget that for a lot of problems minimizing the KL-divergence is the exact same operation as maximizing the likelihood function.

enthdegree · 9y ago

kl divergence has no nice theoretical properties other than 'it is the answer to these questions'

it is also extremely poorly behaved numerically and in convergence

srean · 9y ago

I am sorry but I have to call bullshit on this.

To give just a taste of the nice properties of KL: if you are using a one-layer NN with the sigmoid function as the transform, using square loss gives you an explosion of local minima. OTOH using KL in its place would have given you none. Numerical accuracy is pretty much a non-issue; people have known how to handle KL numerically for the last 40 or so years.

BTW using KL on equal-variance Gaussians gives you square loss, apparently the loss you prefer.
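
The last point can be checked against the standard closed form for the KL divergence between two univariate Gaussians. A short sketch, with the means and standard deviations named mu_1, mu_2, sigma_1, sigma_2 for illustration:

```latex
\begin{align*}
\mathrm{KL}\big(\mathcal{N}(\mu_1, \sigma_1^2) \,\|\, \mathcal{N}(\mu_2, \sigma_2^2)\big)
  &= \log\frac{\sigma_2}{\sigma_1}
     + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2}
     - \frac{1}{2} \\[4pt]
\sigma_1 = \sigma_2 = \sigma:\quad
\mathrm{KL}\big(\mathcal{N}(\mu_1, \sigma^2) \,\|\, \mathcal{N}(\mu_2, \sigma^2)\big)
  &= \frac{(\mu_1 - \mu_2)^2}{2\sigma^2}
\end{align*}
```

With equal variances the divergence is the squared difference of the means, up to the constant factor 1/(2 sigma^2).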

1 more reply

thomasahle · 9y ago · 1 in thread

It's fine to list some reasons for using squared error, but you really can't decide on the error function without referring to a problem you're trying to solve.

Just look at the success of compressed sensing, based on taking the absolute value error seriously.

Sean1708 · 9y ago

Which is basically the entire message of the last section.

heisenbit · 9y ago · 1 in thread

Square often corresponds to power in systems.

heisenbit · 9y ago

I noticed this got voted up and down more than usual. Maybe a little elaboration:

Square often corresponds to power/energy in systems AND energy (integral of power) is preserved. That relationship between physics and math allows a lot of useful transformations.

kazinator · 9y ago

Squared error represents the underlying belief that errors in various dimensions, or errors in independent samples, are linearly independent. So they add together like orthogonal vectors, forming a vector whose length is the square root of the sum of the squares. Minimizing the square error is a way of minimizing that square root without the superfluous operation of calculating it.

TeMPOraL · 9y ago

My explanation for squared error in linear approximation always was: because it minimizes the thickness of the line that passes through all the data points.

(Per the old math joke - you can make a line passing through any three points on a plane if you make it thick enough.)

theophrastus · 9y ago

Or why use variances when there are standard deviations (the square root of the variance) which have more easily interpreted units? One commonly cited reason is that one can sum variances from different factors, which one cannot do with standard deviations. There are other properties of variances which make them more suitable for continued calculations[1]. This is why, for instance, variances are often utilized in automated optimization packages.
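
The additivity claim can be checked exactly on a tiny example. A minimal sketch, assuming two independent discrete variables that are each uniform on two values; the specific values are made up.

```python
from itertools import product
import numpy as np

# Two independent discrete variables, each uniform on its support.
x_values = np.array([0.0, 1.0])
y_values = np.array([0.0, 2.0])

# Independence: every (x, y) pair is equally likely.
sums = np.array([x + y for x, y in product(x_values, y_values)])

var_x, var_y, var_sum = x_values.var(), y_values.var(), sums.var()
print(var_sum, var_x + var_y)                       # variances add
print(sums.std(), x_values.std() + y_values.std())  # standard deviations do not
```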

[1] https://en.wikipedia.org/wiki/Variance#Properties

bagrow · 9y ago

Interesting discussion. Not sure about the breakdown between ridge regression and LASSO though. The difference is not in the error term but in the regularization term.

thisrod · 9y ago

Squared error because the uncertainties in independent, normally distributed random variables add in quadrature. I expect that this could be proved geometrically using Pythagoras's theorem, so in that sense the comments about orthogonal axes are vaguely on the right track.

Normally distributed variables because of the central limit theorem.

It isn't all that complicated.

highd · 9y ago

Another pro tip - absolute error magnitude (the l_1 norm) is the convex envelope of the non-zero entry count for vectors (the l_0 "norm" in some circles) on the unit l_inf ball. So in the convex minimization context (and for most other smooth loss terms in general) you end up with solutions with more zero entries and a few possibly large non-zero entries.
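
The sparsity effect shows up already in the one-dimensional penalized problems, whose minimisers have simple closed forms. A minimal sketch; the function names `soft_threshold` and `ridge_shrink` and the input vector are illustrative.

```python
import numpy as np

def soft_threshold(v, lam):
    """Minimiser of 0.5 * (w - v)**2 + lam * |w|, applied elementwise."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def ridge_shrink(v, lam):
    """Minimiser of 0.5 * (w - v)**2 + 0.5 * lam * w**2, applied elementwise."""
    return v / (1.0 + lam)

v = np.array([3.0, -0.5, 1.0, 0.2])
print(soft_threshold(v, 1.0))  # small entries become exactly zero
print(ridge_shrink(v, 1.0))    # every entry shrinks, none reaches zero
```

The absolute-value penalty sets entries below the threshold exactly to zero, whereas the squared penalty only rescales them.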

adamzerner · 9y ago

Also see http://www.leeds.ac.uk/educol/documents/00003759.htm.

redcalx · 9y ago

Somewhat related; here's my attempt at explaining Cross Entropy:

http://heliosphan.org/cross-entropy.html

jayajay · 9y ago

cause linear algebra is a beautiful framework to think in.

j / k navigate · click thread line to collapse

99 comments

73 comments · 23 top-level

bitL9y ago· 9 in thread

highd9y ago

Sorry you're getting down-voted, I don't think it's an unreasonable question.

klodolph9y ago

bitL9y ago

2 more replies

_Wintermute9y ago

Yes we still need statistics. There is a huge overlap between machine learning methods and applied statistics, so much so that often there is not a clear distinction between the two.

j1vms9y ago

> between machine learning methods and applied statistics (...) often there is not a clear distinction between the two.

joeyo9y ago

Normal_gaussian9y ago

Good point about the fuzzy logic, often the boundaries between it and statistics are... fuzzy

Normal_gaussian9y ago

ML + Big Data are a specific application of statistics

To to do anything beyond use tools other people have made (and never be sure whether results are meaningful or not) statistics are required

Of course, to make money from the ML boom you can probably get away with coincidence and correlation

bitL9y ago

Statistics means aggregate stuff and uses simplified characteristics out of semi-structured data. ML + Big Data allows you to ask precise questions like Where? How? Which ones?

1 more reply

throw_away_7779y ago· 6 in thread

As touched upon in the article, the objective not being differentiable is a big deal for modern machine learning methods.

haeffin9y ago

thanatropism9y ago

No it isn't.

Differentiability is important if you want to have an closed-form formula and derive it in front of undergraduates.

throw_away_7779y ago

hyperbovine9y ago

You can just use subgradient descent. Nonconvex loss would pose a bigger problem.

thomasahle9y ago

> As touched upon in the article, the objective not being differentiable is a big deal for modern machine learning methods.

nightcracker9y ago

What exactly would go wrong if you assume that the derivative is zero at x = 0?

And aren't exact zeroes an error scenario for most machine learning models anyway?

dnautics9y ago· 5 in thread

With the gaussian forms of other powers, none of them have the central limit property.

grodeni9y ago

What kind of inner products are defined by the absolute value, cuberoot of absolute cubes, fourth root of fourth powers? I never heard of that and would be glad to learn about it.

ska9y ago

Chinjut9y ago

lordnacho9y ago

It's possible he means Lp norms?

https://en.wikipedia.org/wiki/Norm_(mathematics)

dnautics9y ago

hah whoops! I did confuse inner products with norms. But it is true that the L2 norm is the only one that survives transformations to arbitrary unit basis vectors.

eanzenberg9y ago· 4 in thread

Why squared error? Because you can solve the equation to minimize squared error using linear algebra in closed form.

Why L2 regularization? Same reason. A closed form solution exists from linear algebra.

But at the end of the day, you are most interested in the expectation value of the coefficient and minimizing the squared error gives you E[coeffs] which is the mean of the coefficients.

bo10249y ago

I don't think this is any more convincing than the article's reasons. There are closed forms to lots of things that aren't interesting.

srean9y ago

I cannot speak for eanzenberg but I think his comment was less about his personal justification and more about the rationalizations that have been used in the history of stats.

3 more replies

eanzenberg9y ago

However it is useful to have a closed form solution because it guarantees you actually minimized it. Other strategies to minimize functions don't guarantee that but they're still extremely useful.

lottin9y ago

> Because you can solve the equation to minimize squared error using linear algebra in closed form.

Exactly right. It has nothing to do with probability distributions.

tvural9y ago· 4 in thread

The best explanation is probably that squared error gives you the best fit when you assume your errors should normally distributed.

Things like the fact that squared error is differentiable are actually irrelevant - if the best model is not differentiable, you should still use it.

highd9y ago

"if the best model is not differentiable, you should still use it."

gabrielgoh9y ago

A few misconceptions I should correct in this comment.

2 more replies

throw_away_7779y ago

eanzenberg9y ago

Regardless of how distributed the errors are, the squared error fit will provide the expectation value of the variable, which is the mean. It will say nothing of the error of the mean it calculates.

shawnz9y ago· 4 in thread

robotresearcher9y ago

> This causes one big error to be more significant than many small errors,

That's correct.

> which is usually what you want

Unless you have outliers, in which case it's what you don't want. So you add e.g. a Huber loss function to reach a compromise.

dajohnson899y ago

tomp9y ago

But squared error is easier to compute. So, in practice, what you do is you remove outliers (e.g. cap the data at +-3sigma) then use squared error.

amelius9y ago

> So, in practice, what you do is you remove outliers (e.g. cap the data at +-3sigma) then use squared error.

But if you are, say, fitting a function to the data, you can't tell beforehand which data points are the outliers. So in that case perhaps you need an iterative approach of removing them (?)
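One way such an iterative approach could look, as a rough sketch rather than a recommended method: fit, drop points whose residual exceeds k standard deviations, and refit until nothing changes. The names `fit_line` and `fit_with_clipping`, the stopping rules, and the data are all assumptions made for illustration.

```python
from statistics import pstdev


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = m*x + b."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx
    return (n * sxy - sx * sy) / det, (sy * sxx - sx * sxy) / det


def fit_with_clipping(xs, ys, k=3.0, max_iter=10):
    """Refit repeatedly, dropping points whose residual exceeds k sigma."""
    pts = list(zip(xs, ys))
    m, b = fit_line(xs, ys)
    for _ in range(max_iter):
        residuals = [y - (m * x + b) for x, y in pts]
        sigma = pstdev(residuals)
        kept = [p for p, r in zip(pts, residuals) if abs(r) <= k * sigma]
        # Stop when nothing was dropped or too few points would remain.
        if len(kept) == len(pts) or len(kept) < 3:
            break
        pts = kept
        m, b = fit_line([x for x, _ in pts], [y for _, y in pts])
    return m, b, pts


xs = list(range(10))
ys = [2 * x + 1 for x in xs]
ys[5] += 50  # one gross outlier on an otherwise exact line
m, b, kept = fit_with_clipping(xs, ys, k=2.0)
print(m, b, len(kept))
```

The example uses k=2 because with only ten points no residual can exceed 3 population standard deviations, so a 3-sigma rule would never drop anything.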

gpsx9y ago· 3 in thread

klodolph9y ago

That's a summary of the article.

gpsx9y ago

Yes, sort of. But I think he says a lot of unnecessary things not getting at the root of the issue.

I left out some detail I should have included, like what is so special about a Gaussian that makes the math easy. So I will say it.

tpeo9y ago

I'm sorry, but what do you mean by "decay"?

You're talking about fat tails?

graycat9y ago· 3 in thread

I asked that early in my career.

We want a metric essentially because if we converge or have a good approximation in the metric then we are close in some important respects.

Squared error, then, gives one such metric.

But for some given data, usually there are several metrics we might use, e.g., absolute error (L^1), worst case error (L^infinity), L^p for positive integer p, etc.

With Hilbert space, commonly we can write the total error as a sum of contributions from orthogonal components, that is, decompose the error into contributions from those components -- nice.

The Hilbert space we get from squared error gives us the nicest version of Fourier theory, that is, orthogonal representation and decomposition, best squared error approximation.

We also like Fourier theory with squared error because of how it gives us the Heisenberg uncertainty principle.

Under meager assumptions, for real valued random variables X and Y, E[Y|X], a function of X, is the best squared error approximation of Y by a function of X.

For more, convergence in squared error can imply convergence almost surely at least for a subsequence.

Then there is the Hilbert space result, every nonempty, closed, convex subset has a unique element of minimum norm (from squared error) -- nice.
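The claim above about E[Y|X] can be spelled out in one line, assuming Y and g(X) have finite second moments and g is any candidate function of X. The cross term vanishes because Y - E[Y|X] is orthogonal to every square-integrable function of X:

```latex
\begin{aligned}
\mathbb{E}\big[(Y - g(X))^2\big]
  &= \mathbb{E}\big[(Y - \mathbb{E}[Y \mid X])^2\big]
   + \mathbb{E}\big[(\mathbb{E}[Y \mid X] - g(X))^2\big] \\
  &\ge \mathbb{E}\big[(Y - \mathbb{E}[Y \mid X])^2\big].
\end{aligned}
```

Equality holds exactly when g(X) = E[Y|X] almost surely, which is the Hilbert space projection picture again.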

srean9y ago

Ah but square error is not a metric, its square root is a metric.

Very interesting stuff (at least to me)

graycat9y ago

> Ah but square error is not a metric, its square root is a metric.

Yes, I was using "squared error" because the OP was. What I wrote was modulo a square root missing here and there!

neutralid9y ago

This is very interesting. Thank you.

What book would you recommend for this discussion?


fiatjaf9y ago· 3 in thread

Why geometric mean?, I would ask.

eanzenberg9y ago

Because it's very useful for symmetric distributions. If the distribution is highly non-symmetric, then the mean != maximum likelihood, which is probabilistic.

thomasahle9y ago

"Why addition?", I would ask.

Different problems, different tools. You can't ask "why geometric mean?" without referring to a specific problem you're trying to solve.

fiatjaf9y ago

What is a problem the geometric mean solves? That was my question the entire time.

When people ask "why machine learning?" the answers are "machine learning can do these things blablabla", not "you must specify the problem you're trying to solve".
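One commonly cited example, offered as an illustration rather than the definitive answer: averaging multiplicative growth factors. The numbers below are made up. A gain of 50% followed by a loss of 50% has an arithmetic mean factor of 1.0, which suggests no change, but the geometric mean is the per-period factor that actually reproduces the end result:

```python
from statistics import geometric_mean, mean

# Growth factors for two periods: +50%, then -50%.
factors = [1.5, 0.5]

# Actual cumulative growth is the product of the factors.
total = 1.0
for f in factors:
    total *= f

arith = mean(factors)          # suggests "no change" per period
geo = geometric_mean(factors)  # per-period factor consistent with total

print(total, arith, geo, geo ** len(factors))
```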


dschiptsov9y ago· 3 in thread

To make it positive and to amplify it (as a side-effect).

BTW, "error" is a misleading term - it communicates some fault, at least in the common sense. Distance would be a much better term.

So, "squared distance" makes much more sense, because negative distance is nonsense.

tonyedgecombe9y ago

Well it will only amplify values > 1.

esrauch9y ago

That's not correct. Even though the magnitude of each value in isolation shrinks, the relative magnitudes are still amplified, which is what matters.
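A quick worked check with two made-up errors below 1, using exact fractions to avoid floating-point noise. Both values shrink when squared, yet the larger one goes from 5 times the smaller to 25 times the smaller:

```python
from fractions import Fraction

small, large = Fraction(1, 10), Fraction(1, 2)

# Each value on its own gets smaller when squared...
shrinks = large**2 < large and small**2 < small

# ...but the ratio between them grows.
ratio_before = large / small
ratio_after = large**2 / small**2

print(shrinks, ratio_before, ratio_after)
```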


klodolph9y ago

That's a compelling case for why we should not use "distance", because distance cannot be negative, but the error term can.

Just look at bog-standard linear regressions, say Y_i = m X_i + b + ε_i. It makes no sense to call the ε_i terms "distance".

j7ake9y ago· 2 in thread

The Bayesian formulation for the likelihood function would make this squared error explicitly clear.

stared9y ago

For Gaussian uncertainty. Which still makes it a much more natural assumption than any other I know.

klodolph9y ago

Bayesian formulations are not necessary: the Gaussian is the maximum entropy distribution for a known mean and mean squared error (that is, variance).

jostmey9y ago· 2 in thread

Why not KL-Divergence, which measures the error between a target distribution and the current distribution? From the perspective of Information Theory, it is the best error measurement.

Oh, and let's not forget that for a lot of problems minimizing the KL-divergence is the exact same operation as maximizing the likelihood function.

enthdegree9y ago

kl divergence has no nice theoretical properties other than 'it is the answer to these questions'

it is also extremely poorly behaved numerically and in convergence

srean9y ago

I am sorry but I have to call bullshit on this.

BTW using KL on equal-variance Gaussians gives you squared loss, apparently the loss you prefer.
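The remark about KL and Gaussians can be checked directly, assuming two univariate Gaussians that share the same variance σ² and differ only in their means μ₁ and μ₂ (notation introduced here for the sketch). The normalizing constants cancel, leaving only the quadratic terms:

```latex
\begin{aligned}
D_{\mathrm{KL}}\big(\mathcal{N}(\mu_1,\sigma^2)\,\|\,\mathcal{N}(\mu_2,\sigma^2)\big)
 &= \mathbb{E}_{x\sim\mathcal{N}(\mu_1,\sigma^2)}
    \left[\frac{(x-\mu_2)^2-(x-\mu_1)^2}{2\sigma^2}\right] \\
 &= \frac{\big(\sigma^2+(\mu_1-\mu_2)^2\big)-\sigma^2}{2\sigma^2}
  = \frac{(\mu_1-\mu_2)^2}{2\sigma^2}.
\end{aligned}
```

So up to the constant 1/(2σ²), the divergence is the squared difference of the means.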


thomasahle9y ago· 1 in thread

It's fine to list some reasons for using squared error, but you really can't decide on the error function without referring to a problem you're trying to solve.

Just look at the success of compressed sensing, based on taking the absolute value error seriously.

Sean17089y ago

Which is basically the entire message of the last section.

heisenbit9y ago· 1 in thread

Square often corresponds to power in systems.

heisenbit9y ago

I noticed this got voted up and down more than usual. Maybe a little elaboration:

Square often corresponds to power/energy in systems AND energy (integral of power) is preserved. That relationship between physics and math allows a lot of useful transformations.

kazinator9y ago

TeMPOraL9y ago

My explanation for squared error in linear approximation always was: because it minimizes the thickness of the line that passes through all the data points.

(Per the old math joke - you can make a line passing through any three points on a plane if you make it thick enough.)

theophrastus9y ago

[1] https://en.wikipedia.org/wiki/Variance#Properties

bagrow9y ago

Interesting discussion. Not sure about the breakdown between ridge regression and LASSO though. The difference is not in the error term but in the regularization term.

thisrod9y ago

Normally distributed variables, because of the central limit theorem.

It isn't all that complicated.

highd9y ago

adamzerner9y ago

Also see http://www.leeds.ac.uk/educol/documents/00003759.htm.

redcalx9y ago

Somewhat related; here's my attempt at explaining Cross Entropy:

http://heliosphan.org/cross-entropy.html

jayajay9y ago

cause linear algebra is a beautiful framework to think in.
