Is the point that the cost functions have incompatible gradients around local minima/different local minima?
I think that is part of it: the different cost functions can have different local minima and also different saddle points; ideally even different ridge/valley configurations.
In machine learning there is a well-known technique called Stochastic Gradient Descent (SGD) [1]. There the cost function is the sum of a very large number of terms reflecting how well each element of the training set has been reproduced.
With SGD the optimisation steps use randomly chosen cost functions which are obtained by choosing a random subset of the training set.
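A minimal sketch of that idea, assuming a toy problem (fitting a line y ≈ w*x + b by mean squared error). The function name `minibatch_sgd`, the batch size and the learning rate are made up for illustration; the point is only that each step uses a different, randomly chosen cost function built from a small subset of the data.

```python
import random

def minibatch_sgd(xs, ys, batch_size=8, lr=0.05, steps=2000, seed=0):
    """Fit y ~ w*x + b by SGD on mean squared error (toy example)."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Each step sees a different cost function: the MSE over a random subset.
        batch = rng.sample(range(n), batch_size)
        grad_w = grad_b = 0.0
        for i in batch:
            err = (w * xs[i] + b) - ys[i]
            grad_w += 2.0 * err * xs[i]
            grad_b += 2.0 * err
        w -= lr * grad_w / batch_size
        b -= lr * grad_b / batch_size
    return w, b

# Hypothetical noise-free training set generated from y = 3x - 1.
xs = [i / 50 for i in range(100)]
ys = [3.0 * x - 1.0 for x in xs]
w, b = minibatch_sgd(xs, ys)
```

Each step touches only 8 of the 100 points, which is where the saved computation comes from.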
I had thought the advantage of SGD was purely in saved computation: by computing the approximate cost function on only a small batch you pay only a tiny fraction of the computational expense. I assumed that if you could use larger batches it would always help the convergence.
This demo writeup makes me realize there may be a benefit from the randomness. Different cost functions may have different local minima, different saddles, different ridges. That helps you not get stuck or even slowed at these points.
[1] https://en.m.wikipedia.org/wiki/Stochastic_gradient_descent
I'm running an MSE hill-climbing thing at the moment, I might give it a go and see if it helps.
This reminds me of the Pandora strategy where you don't ever upvote anything; you only tell it no, to encourage it to wander the search space instead of orbiting tightly around a handful of songs.
Never tell people your favorite, or that is all you will get.
Just now, out of 20 related/recommended slots: one 9/11 conspiracy, one Nuremberg trials? something about Nazi experiments, 6 watched videos I already gave a like a few weeks/months/years ago, some top-10 gimmick artillery guns list, a Belarus drunk-driving compilation, 3 totally random videos in my native language despite the YT UI being set to English, 4 videos actually related to the clip on the page, and finally 2 videos somewhat related to my subscriptions.
That's actually not too far off what the DCT used in JPEG does, in that you're similarly trying to represent sampled data using a series of sinusoidal functions.
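For anyone who wants to see the comparison concretely, here is a rough sketch of a 1D DCT-II and its inverse in plain Python. JPEG itself uses a 2D DCT on 8x8 blocks plus quantization, which this does not attempt; the `compress` helper and the sample signal are hypothetical and just show "keep the first few cosine terms, drop the rest".

```python
import math

def dct2(x):
    """Unnormalized DCT-II: project the samples onto cosines of rising frequency."""
    n = len(x)
    return [sum(x[i] * math.cos(math.pi / n * (i + 0.5) * k) for i in range(n))
            for k in range(n)]

def idct2(coeffs):
    """Inverse of dct2 above (a scaled DCT-III)."""
    n = len(coeffs)
    return [coeffs[0] / n
            + (2.0 / n) * sum(coeffs[k] * math.cos(math.pi / n * (i + 0.5) * k)
                              for k in range(1, n))
            for i in range(n)]

def compress(x, keep):
    """Keep only the `keep` lowest-frequency coefficients, then reconstruct."""
    c = dct2(x)
    return idct2(c[:keep] + [0.0] * (len(c) - keep))

# Hypothetical sampled data: one sine period plus a slow ramp.
samples = [math.sin(2 * math.pi * i / 32) + 0.5 * i / 32 for i in range(32)]
approx = compress(samples, keep=8)
max_err = max(abs(a - b) for a, b in zip(samples, approx))
```

With all 32 coefficients kept the round trip is exact up to floating-point error; `max_err` shows how much is lost when only 8 are kept.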
I wonder how well wavelet-based compression (like JPEG2000) would work for this data, since it's been used before in a demo too:
>JPEG encoding loses information. But it is JPEG decoding that introduces artifacts by filling the missing information with noise.
>jpeg2png is smarter and fills the missing information to create the smoothest possible picture.
https://web.archive.org/web/20160220090011/https://hyperalle...
Even though support isn't universal yet, I wonder how HEIC would fare here.
http://www.romancortes.com/blog/furbee-my-js1k-spring-13-ent...
They used a similar technique of creating an object out of an image.
The maps were called sculpts and consisted of a 64 by 64 image. Each pixel's RGB value was mapped as an XYZ coordinate onto a grid. The grid base mesh was either a sphere or flat, which allowed for a "closed" or "open" object surface so one could simulate holes (it also influenced the physics).
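Roughly, decoding such a map could look like the sketch below. This is not Second Life's actual code: the scaling of each 0..255 channel to the range -0.5..0.5, the `size` parameter and the quad layout are all assumptions made for illustration.

```python
def sculpt_to_vertices(pixels, size=1.0):
    """Turn rows of (r, g, b) pixels (0..255 each) into rows of (x, y, z) vertices.

    Assumed mapping: each channel is scaled linearly to -size/2 .. +size/2.
    """
    return [[tuple((c / 255.0 - 0.5) * size for c in px) for px in row]
            for row in pixels]

def grid_quads(rows, cols):
    """Quad faces as 4-tuples of indices into the row-major flattened grid."""
    return [(r * cols + c, r * cols + c + 1,
             (r + 1) * cols + c + 1, (r + 1) * cols + c)
            for r in range(rows - 1) for c in range(cols - 1)]

# Hypothetical 64 by 64 map: red and green ramp across the image, blue constant,
# which decodes to a flat, tilted sheet.
pixels = [[(x * 4, y * 4, 128) for x in range(64)] for y in range(64)]
vertices = sculpt_to_vertices(pixels)
faces = grid_quads(64, 64)
```

A closed (sphere-like) surface would additionally need the grid edges stitched together, which this sketch leaves out.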
The users figured out a lot of funky tricks.
Like having multiple connected objects described by only one such map, with the connection being an ultra-thin line which did not get rendered. This lowered the "costs" Second Life calculated, since it consisted of only one primitive.
Or using a bug to create giant versions of these things to build houses and other big objects.
On top of that, automatic LOD was quite easy: just halve the number of grid vertices and there is your lower-poly object.
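As a sketch of why that is cheap on a regular grid: dropping every other row and column gives the next LOD level with no mesh simplification needed. Halving along each axis (so a quarter of the vertices overall) is my assumption about what "halve" meant in practice, and `halve_grid` is a made-up name.

```python
def halve_grid(grid):
    """Keep every second row and every second column of a vertex grid."""
    return [row[::2] for row in grid[::2]]

# Hypothetical 64 by 64 grid of (x, y, z) vertices on a flat sheet.
grid = [[(x / 63.0, y / 63.0, 0.0) for x in range(64)] for y in range(64)]

lod1 = halve_grid(grid)   # 32 by 32
lod2 = halve_grid(lod1)   # 16 by 16
```

Every vertex in a lower level is an unchanged vertex of the full-resolution grid, so the shape stays recognizable as detail drops.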