My first thought is that it makes sense that averaging together a bunch of local predictions would work well on the ImageNet task, since the different classes tend to have obviously different local textures, and class-relevant information makes up a large part of the image. I would be very curious to see if the technique is as competitive for other tasks.
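For what it's worth, here is a minimal sketch of what "averaging together a bunch of local predictions" could look like. It is not the model from the article: `local_logits` is a made-up stand-in for a small patch classifier, and the patch size, stride, and the two toy classes are assumptions for illustration only.

```python
import math


def softmax(logits):
    """Turn a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def extract_patches(image, size, stride):
    """Yield size x size patches from an image given as a 2D list of floats."""
    rows, cols = len(image), len(image[0])
    for r in range(0, rows - size + 1, stride):
        for c in range(0, cols - size + 1, stride):
            yield [row[c:c + size] for row in image[r:r + size]]


def local_logits(patch):
    """Hypothetical patch classifier: it only ever sees one small patch.

    Class 0 is "dark texture", class 1 is "bright texture".
    A real system would use a small trained network here.
    """
    flat = [p for row in patch for p in row]
    mean = sum(flat) / len(flat)
    return [1.0 - mean, mean]


def classify(image, size=2, stride=2):
    """Average the per-patch logits, then apply softmax once at the end."""
    patch_logits = [local_logits(p) for p in extract_patches(image, size, stride)]
    n = len(patch_logits)
    n_classes = len(patch_logits[0])
    averaged = [sum(l[k] for l in patch_logits) / n for k in range(n_classes)]
    return softmax(averaged)


image = [
    [0.9, 0.8, 0.1, 0.9],
    [0.9, 0.9, 0.2, 0.8],
    [0.7, 0.9, 0.9, 0.9],
    [0.8, 0.8, 0.9, 0.7],
]
probs = classify(image)
print("class probabilities:", probs)
```

Note that no patch ever sees the global layout of the image, which is why this kind of model should do best when class-relevant texture covers most of the picture.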
Or are low-level features the "easiest" part for the network to learn?
Edit1: I understand, of course, the academic purity of working from raw data.
Edit2: So simulated means lots of samples and on-policy learning, but it is also very CPU-intensive.
I'm not sure what the most interesting problems with that property are. Maybe making specialized classifiers for people based on personal labeling? I've always wanted e.g. a twitter filter that excludes specifically the tweets that I don't want to read from my stream.
o the lower the level of representation, the more important it is to increase the level of abstraction by learning or manually defining new features
o in the presence of a (fine-grained) raster, (automated) feature engineering is especially important. Therefore, feature engineering is important in audio analysis (1D raster) and video analysis (2D raster).
According to them, transfer learning can improve a time series model if you pick the right dataset to transfer from, but they don't seem to be getting the same unbelievably strong transfer results that you'd see on images and text.
He's almost describing a future where we might buy/license pre-trained models from Google/Facebook/etc that are trained on huge datasets, and then extend that with more specific training from other sources of data in order to end up with a model suited to the problem being solved.
It also sounds like we can feed the model's learnings back into new models with new architectures as well, as we discover better approaches later.
Yup, that's basically it. (Although I think there might be more than two parties involved; I think probably there will be one giant pretrained image model that everyone in the world starts from, then someone will specialize it for some domain, then someone will specialize that for some subdomain, all the way down to an individual person's problem, which might only have a few thousand data points.)
The same as the "life-long" coding scenario where monoliths are tweaked incrementally forever.
They may have niches but they'll kinda suck, because the underlying problem-space evolves too. Code loses value with age.
Where are things headed?
There's a growing consensus that deep learning is going to be a centralizing technology rather than a decentralizing one. We seem to be headed toward a world where the only people with enough data and compute to train truly state-of-the-art networks are a handful of large tech companies.
This is terrifying, but the same conclusion that I've come to.
I'm starting to feel more and more dread that this isn't how the future was supposed to be. I used to be so passionate about technology, especially about AI as the last solution in computer science.
But anymore, the most likely scenario I see for myself is moving out into the desert like Obi-Wan Kenobi. I'm just, so weary. So unbelievably weary, day by day, in ever increasing ways.
There are enormous problems for humanity to solve, but that has always been the case. From plagues and famines, to world wars, to now climate change, AI risk, and maybe technology centralization. We've solved massive problems before at unbelievable odds, and I want to think we'll do it again. And if not, what of it? What else is there to do but work tirelessly at attempting to solve them?
I hope you feel better, and find help if you need it -- don't mean to presume too much. My e-mail is in my profile if you (or anyone else) needs someone to talk to.
On one hand, I absolutely see the logic, feel the occasional despair, and tend to agree with you, especially when it comes to economies of scale. I'll never write algos that'll detect the alpha hedge funds can. I'll never write the NLP that my own employer can leverage trivially.
On the other hand, do I really need to? In 90% of the use cases where I want to solve a problem, with some pile of hacks and heuristics I've gotten "more than good enough." And the big companies will keep investing in ways to scale up and optimize these algos, which will only benefit us tiny users too. I did both my last publication and patent using a CPU-bound model and not an ounce of deep learning, with a corpus you could fit on a thumbdrive.
I've watched a bigCo spend _months_ of work from some of the best engineers I know optimizing a tiny subproblem of a subproblem (object similarity detection). Meanwhile I had to solve an isomorphism for my home camera system, threw together a prototype in a few hours with OpenCV and _really_ rudimentary bit-array hacks, declared it "WORKABLE" and have been using it for the last 3 years. There are some areas where what's in open source is pretty much what I'd use given any options (Pandas, Spark, Postgres) and some areas where it's not (pgAdmin :P and OS UX (looking at you, Canonical) to name two). This isn't a one-sided battle, to the strength and credit of non-big-corps.
Maybe it's the eternal rebel in me, but I'm a fan of desert kenobi, it's the start of a journey. Stick it to the man!
fwiw, this quoted bit also jumped out at me as perhaps the most important note
I never know what to make of the psychology of this/my community (tech, specifically), but the dynamics here on HN always provide me lots of "food for thought" to overfit :)
After doing a back-of-the-envelope calculation and estimating that AlphaGo Zero used 250 MWh of power, I concluded we were going to see AI monopolies. Someone would develop the best image recognition, corner the market and get a stream of profits that allowed them to pour more money into training, and round and round we go. As you say, it was a depressing thought.
If deep feature learning / embedding is really this effective, it turns that upside down. We could end up with an AI being constructed from layers you buy off the shelf. Lots of parts coming from different vendors, all competing - not unlike the software stacks we use now.
It might even go the MJPEG / AV1 way - large companies get the shits with paying someone for the layers and combine their resources to build a better layer than any one of them could on their own, and they can do it because they are all going to do additional training on top and put it to a different use.
This is impossibly speculative - but thinking you had to invest in 250 MWh to build a decent Go machine was equally speculative and not nice. No open source group was going to come up with the next killer Go playing box if it turned out that way. Now there is a possibility it may not.
o https://github.com/Featuretools/featuretools - Automated feature engineering with main focus on relational structures and deep feature synthesis
o https://github.com/blue-yonder/tsfresh - Automatic extraction of relevant features from time series
o https://github.com/machinalis/featureforge - creating and testing machine learning features, with a scikit-learn compatible API
o https://github.com/asavinov/lambdo - Feature engineering and machine learning: together at last! The workflow engine allows for integrating feature training and data wrangling tasks with conventional ML
o https://github.com/xiaoganghan/awesome-feature-engineering - other resource related to feature engineering (video, audio, text)
Breaks 70% accuracy on the Winograd schema for the first time! (a lazy 7% improvement in performance....)
I'd be interested to hear what kind of experience people are having with these frameworks in production.
On a tangent, I really like the tone of voice in this article. Wide eyed, optimistic and forward looking while at the same time knowledgeable and practical. (Thanks!)
On that note, does anyone know if state-of-the-art models trained on billions of images (such as Facebook's model trained via Instagram tags/images, mentioned in the post) are publicly available and, if so, where?
Everything I turn up with a brief Google seems to have been trained on ImageNet, which the post leads me to believe is now small and sub-par ...
What I find a bit strange is the excitement that's going on. I find a lot of these results pretty expected. Or at least this is what I and anybody I talked to at the time seemed to think would happen. Of course, the thing about science is that sometimes you have to do the boring work of seeing if it does, indeed, work like that. So while I've been glancing sidelong at the ML work going on, it's been mostly a checklist of "Oh cool. So it does work. I'm glad".
The excitement has really been catching me off guard, though. It's as if nobody else expected it to work like this. This in turn makes me wonder if I'm being stupidly naive. Normally I find when somebody thinks, "Oh it was obvious" it's because they had an oversimplified view of it and it just happened to superficially match with reality. I suspect that's the case with me :-)
For those doing research in the area (and I know there are some people here), what have been the biggest discoveries/hurdles that we've overcome in the last 20 or 30 years? In retrospect, what were the biggest worries you had in terms of wondering if it would work the way you thought it might? Going forward, what are the most obvious hurdles that, if they don't work out, might slow down or halt our progression?
https://papers.nips.cc/paper/4824-imagenet-classification-wi...
What you're saying is true, it should have worked in theory, but it just wasn't working for decades. The AlexNet team made several critical optimizations to get it to work: (a) a big network, (b) training on GPU, and (c) using a ReLU instead of tanh(x).
In the end, it was the hardware that made it possible, but up until their paper it really wasn't for sure. A good analogy is the invention of the airplane. You can speculate all you want about the curvature of a bird's wing and lift, but until you actually build a wing that flies, it's all speculation.
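To illustrate point (c): a commonly cited reason, which the comment above does not spell out, is that tanh saturates, so its derivative shrinks toward zero for large inputs, while ReLU's derivative stays at 1 for any positive input. The sketch below is a toy calculation, not anything from the AlexNet code; `chained_grad` is a hypothetical helper that ignores the weights and simply multiplies the same local derivative across layers.

```python
import math


def tanh_grad(x):
    """Derivative of tanh(x): 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2


def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def chained_grad(grad_fn, pre_activation, depth):
    """Toy model of backprop through `depth` layers.

    Assumption for illustration: every layer sees the same pre-activation
    and the weights are ignored, so the result is just the local
    derivative raised to the power `depth`.
    """
    return grad_fn(pre_activation) ** depth


for x in (0.5, 2.0, 4.0):
    t = chained_grad(tanh_grad, x, depth=5)
    r = chained_grad(relu_grad, x, depth=5)
    print(f"x={x}: tanh chain={t:.3e}, relu chain={r:.3e}")
```

In this toy setup the tanh chain collapses quickly as the pre-activation grows, while the ReLU chain stays at 1 for positive inputs.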
- It works (that well) only for vision (for language it sort-of-works only at the word level: http://p.migdal.pl/2017/01/06/king-man-woman-queen-why.html)
- "Do Better ImageNet Models Transfer Better?" https://arxiv.org/abs/1805.08974
And if you want to play with transfer learning, here is a tutorial with a working notebook: https://deepsense.ai/keras-vs-pytorch-avp-transfer-learning/
I switched to extracting features from an ImageNet-pretrained network and trained an XGBoost binary classifier on them, and boom... right out of the box I'm seeing ~88% accuracy.
Also, the author's points about speed of training and flexibility are a major plus for my work. Hope this helps others.
We're actually working on an image model specialized for human faces right now, since it's such a common problem and people usually don't have huge datasets.
I'm curious about how this works in practice. Is it always good enough to take the outputs of the next-to-last layer as features? When doing quick iterations, I assume the images in the data set have been run through the big net as a preparation step? And the inputs to the net you're training is the features? Does the new net always only need 1 layer?
What are some examples of where this worked well (except for the flowers mentioned in the article)?
It usually doesn't matter all that much whether you take the next-to-last or the third from last, it all performs pretty similarly. If you're doing transfer to a task that's very dissimilar from the pretraining task, I think it can sometimes be helpful to take the first dense layer after the convolutional layers instead, but I can't seem to find the paper where I remember reading that, so take it with a grain of salt.
> When doing quick iterations, I assume the images in the data set have been run through the big net as a preparation step?
Yep. (And, crucially, you don't have to run them through again every iteration.)
> And the inputs to the net you're training is the features? Does the new net always only need 1 layer?
Yeah, you take the activations of the late layer of the pretrained net and use them as the input features to the new model you're training. The new model you're training can be as complicated as you like, but usually a simple linear model performs great.
> What are some examples of where this worked well (except for the flowers mentioned in the article)?
The first paper in the post (https://arxiv.org/abs/1403.6382) covers about a dozen different tasks.
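The workflow in the answers above (run every image through the big net once, cache the activations, then train a simple linear model on them) can be sketched roughly like this. Everything here is a toy assumption: `pretrained_features` is a hypothetical stand-in for a frozen pretrained network (a fixed random projection plus ReLU), the "images" are random 8-number vectors with made-up labels, and the linear model is plain logistic regression trained by gradient descent.

```python
import math
import random

random.seed(0)
N_INPUT, N_FEATURES = 8, 4

# Hypothetical frozen "pretrained net": fixed weights that are never updated.
FROZEN_WEIGHTS = [[random.uniform(-1, 1) for _ in range(N_INPUT)]
                  for _ in range(N_FEATURES)]
extractor_calls = 0  # counts how often the expensive network is run


def pretrained_features(image):
    """Stand-in for taking late-layer activations of a pretrained net."""
    global extractor_calls
    extractor_calls += 1
    return [max(0.0, sum(w * x for w, x in zip(row, image)))
            for row in FROZEN_WEIGHTS]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_head(features, labels, epochs=500, lr=0.02):
    """Full-batch logistic regression on the cached features only."""
    n, n_feat = len(features), len(features[0])
    weights, bias, losses = [0.0] * n_feat, 0.0, []
    for _ in range(epochs):
        grad_w, grad_b, loss = [0.0] * n_feat, 0.0, 0.0
        for f, y in zip(features, labels):
            p = sigmoid(sum(w * v for w, v in zip(weights, f)) + bias)
            loss -= y * math.log(p + 1e-12) + (1 - y) * math.log(1 - p + 1e-12)
            grad_w = [g + (p - y) * v for g, v in zip(grad_w, f)]
            grad_b += p - y
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
        losses.append(loss / n)
    return weights, bias, losses


# Toy dataset: the label depends on which half of the vector is "brighter".
images = [[random.random() for _ in range(N_INPUT)] for _ in range(40)]
labels = [1 if sum(img[:4]) > sum(img[4:]) else 0 for img in images]

# Preparation step: one pass through the big net, results cached.
cached = [pretrained_features(img) for img in images]

# Every training epoch reuses the cache; the big net is never run again.
weights, bias, losses = train_linear_head(cached, labels)
print("extractor calls:", extractor_calls)
print("first and last training loss:", losses[0], losses[-1])
```

The point of the call counter is the "crucially" part above: 500 epochs of training still cost only one forward pass per image through the expensive network.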
https://medium.com/bethgelab/neural-networks-seem-to-follow-...