The Unreasonable Effectiveness of Deep Feature Extraction

(basilica.ai)

324 points · hiphipjorge · 7y ago · 55 comments

46 comments · 13 top-level

mlucy · 7y ago · 9 in thread

Hi everyone! Author here. Let me know if you have any questions, this is one of my favorite subjects in the world to talk about.

skybrian · 7y ago

What do you think of the BagNet paper? It sounds like the important thing for image recognition is just coming up with local features?

https://openreview.net/forum?id=SkfMWhAqYQ

mlucy · 7y ago

I hadn't read it before! That's a fascinating result, actually. They emphasize interpretability in the paper, but I find it more interesting that you can do so well with only local information.

My first thought is that it makes sense that averaging together a bunch of local predictions would work well on the ImageNet task, since the different classes tend to have obviously different local textures, and class-relevant information makes up a large part of the image. I would be very curious to see if the technique is as competitive for other tasks.
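To make "averaging together a bunch of local predictions" concrete, here is a minimal sketch. It is not the BagNet architecture: the per-patch classifier below is a hypothetical linear map with random weights standing in for a network with a small receptive field, and the patch size, stride, and class count are arbitrary assumptions.

```python
import numpy as np

def extract_patches(image, patch, stride):
    """Return every patch x patch crop of a 2-D image, flattened to a vector."""
    h, w = image.shape
    crops = [
        image[r:r + patch, c:c + patch].ravel()
        for r in range(0, h - patch + 1, stride)
        for c in range(0, w - patch + 1, stride)
    ]
    return np.stack(crops)

def bag_of_local_predictions(image, weights, bias, patch=4, stride=2):
    """Score each patch independently, then average the per-patch logits."""
    patches = extract_patches(image, patch, stride)  # (n_patches, patch * patch)
    local_logits = patches @ weights + bias          # (n_patches, n_classes)
    return local_logits.mean(axis=0)                 # (n_classes,)

rng = np.random.default_rng(0)
image = rng.normal(size=(16, 16))        # toy grayscale image
weights = rng.normal(size=(4 * 4, 3))    # hypothetical local classifier, 3 classes
bias = np.zeros(3)

image_logits = bag_of_local_predictions(image, weights, bias)
predicted_class = int(np.argmax(image_logits))
```

Because the final score is a plain mean over patches, the spatial arrangement of the patches has no effect on the prediction, which is the sense in which only local information is used.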

yazr · 7y ago

I come from deep reinforcement learning. When considering simulated environments (such as AlphaZero, AlphaStar), can feature engineering dramatically improve the CPU requirement or sample efficiency?

Or are low-level features the "easiest" part for the network to learn?

Edit1 : I understand of course the academic purity of working from raw data.

Edit2: so simulated means lots of samples, on policy learning, but also very cpu intensive.

fouc · 7y ago

What do you think are the most interesting types of problems to solve with this?

mlucy · 7y ago

I think if you have a small to medium sized dataset of images or text, deep feature extraction would be the first thing I'd try.

I'm not sure what the most interesting problems with that property are. Maybe making specialized classifiers for people based on personal labeling? I've always wanted e.g. a twitter filter that excludes specifically the tweets that I don't want to read from my stream.

asavinov · 7y ago

IMHO (deep) feature engineering is important in these cases:

o the lower the level of representation, the more important it is to increase the level of abstraction by learning or manually defining new features

o in the presence of a (fine-grained) raster, (automated) feature engineering is especially important. Therefore, feature engineering is important in audio analysis (1d raster) and video analysis (2d raster).

julius_set · 7y ago

Great article. I have a question pertaining to time series data: would this work well on a smaller dataset of preprocessed sensor readings for HAR?

mlucy · 7y ago

I don't work with time series data much myself. I would imagine you can get at least some transfer learning, since there are patterns that show up across different domains. It looks like there's been a little bit of work done on this: https://arxiv.org/pdf/1811.01533.pdf .

According to them, transfer learning can improve a time series model if you pick the right dataset to transfer from, but they don't seem to be getting the same unbelievably strong transfer results that you'd see on images and text.

jewelthief91 · 7y ago

Considering the rate of change in this field, what would be beneficial to learn for people who don't actually get to use machine learning in their day to day job? I'd love to dive in and learn more about machine learning but I don't want to waste time learning something that will be totally irrelevant in a couple years.

fouc · 7y ago · 5 in thread

>But in the future, I think ML will look more like a tower of transfer learning. You'll have a sequence of models, each of which specializes the previous model, which was trained on a more general task with more data available.

He's almost describing a future where we might buy/license pre-trained models from Google/Facebook/etc that are trained on huge datasets, and then extend that with more specific training from other sources of data in order to end up with a model suited to the problem being solved.

It also sounds like we can feed the model's learnings back into new models with new architectures as well as we discover better approaches later.

mlucy · 7y ago

> He's almost describing a future where we might buy/license pre-trained models from Google/Facebook/etc that are trained on huge datasets, and then extend that with more specific training from other sources of data in order to end up with a model suited to the problem being solved.

Yup, that's basically it. (Although I think there might be more than two parties involved; I think probably there will be one giant pretrained image model that everyone in the world starts from, then someone will specialize it for some domain, then someone will specialize that for some subdomain, all the way down to an individual person's problem, which might only have a few thousand data points.)

XuMiao · 7y ago

What do you think of life-long learning scenario that models are trained incrementally forever? For example, I train a model with 1000 examples, and it sucks. The next guy picks it up and trains a new one by putting a regularizer over mine. It might still suck. But after maybe 1000 people, the model begins to get significantly better. Now I pick up where I left off and improve it by leveraging the current best. This continues forever. Imagine that this community is supported by a blockchain. Eventually we won't be relying on big companies any more.

jacquesm · 7y ago

What is it with the word 'blockchain' that will make people toss it into otherwise completely unrelated text?

Terr_ · 7y ago

> What do you think of life-long learning scenario that models are trained incrementally forever?

The same as the "life-long" coding scenario where monoliths are tweaked incrementally forever.

They may have niches but they'll kinda suck, because the underlying problem-space evolves too. Code loses value with age.

gipp · 7y ago

Not sure if you were just being cheeky, but this is pretty much exactly what GCP's AutoML offerings are.

zackmorris · 7y ago · 5 in thread

From the article:

> Where are things headed?

> There's a growing consensus that deep learning is going to be a centralizing technology rather than a decentralizing one. We seem to be headed toward a world where the only people with enough data and compute to train truly state-of-the-art networks are a handful of large tech companies.

This is terrifying, but the same conclusion that I've come to.

I'm starting to feel more and more dread that this isn't how the future was supposed to be. I used to be so passionate about technology, especially about AI as the last solution in computer science.

But anymore, the most likely scenario I see for myself is moving out into the desert like Obi-Wan Kenobi. I'm just so weary. So unbelievably weary, day by day, in ever increasing ways.

coffeemug · 7y ago

Hey, I hope you don't take it the wrong way -- I'm coming from a place where I hope you start feeling better -- but what you're experiencing might be depression/mood affiliation. I.e. you feel weary and bleak, so the world seems weary and bleak.

There are enormous problems for humanity to solve, but that has always been the case. From plagues and famines, to world wars, to now climate change, AI risk, and maybe technology centralization. We've solved massive problems before at unbelievable odds, and I want to think we'll do it again. And if not, what of it? What else is there to do but work tirelessly at attempting to solve them?

I hope you feel better, and find help if you need it -- don't mean to presume too much. My e-mail is in my profile if you (or anyone else) needs someone to talk to.

guelo · 7y ago

It seems kind of obvious in retrospect. I used to envision "the singularity" as somehow organically emerging from distributed technology and that would make it benevolent. But that was so naive. The singularity was always going to require massive investments of a scale that only monopolies or militaries can provide. That it currently looks like it will come out of ad-tech monopolies comfortable with psychological manipulation at a global scale is the most terrifying possibility of all.

existencebox · 7y ago

I'm torn.

On one hand, I absolutely see the logic, feel the occasional despair, and tend to agree with you, especially when it comes to economies of scale. I'll never write algos that'll detect the alpha hedge funds can. I'll never write the NLP that my own employer can leverage trivially.

On the other hand, do I really need to? In 90% of the use cases where I want to solve a problem, with some pile of hacks and heuristics I've gotten "more than good enough." And the big companies will keep investing in ways to scale up and optimize these algos, which will only benefit us tiny users too. I did both my last publication and patent using a CPU-bound model and not an ounce of deep learning, with a corpus you could fit on a thumbdrive.

I've watched a bigCo spend _months_ of some of the best engineers I know to optimize a tiny subproblem of a subproblem. (object similarity detection) Meanwhile I had to solve an isomorphism for my home camera system, threw together a prototype in a few hours with openCV and _really_ rudimentary bit-array hacks, declared it "WORKABLE" and have been using it for the last 3 years. There are some areas where what's in open source is pretty much what I'd use given any options. (Pandas, Spark, postgres) and some areas where it's not (pgadmin :P and OS UX (looking at you canonical) to name two). This isn't a one sided battle, to the strength and credit of non-big-corps.

Maybe it's the eternal rebel in me, but I'm a fan of desert kenobi, it's the start of a journey. Stick it to the man!

patcon · 7y ago

I recall this was the top-voted comment (rightfully imho) until shortly after someone suggested that the author was depressed instead of being reasonable... Now it's at the bottom, and lots of "wow, this is super-interesting" comments are above.

fwiw, this quoted bit also jumped out at me as perhaps the most important note

I never know what to make of the psychology of this/my community (tech, specifically), but the dynamics here on HN always provide me lots of "food for thought" to overfit :)

rstuart4133 · 7y ago

I took the opposite conclusion from it.

After doing a back-of-the-envelope calculation and concluding that AlphaGo Zero used 250MW hours of power, I concluded we were going to see AI monopolies. Someone would develop the best image recognition, corner the market and get a stream of profits that allowed them to pour more money into training, and round and round we go. As you say, it was a depressing thought.

If deep feature learning / embedding is really this effective, it turns that upside down. We could end up with an AI being constructed from layers you buy off the shelf. Lots of parts coming from different vendors, all competing - not unlike the software stacks we use now.

It might even go the MJPEG / AV1 way - large companies get the shits with paying someone for the layers and combine their resources to build a better layer than any one of them could on their own, and they can do it because they are all going to do additional training on top and put it to a different use.

This is impossibly speculative - but thinking you had to invest in 250MW hours to build a decent Go machine was equally speculative and not nice. No open source group was going to come up with the next killer Go playing box if it turned out that way. Now there is a possibility it may not.

asavinov · 7y ago · 3 in thread

Deep feature extraction is important for not only image analysis but also in other areas where specialized tools might be useful such as listed below:

o https://github.com/Featuretools/featuretools - Automated feature engineering with main focus on relational structures and deep feature synthesis

o https://github.com/blue-yonder/tsfresh - Automatic extraction of relevant features from time series

o https://github.com/machinalis/featureforge - creating and testing machine learning features, with a scikit-learn compatible API

o https://github.com/asavinov/lambdo - Feature engineering and machine learning: together at last! The workflow engine allows for integrating feature training and data wrangling tasks with conventional ML

o https://github.com/xiaoganghan/awesome-feature-engineering - other resource related to feature engineering (video, audio, text)

mlucy · 7y ago

Definitely. There's been a lot of exciting work recently for text in particular, like https://arxiv.org/pdf/1810.04805.pdf .

nl · 7y ago

Or from today, OpenAI's response to BERT: https://blog.openai.com/better-language-models/

Breaks 70% accuracy on the Winograd schema for the first time! (a lazy 7% improvement in performance....)

psandersen · 7y ago

This is a great resource, thanks for sharing!

I'd be interested to hear what kind of experience people are having with these frameworks in production.

kieckerjan · 7y ago · 3 in thread

As the author acknowledges, we might be living in a window of opportunity where big data firms are giving something away for free that may yet turn out to be a big part of their future IP. Grab it while you can.

On a tangent, I really like the tone of voice in this article. Wide eyed, optimistic and forward looking while at the same time knowledgeable and practical. (Thanks!)

gmac · 7y ago

> big data firms are giving something away for free

On that note, does anyone know if state-of-the-art models trained on billions of images (such as Facebook's model trained via Instagram tags/images, mentioned in the post) are publicly available and, if so, where?

Everything I turn up with a brief Google seems to have been trained on ImageNet, which the post leads me to believe is now small and sub-par ...

hamilyon2 · 7y ago

Have you found anything?

chasely · 7y ago

I also found the writing to be engaging and informative. Not many product websites have posts that make me go back through their archive.

mikekchar · 7y ago · 3 in thread

It's hard to ask my question without sounding a bit naive :-) Back in the early nineties I did some work with convolutional neural nets, except that at that time we didn't call them "convolutional". They were just the neural nets that were not provably uninteresting :-) My biggest problem was that I didn't have enough hardware and so I put that kind of stuff on a shelf waiting for hardware to improve (which it did, but I never got back to that shelf).

What I find a bit strange is the excitement that's going on. I find a lot of these results pretty expected. Or at least this is what I and anybody I talked to at the time seemed to think would happen. Of course, the thing about science is that sometimes you have to do the boring work of seeing if it does, indeed, work like that. So while I've been glancing sidelong at the ML work going on, it's been mostly a checklist of "Oh cool. So it does work. I'm glad".

The excitement has really been catching me off guard, though. It's as if nobody else expected it to work like this. This in turn makes me wonder if I'm being stupidly naive. Normally I find when somebody thinks, "Oh it was obvious" it's because they had an oversimplified view of it and it just happened to superficially match with reality. I suspect that's the case with me :-)

For those doing research in the area (and I know there are some people here), what have been the biggest discoveries/hurdles that we've overcome in the last 20 or 30 years? In retrospect, what were the biggest worries you had in terms of wondering if it would work the way you thought it might? Going forward, what are the most obvious hurdles that, if they don't work out might slow down or halt our progression?

aabajian · 7y ago

If you haven't, you should take a few moments to read the original AlexNet paper (only 11 pages):

https://papers.nips.cc/paper/4824-imagenet-classification-wi...

What you're saying is true, it should have worked in theory, but it just wasn't working for decades. The AlexNet team made several critical optimizations to get it to work: (a) big network, (b) training on GPU, and (c) using a ReLU instead of tanh(x).

In the end, it was the hardware that made it possible, but up until their paper it really wasn't for sure. A good analogy is the invention of the airplane. You can speculate all you want about the curvature of a bird's wing and lift, but until you actually build a wing that flies, it's all speculation.

dchichkov · 7y ago

We've learned to learn cost functions, instead of hardcoding. Discriminative models are always worrisome. Nuclear war, unexpected deterioration of democracy or unexpected and rapid change of climate.

pwbdecker · 7y ago

I feel the same way. I was working on NN research in the 2000s and spent all my time trying to optimize the performance to deal with non-trivial data sets. I got as far as working on GPU implementations around 2008 before I moved on to other subjects, but seeing these results now is incredibly validating. There was no shortage of profs and grad students that scoffed at me at the time. I still kick myself for not keeping up with it.

stared · 7y ago · 2 in thread

A few caveats here:

- It works (that well) only for vision (for language it sort-of-works only at the word level: http://p.migdal.pl/2017/01/06/king-man-woman-queen-why.html)

- "Do Better ImageNet Models Transfer Better?" https://arxiv.org/abs/1805.08974

And if you want to play with transfer learning, here is a tutorial with a working notebook: https://deepsense.ai/keras-vs-pytorch-avp-transfer-learning/

mlucy · 7y ago

There's actually been a lot of really good work recently around textual transfer learning. Google's BERT paper does sentence-level pretraining and transfer to get state of the art results on a bunch of problems: https://arxiv.org/pdf/1810.04805.pdf

stared · 7y ago

Thanks for this reference, I will look it up. Though, from my experience, people in NLP still (by default) train from scratch, with some exceptions for tasks on the same dataset:

- https://blog.openai.com/unsupervised-sentiment-neuron/

- http://ruder.io/nlp-imagenet/

bobosha · 7y ago · 1 in thread

This is very interesting and timely for my work. I had been struggling to train a MobileNet CNN to classify human emotions ("in the wild") and couldn't get the model to converge. I tried going from multiclass to binary models, e.g. angry|not_angry, but couldn't get past the 60-70% accuracy range.

I switched to extracting features from an ImageNet-pretrained model and trained an xgboost binary classifier on them, and boom... right out of the box I am seeing ~88% accuracy.

Also, the author's points about speed of training and flexibility are a major plus for my work. Hope this helps others.

mlucy · 7y ago

Yeah, I think this pattern is pretty common. (Basilica's main business is an API that does deep feature extraction as a service, so we end up talking to a lot of people with tasks like yours -- and there are a lot of them.)

We're actually working on an image model specialized for human faces right now, since it's such a common problem and people usually don't have huge datasets.

jfries · 7y ago · 1 in thread

Very interesting article! It answered some questions I've had for a long time.

I'm curious about how this works in practice. Is it always good enough to take the outputs of the next-to-last layer as features? When doing quick iterations, I assume the images in the data set have been run through the big net as a preparation step? And the inputs to the net you're training is the features? Does the new net always only need 1 layer?

What are some examples of where this worked well (except for the flowers mentioned in the article)?

mlucy · 7y ago

> Is it always good enough to take the outputs of the next-to-last layer as features?

It usually doesn't matter all that much whether you take the next-to-last or the third from last, it all performs pretty similarly. If you're doing transfer to a task that's very dissimilar from the pretraining task, I think it can sometimes be helpful to take the first dense layer after the convolutional layers instead, but I can't seem to find the paper where I remember reading that, so take it with a grain of salt.

> When doing quick iterations, I assume the images in the data set have been run through the big net as a preparation step?

Yep. (And, crucially, you don't have to run them through again every iteration.)

> And the inputs to the net you're training is the features? Does the new net always only need 1 layer?

Yeah, you take the activations of the late layer of the pretrained net and use them as the input features to the new model you're training. The new model you're training can be as complicated as you like, but usually a simple linear model performs great.

> What are some examples of where this worked well (except for the flowers mentioned in the article)?

The first paper in the post (https://arxiv.org/abs/1403.6382) covers about a dozen different tasks.
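The workflow described in the answers above (run the data through the big net once, cache the late-layer activations, then train a simple model on them) can be sketched roughly as follows. Everything here is a toy assumption: the "pretrained" network is two frozen layers of random weights standing in for a real pretrained model, the dataset is synthetic, and the linear model is a hand-rolled logistic regression.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pretrained network: frozen random weights (an assumption;
# a real setup would load an actual pretrained model here).
W1 = rng.normal(size=(20, 64)) / np.sqrt(20)
W2 = rng.normal(size=(64, 32)) / np.sqrt(64)

def late_layer_activations(x):
    """Forward pass through the frozen layers; returns late-layer activations."""
    hidden = np.maximum(x @ W1, 0.0)
    return np.maximum(hidden @ W2, 0.0)

# Toy binary dataset: the label depends on the sign of the first input.
X = rng.normal(size=(200, 20))
y = (X[:, 0] > 0).astype(float)

# Run the "big" network ONCE and cache the result; the training loop below
# never touches W1 or W2 again.
features = late_layer_activations(X)
features = (features - features.mean(axis=0)) / (features.std(axis=0) + 1e-8)

# Train a logistic-regression head on the cached features.
w = np.zeros(features.shape[1])
b = 0.0
lr = 0.1
losses = []
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(features @ w + b)))
    losses.append(-np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12)))
    grad = p - y
    w -= lr * (features.T @ grad) / len(y)
    b -= lr * grad.mean()

train_accuracy = np.mean((features @ w + b > 0) == (y == 1))
```

Each training iteration only multiplies a 200 x 32 matrix by a vector, which is why iterating on the small model is fast compared with re-running the large network every step.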

gdubs · 7y ago · 1 in thread

This is probably naive, but I’m imagining something like the US Library of Congress providing these models in the future. E.g., some federally funded program to procure / create enormous data sets / train.

rsfern · 7y ago

I don’t think it’s that naive. NIST is actively getting into this space: https://www.nist.gov/topics/artificial-intelligence

al2o3cr · 7y ago

Contrast a similar writeup on some interesting observations about solving ImageNet with a network that only sees small patches (largest is 33px on a side)

https://medium.com/bethgelab/neural-networks-seem-to-follow-...

purplezooey · 7y ago

The question to me is, can you do this with e.g. Random Forest too, or is it specific to NN?

CMCDragonkai · 7y ago

I'm wondering how this compares to transfer learning applied to the same model. That is, compare deep feature extraction plus a linear model at the end vs. just transferring the weights to the same model and retraining on your specific dataset.
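For readers unsure what distinguishes the two setups being compared, here is a minimal sketch of the mechanical difference. It says nothing about which performs better. The one-layer ReLU "backbone", its random "pretrained" weights, the synthetic data, and the `train` helper are all hypothetical and exist only for illustration.

```python
import numpy as np

def train(X, y, W_backbone, update_backbone, steps=200, lr=0.05):
    """Train a sigmoid head on top of a one-layer ReLU backbone.

    update_backbone=False -> feature extraction (backbone stays frozen)
    update_backbone=True  -> fine-tuning (backbone weights are updated too)
    """
    W = W_backbone.copy()
    head = np.zeros(W.shape[1])
    losses = []
    for _ in range(steps):
        pre = X @ W
        feats = np.maximum(pre, 0.0)
        p = 1.0 / (1.0 + np.exp(-(feats @ head)))
        losses.append(-np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12)))
        d_logit = (p - y) / len(y)            # gradient of the mean loss w.r.t. logits
        grad_head = feats.T @ d_logit
        if update_backbone:
            # Backpropagate through the head and the ReLU into the backbone.
            d_pre = np.outer(d_logit, head) * (pre > 0)
            W -= lr * (X.T @ d_pre)
        head -= lr * grad_head
    return W, head, losses

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
W_pretrained = rng.normal(size=(10, 16)) / np.sqrt(10)  # hypothetical "pretrained" weights

W_frozen, _, frozen_losses = train(X, y, W_pretrained, update_backbone=False)
W_tuned, _, tuned_losses = train(X, y, W_pretrained, update_backbone=True)
```

In the frozen case the backbone output never changes, so it could be computed once and cached; in the fine-tuned case every step needs a fresh forward and backward pass through the backbone.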
