This paper, which received an honorable mention at this year's NeurIPS conference, first attempts to convert the image to a 3D scene before detecting objects. https://arxiv.org/pdf/1906.01618.pdf
So, "construct an internal 3D model of [something in the natural world]" will always be deficient, and any conclusions derived from these models will always have inherent errors (even before you get to algo bias). Self-driving cars, airport face-reading gates, Pixar blockbusters...their models can never represent reality in anything but a temporarily-convincing way. Those that affect policy and people's lives (aka not-entertainment) will always come up with the wrong conclusion sometimes, sometimes fatally.
That is incorrect. From the Wikipedia page you reference:
> as the appearance of a robot is made more human, some observers' emotional response to the robot becomes increasingly positive and empathetic, until it reaches a point beyond which the response quickly becomes strong revulsion. However, as the robot's appearance continues to become less distinguishable from a human being, the emotional response becomes positive once again and approaches human-to-human empathy levels
> The thing about AI/CV and other interpretation simulators is that there is always a quantization of nature in the end result.
That's just not true in any meaningful sense. Computer vision can process images with a higher resolution than the human eye can distinguish.
> So, "construct an internal 3D model of [something in the natural world]" will always be deficient, and any conclusions derived from these models will always have inherent errors
Humans do this too (hence optical illusions). There's no reason to think that machine models can't surpass human models (and in some domains they already do).
All models are wrong, but some are useful.
It is certainly an interesting problem though. I can't talk much about my work, but if anyone wants to collaborate on something open source addressing the core problem, check my profile.
[0]: http://www.aiskyeye.com/upfile/Vision_Meets_Drones_A_Challen...
Maybe a new benchmark like this is what we need to get out of the rut.
[0]: delivery.acm.org/10.1145/3330000/3321441/p177-Barham.pdf
It seems to be how the human mind works. I can rotate an object in my mind and picture it from any angle.
I think humans use both strategies. Sometimes we really rely on superficial visual information, like yellow/black stripes -> time to get away! No need to first perfectly match the visual input to a mental tiger model rotated at the correct orientation. I think split-second recognition is usually like this. Or perhaps we use different strategies for different objects. I could imagine, for example, that facial recognition in the brain is more 2D-feature-based pattern matching than 3D reconstruction.
You can, but that doesn't seem to me to be how the mind works. If it looked at it from every angle simultaneously there would be no speed difference in my recognising an object regardless of the orientation.
But in some cases there is a huge speed difference. I can be staring at it for many seconds and then snap! - oh it's upside down. As soon as I realise this my mind immediately adapts and what was unrecognisable is suddenly as plain as day.
The only way I can account for this is when I finally twig the image is upside down I re-route it through a different path in my brain that does a rotation before feeding it to the recogniser. But normally that path is shut off - it's not constantly scanning the input.
I suppose what happens is in most cases some pre-processor uses other clues in the picture to tell me it's rotated from its normal position and engages the correct path without conscious intervention. That would explain why most of the time you say you don't notice it.
Nonetheless the two mechanisms are very different. A sequential path that does rotation -> recognition will be slightly deeper and slightly slower than one that does both in one step, but far smaller. Nonetheless, it looks to me like modern designs do attempt to do it in one step, which is to say they attempt to recognise the object in all possible orientations simultaneously.
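To make the size/depth trade-off concrete, here is a toy sketch in Python. Everything in it is made up for illustration: the 3x3 "images", the `UPRIGHT` templates, and the trick of picking the smallest of the four rotations as a stand-in for the "pre-processor" that works out orientation. It is not how any real network or brain does it, just the two strategies side by side.

```python
def rot90(grid):
    """Rotate a square grid (tuple of tuples) 90 degrees clockwise."""
    return tuple(zip(*grid[::-1]))

def rotations(grid):
    """Return the grid in all four 90-degree orientations."""
    out = [grid]
    for _ in range(3):
        out.append(rot90(out[-1]))
    return out

def canonical(grid):
    """Hypothetical 'pre-processor': pick one fixed orientation per shape."""
    return min(rotations(grid))

# One upright template per class (toy data).
UPRIGHT = {
    "L": ((1, 0, 0), (1, 0, 0), (1, 1, 1)),
    "T": ((1, 1, 1), (0, 1, 0), (0, 1, 0)),
}

# One-step recogniser: a single lookup, but it stores every orientation.
ALL_ORIENTATIONS = {r: name for name, g in UPRIGHT.items() for r in rotations(g)}

def recognise_one_step(grid):
    return ALL_ORIENTATIONS.get(grid)

# Sequential recogniser: an extra rotation step, but one template per class.
CANONICAL = {canonical(g): name for name, g in UPRIGHT.items()}

def recognise_sequential(grid):
    return CANONICAL.get(canonical(grid))

upside_down_l = rot90(rot90(UPRIGHT["L"]))
print(recognise_one_step(upside_down_l), len(ALL_ORIENTATIONS))
print(recognise_sequential(upside_down_l), len(CANONICAL))
```

Both recognisers get the upside-down "L" right. The one-step table is four times the size of the sequential one, and the sequential one pays for its smaller table with the extra `canonical` step on every input.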
For me, I actually imagine these objects moving/rotating to make sense of them when seen from unusual angles. That hammer you described, I look at it and imagine myself flexing it.
Except there was a wonderful thread about Aphantasia a while ago https://news.ycombinator.com/item?id=20267445
Several HN readers chimed in to say they have this feature. It would be interesting to know if this dataset would stump them.
My guess is every "hammer" image in the training data set was "conventional" -- a convenient angle and orientation. If half the images of "hammers" were instead "unconventional", would the model adapt to realize "my existing model of a hammer is incomplete; there must be a way to consolidate these two different images"?
Or does this require internal 3D modeling, and better inputs wouldn't help; instead the model itself would need to be more advanced?
Nevertheless, I agree with you. A huge dataset with millions of unconventional images may be enough. Who knows.
Things kind of go in cycles in machine learning (similar to other fields). There is nowadays growing dissatisfaction with having to use so much (labeled) data, and people want the models to be better "primed" to capture the variations and structures existing in the real world. Partly that's because labeling a lot of data is just very expensive, but partly it's also seen as inelegant and "black-boxy", or it's just not to their scientific taste.
Other people argue that learning it all from data is fine and this kind of robustness shouldn't have to be baked into models. Rather, it should/could be learned from vast amounts of unlabeled data with unsupervised/self-supervised methods (Yann LeCun seems to be in this group).
Curious if there's supporting reasoning for this type of statement? Imho most of these "objects" should still be learnable with vanilla CNNs if you had sufficient data, especially more angles. Starving a vision network of data is an interesting problem, but I don't think it can be used as a blanket statement for all state of the art techniques. And if I'm allowed to make a naive comparison to human intelligence, I don't think lack of viewing angles is a factor.
It would be nice at least if the computer tools could detect that they are confused, and I know there is some research in that direction.
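One very simple version of "detect that you are confused" is to look at how spread out the classifier's output probabilities are and refuse to answer when they are too flat. The sketch below does that with normalized entropy. The function names and the `max_entropy=0.5` threshold are my own assumptions for illustration, not something from the paper, and softmax confidence is known to be overconfident on unusual inputs, so treat this as a baseline rather than a solution.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def normalized_entropy(probs):
    """Entropy scaled to [0, 1]: 0 = fully certain, 1 = uniform guess."""
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))

def predict_or_abstain(logits, labels, max_entropy=0.5):
    """Return the top label, or None when the distribution is too flat.

    max_entropy is an arbitrary illustrative threshold.
    """
    probs = softmax(logits)
    if normalized_entropy(probs) > max_entropy:
        return None
    return labels[probs.index(max(probs))]

labels = ["hammer", "plunger", "drill"]
print(predict_or_abstain([10.0, 0.0, 0.0], labels))  # confident case
print(predict_or_abstain([1.0, 1.0, 1.0], labels))   # flat case, abstains
```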
When a model is trained on ImageNet the training dataset is (usually) enlarged by doing artificial image augmentation. This does things like rotate, skew, crop and recolor the images so the model understands what the object can look like.
This dataset appears to find angles of objects that are difficult to reproduce using this process.
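For anyone who hasn't seen it, a minimal sketch of that kind of augmentation is below, in plain Python on a tiny grayscale image stored as a list of rows. The helper names and the parameter ranges are my own choices for illustration; real pipelines use an image library and more transforms. Note that every operation here moves or recolors pixels within the 2D image plane, so none of them can show the object from a new 3D viewpoint.

```python
import random

def hflip(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]

def rot90(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(r) for r in zip(*img[::-1])]

def crop(img, top, left, height, width):
    """Cut out a height x width window starting at (top, left)."""
    return [row[left:left + width] for row in img[top:top + height]]

def brighten(img, delta):
    """Shift pixel intensities by delta, clamped to 0..255."""
    return [[min(255, max(0, p + delta)) for p in row] for row in img]

def augment(img, rng):
    """Apply a random flip, rotation, crop, and brightness shift."""
    out = img
    if rng.random() < 0.5:
        out = hflip(out)
    for _ in range(rng.randrange(4)):
        out = rot90(out)
    h, w = len(out), len(out[0])
    out = crop(out, rng.randint(0, 1), rng.randint(0, 1), h - 1, w - 1)
    return brighten(out, rng.randint(-30, 30))

image = [[10 * r + c for c in range(4)] for r in range(4)]
rng = random.Random(0)  # seeded so the run is repeatable
variants = [augment(image, rng) for _ in range(3)]
```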
That is useful, but I can think of two ways to solve this pretty easily that would be achievable and would make a good project for an undergrad or Masters student.
1) Acquire 3D models of each of the object classes, render them at multiple angles in Unreal (or similar) and augment the ImageNet dataset with these images
2) Assuming you want to use the whole ObjectNet dataset as a test set, follow their dataset construction process using Mechanical Turk, and train on that data.
I bet either of these processes would take back 20-30% of the 45% performance drop very easily, and I bet the ones left would be the ones that humans have a lot of trouble identifying.
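For option 1, the part that is easy to get wrong is choosing the camera angles: if you only sweep around the object at eye level you just recreate the "conventional" views. A rough sketch of one way to spread viewpoints evenly over a sphere (a Fibonacci lattice) is below. The function name, the radius, and the assumption that the object sits at the origin are mine, and hooking the positions up to an actual renderer is left out.

```python
import math

def camera_positions(n, radius=2.0):
    """Return n roughly evenly spaced (x, y, z) points on a sphere.

    Assumes the object is centred at the origin and every camera looks at it.
    """
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    positions = []
    for i in range(n):
        z = 1.0 - 2.0 * (i + 0.5) / n      # height, strictly between -1 and 1
        ring = math.sqrt(1.0 - z * z)       # radius of the circle at that height
        theta = golden_angle * i            # spin around the vertical axis
        positions.append((
            radius * ring * math.cos(theta),
            radius * ring * math.sin(theta),
            radius * z,
        ))
    return positions

views = camera_positions(24)
below = sum(1 for _, _, z in views if z < 0)
print(len(views), "viewpoints,", below, "of them from below the object")
```

Half of these viewpoints look at the object from underneath, which is exactly the kind of angle that flipping and rotating a photo can't give you.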
Generally it improves some, but most of the gains are in the final layer retraining.
But it's a lot more data hungry.
> They do something like this, in section 4.3 by splitting ObjectNet in half.... there's still a gap, but it can be plausibly argued that the gap would close up if we scaled things up by one or two orders of magnitude
This is interesting. It's worth noting that this training is on only 64 images per class, and it is unclear if they augment this in any way.
Before retraining, the paper itself notes:
> Classes such as plunger, safety pin and drill have 60-80% accuracy, while French press, pitcher, and plate have accuracies under 5%
It is worth noting that the plunger, safety pin and drill classes are ones that have multiple orientations already in ImageNet, while French press, pitcher, and plate are almost all the "right" way up.
To me this indicates this is simply a data problem - the model has never seen what an upside-down French press looks like so it gets it wrong.
Or is the consensus that it is a matter of time before compute and algorithms make these situations "safe enough," even for edge cases?