This sounds true, but it can't be the real reason—selfies are ranked relative to the other images by the same user. So unless users are taking a lot of #selfies of people of different genders, we can assume the dataset is already controlled for the gender of the person in the image, no? Unless there's some confounding factor at play, such as some demographic segment being more likely to optimize for good selfies occasionally but have boring feeds the rest of the time.
It would be super interesting, if the data is available, to normalize this by exposure. Of the people who saw an image, how many clicked "like"?
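A rough sketch of what normalizing by exposure could look like, assuming per-image view counts were available. The field names and numbers below are made up for illustration, not taken from the article's data:

```python
def like_rate(likes: int, views: int) -> float:
    """Fraction of viewers who liked the image; 0.0 if nobody saw it."""
    if views <= 0:
        return 0.0
    return likes / views


# Hypothetical counts, for illustration only.
photos = {
    "a": {"likes": 120, "views": 4000},
    "b": {"likes": 90, "views": 1500},
}

# Rank by like rate instead of raw like count.
ranked = sorted(photos, key=lambda k: like_rate(**photos[k]), reverse=True)
print(ranked)
```

Photo "a" has more raw likes, but "b" converts a larger share of its viewers, so the two orderings disagree.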
> but it can't be the real reason
Can't? On top of the above-listed aspects, it is entirely possible that there is a bias where both sexes find female appearance somewhat more aesthetically pleasing.
Similar to how focus group testing for computer voices tends to result in female voices being chosen (at least that's what I often hear, couldn't find a solid source).
Even if the bias is small the correlated factors would amplify it when you're optimizing for a maximum, i.e. for the top selection.
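To put rough numbers on the "optimizing for a maximum" point, here is a sketch under some loud assumptions: two equally sized groups, scores normally distributed with the same spread, and a small gap of 0.2 standard deviations between the means. All of those figures are invented for illustration; nothing here comes from the selfie data.

```python
import math


def upper_tail(threshold: float, mean: float, sd: float = 1.0) -> float:
    """P(score > threshold) for a normal distribution."""
    return 0.5 * math.erfc((threshold - mean) / (sd * math.sqrt(2)))


def share_of_group_a(threshold: float, mean_a: float, mean_b: float) -> float:
    """Share of group A among everyone scoring above the threshold,
    assuming both groups are the same size."""
    a = upper_tail(threshold, mean_a)
    b = upper_tail(threshold, mean_b)
    return a / (a + b)


# The stricter the cutoff, the larger group A's share of the selection.
for cutoff in (0.0, 1.0, 2.0, 3.0):
    print(cutoff, round(share_of_group_a(cutoff, mean_a=0.2, mean_b=0.0), 3))
```

The share of the slightly favored group grows as the cutoff moves further into the tail, which is the amplification effect for a "top N" selection.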
Discussion about this with the author reveals that I was misinterpreting how they were collecting averages. I was assuming the "like" count was coming from each photo collected, but instead they collected the photos and the average likes in separate steps, where the average likes were across recent posts by that user, rather than across the selfies by that user.
I personally prefer the Alex voice from Mac OS to female voices. It has nice intonation. If only I could make it correct some of the mistakes it makes, for example not being able to distinguish "read" in the past tense from "read" in the present tense, which makes it sound silly. Another error it makes is confusing "live" as in "live concert" with "live" as in "live in the USA" (these are called heteronyms and are a special case in TTS).
be female
be blonde
be attractive
Incidentally, Christian Rudder did a really good "study" on the dating site pictures a few years ago:
Geoff Hinton had grad students who wanted to work on the problem, but Yann LeCun didn't.
"In about 2012, it should have been Yann's group, but Yann was unlucky, he didn't have a student who really wanted to do it. But we had a couple of students who wanted to do it and we took all of Yann's techniques and added some of our own."
How do we prevent our AIs from learning racism?
EDIT> Informative article, BTW. A good read.
I don't think these algorithms are learning racism. They are only being blunt in revealing what already exists.
That's why it's important to be clear about the question. This ConvNet doesn't really answer the question "What makes a good selfie?". It answers a much narrower question that is more complicated to state.
The absence of reflection in the system means that if it's used to answer a question that's superficially similar to the designer's intent, there's no way to reason around the bias in the training data.
Imagine I'm a Canadian who trains an automated turret to classify friend / foe based on data from Afghanistan and Iraq. I've not trained the system to answer "Is this group of pixels a friend / foe", in the general sense. If the system is used outside the narrow context of its validity, say in Northern Ireland, or in a civilian Muslim neighbourhood in Paris, we should expect bad results.
So you're right to point out that the racism is in the social context. But I'm arguing that we don't actually want a classifier to learn that if there's a good chance it'll be used in a way that discards or ignores that social context. Same as using an expert system outside its domain.
For example, if you train a CNN directly with human faces, its recognition rate comes way below what a human is capable of. Only after you apply tons of handcrafted optimizations, which are mostly black art, will you get close to or surpass a human's capability. Without much domain specific tuning, an AI's insight is far from reliable.
The example is correct, but not for the reasons stated. Humans are very, very good at face recognition. However, CNNs are pretty close to human performance for face detection.
> Only after you apply tons of handcrafted optimizations, which are mostly black art, will you get close to or surpass a human's capability. Without much domain specific tuning, an AI's insight is far from reliable.
This just isn't the case. Take the GoogLeNet or VGGNet papers, build the CNN as described using Caffe/whatever, train as described in the paper and you'll end up with something that is pretty much on par with human performance for categorizing ImageNet images.
Take that same CNN architecture, and retrain it for another domain and it will perform roughly as well there too, for the task of categorizing into ~1K-10K image classes.
This isn't domain specific tuning. It's domain specific training, which is very different (although collecting the data is a big job).
> Only after you apply tons of handcrafted optimizations, which are mostly black art, will you get close to or surpass a human's capability.
For CNNs, this is pretty much entirely false.
Those parameters are exactly the type of handcrafted optimizations I am talking about. You cannot just fill in arbitrary numbers and expect the network to fare well. In fact, you cannot even expect it to converge.
You can take those papers and build a world class classifier only because someone else has taken all the time to optimize for the specific case. Once you switch the task, the result will be OK, but nowhere close to what a human or a true AI would give you. Not until you take the time to optimize the parameters.
The state of the art I've read about* (deep CNNs) in recent years relies more on generalized tricks like augmenting the training data (artificially inflating the data set), pre-training and fine-tuning, ReLU, regularization methods like dropout, etc.
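For anyone who hasn't seen these tricks, here is a toy sketch of three of them in plain Python: a horizontal flip as data augmentation, ReLU, and (inverted) dropout. It works on nested lists rather than real tensors and is only meant to show the idea; an actual framework does this on whole batches, and the function names here are my own.

```python
import random


def flip_horizontal(image):
    """Augmentation: mirror each row, giving a second training example."""
    return [list(reversed(row)) for row in image]


def relu(values):
    """ReLU activation: negative inputs become zero."""
    return [max(0.0, v) for v in values]


def dropout(values, rate, rng):
    """Inverted dropout: zero each unit with probability `rate` and
    rescale the survivors so the expected activation stays the same."""
    keep = 1.0 - rate
    return [v / keep if rng.random() >= rate else 0.0 for v in values]


image = [[1, 2, 3], [4, 5, 6]]
augmented = [image, flip_horizontal(image)]  # data set doubled in size

rng = random.Random(0)  # fixed seed so runs are repeatable
activations = relu([-1.0, 0.0, 2.0])
print(dropout(activations, rate=0.5, rng=rng))
```

None of these needs per-task hand tuning of the architecture, though the dropout rate is still a number someone has to pick.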
For anyone interested, here [1] are some benchmarks.
* Late night here, but often in the vein of this [0] work.
[0]: https://www.cs.toronto.edu/~ranzato/publications/taigman_cvp...
[0] http://www.nbcnews.com/id/34482178/ns/health-skin_and_beauty...
Feed it an initial picture (noise, clouds, a selfie) and then backwards manipulate the input to maximize the assessed quality of the "selfie".
I guess that would look pretty funny.
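A minimal sketch of the idea, with a toy logistic scorer standing in for the ConvNet (the weights are invented, and a real version would backpropagate through the trained network to get the gradient with respect to the pixels):

```python
import math


def score(pixels, weights, bias):
    """Toy stand-in for the network: a logistic score in (0, 1)."""
    z = sum(w * p for w, p in zip(weights, pixels)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def ascend_input(pixels, weights, bias, steps=50, lr=0.5):
    """Gradient ascent on the input: the model stays fixed and the
    pixels are nudged in the direction that raises the score."""
    x = list(pixels)
    for _ in range(steps):
        s = score(x, weights, bias)
        # d(score)/d(x_i) = s * (1 - s) * w_i for a logistic scorer.
        x = [
            min(1.0, max(0.0, xi + lr * s * (1.0 - s) * wi))  # keep pixels in [0, 1]
            for xi, wi in zip(x, weights)
        ]
    return x


weights = [1.5, -2.0, 0.5, -0.5]  # hypothetical "learned" weights
start = [0.5, 0.5, 0.5, 0.5]      # flat gray starting "image"
result = ascend_input(start, weights, bias=0.0)
print(score(start, weights, 0.0), score(result, weights, 0.0))
```

Pixels with positive weights get brighter and those with negative weights get darker, so the output is whatever the scorer likes best rather than anything that has to look like a selfie.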
https://en.wikipedia.org/wiki/Rule_of_thirds
(Our deep-learning framework http://deeplearning4j.org missed his list, but it's got working convnets, too.)
[Edit: Even better, he didn't use click data to train the model, just public likes.]
You can see it optimized the last selfie by cropping the face fully out of the picture.. :))