Not knowing basically anything about AI state of the art what stops us from feeding a RNN image data and text data and make it correlate them automatically by context? Just like a child learns words by hearing them many times in similar contexts so could a RNN.
I imagine the biggest problem is gathering and structuring the data. We humans receive lots of data and have lots of time to process it in our lives compared. And by lots I mean difference of a few orders of magnitude. It's amazing what this thing learns in just a few hours of processing.