Also it's not only the size of the dataset, it's also the size/variety in the label space. ImageNet is quite comprehensive, with many varied labels. MS COCO is quite biased towards a narrow ~hundred classes.
I'd love to see a properly large dataset of images "from the wild", with no restrictions on content (unlike what is done in MS COCO), annotated with sentences. From my experience with adding data to models in these situations I'm quite certain this would work _significantly_ better.