I would actually say that humans have an enormous extra set of data that we, as people, are "trained" on. We walk around in our daily lives, seeing things constantly, and that influences our perception of art. Art is always a product of the broader context it was made in (social, environmental, etc.). Something that gets accepted or praised today might very well not have been 200 years ago.
One of the interesting things about these new big models is that they dramatically broaden the context in use. The models learn both the textual representation of a concept and its artistic/visual representation, as well as the relationship between the two domains.