DALL-E 2 is different from what I've seen before. The things it produces actually make sense the majority of the time. Its outputs are strikingly similar to what a competent human might produce, as opposed to one with a severe mental illness.
I'm sure part of this is an inherent advantage that DALL-E enjoys regarding context. Art is supposed to be artistic, whereas text is expected to maintain long-distance logical consistency of abstract concepts across a stream of output and also to communicate something concrete. So in a sense, the bar for art is probably lower.