I agree, but I see this as a client concern. I could imagine clients fetching and inlining images when the user directs them to, or media-focused clients having a text pane alongside a media pane where the images are rendered. The main advantage I see in this approach is that it takes us from "the server decides what the client does" to "the client decides what the client does".