I don't think that argument holds water. Emoji could just as well have been encoded as markup. There were, for instance, long-established conventions of writing emoticons as strings starting with : or ; such as :-) and ;-). Bulletin boards extended that into a convention of letters delimited by colons, for example :rolleyes:. Not to mention that those codes can be typed more efficiently than browsing through an emoji picker.
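To illustrate the markup approach, here is a rough sketch of how a board might turn :name: codes into images at render time. The set of codes, the `smilies` directory and the function name are all made up for the example, and it assumes the text has already been HTML-escaped.

```python
import re

# Hypothetical set of codes; every board defined its own.
KNOWN_CODES = {"rolleyes", "smile", "wink"}

# A code is a run of lowercase letters, digits or underscores between colons.
CODE_RE = re.compile(r":([a-z0-9_]+):")

def render_smilies(text, image_dir="smilies"):
    """Replace known :name: codes with <img> tags; leave everything else alone."""
    def to_img(match):
        name = match.group(1)
        if name not in KNOWN_CODES:
            # Unknown codes (or things like 10:30:45) stay as plain text.
            return match.group(0)
        return f'<img src="{image_dir}/{name}.gif" alt=":{name}:">'
    return CODE_RE.sub(to_img, text)
```

The text itself stays plain ASCII, and the alt attribute keeps the original code readable if the image is missing.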
Because emoji became characters, text rendering and font formats had to be extended to support them.
There are four different ways to encode emoji in OpenType 1.8:
* Apple uses embedded PNG (the sbix table)
* Google uses embedded colour bitmaps (the CBDT and CBLC tables)
* Microsoft uses flat glyphs in different colours layered on top of one another (the COLR and CPAL tables)
* Adobe and Mozilla use embedded SVG (the SVG table)
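To see which of these a given font actually uses, you can look at the table directory at the start of the file. The following is only a sketch: it assumes a single uncompressed sfnt font (not a .ttc collection or WOFF), it only checks that a table is listed without validating it, and the function name is made up for the example.

```python
import struct

# Table tags for the four colour-glyph formats (tags are always 4 bytes,
# so the SVG tag carries a trailing space).
COLOUR_TABLES = {
    b"sbix": "embedded PNG (Apple)",
    b"CBDT": "embedded colour bitmaps (Google)",
    b"COLR": "layered coloured glyphs (Microsoft)",
    b"SVG ": "embedded SVG (Adobe and Mozilla)",
}

def colour_formats(font_bytes):
    """List the colour formats named in the font's table directory."""
    if len(font_bytes) < 12:
        raise ValueError("too short to be an sfnt font")
    # Header: uint32 version, uint16 numTables, then three uint16 search fields.
    num_tables = struct.unpack_from(">H", font_bytes, 4)[0]
    found = []
    for i in range(num_tables):
        # Each 16-byte record is: tag, checksum, offset, length.
        start = 12 + 16 * i
        tag = font_bytes[start:start + 4]
        if tag in COLOUR_TABLES:
            found.append(COLOUR_TABLES[tag])
    return found
```

You would call it as `colour_formats(open(path, "rb").read())`. A font can carry more than one of these tables at once, which is why it returns a list.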