Perhaps this is just my ignorance talking, but it can't be that many permutations, can it? Somebody linked to https://en.wikipedia.org/wiki/Zalgo_text, which I doubt anyone would seriously want to support. There are, what, maybe 3-4 marks typically added to a char in the most complex cases, mostly on vowels, as in Vietnamese. With the roughly 4 billion code points a 32-bit value allows, that seems doable. We could just throw in all permutations, regardless of past utility, to accommodate future expansions of acceptable marks. Chinese has, what, 10K chars? It doesn't seem like a big deal for Latin-based chars to have a similar set size when accounting for all mark variations.
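For anyone who wants to check the arithmetic, here is a rough sketch of the count. The inputs are my assumptions, not measurements: 52 base letters (A-Z plus a-z), 112 marks (the size of the Combining Diacritical Marks block, U+0300..U+036F), and at most 4 distinct marks per char. The helper name count_variants is made up for illustration. It compares the total against a full 32-bit space and against the range Unicode actually defines today (U+0000..U+10FFFF), since those are very different budgets.

```python
import math

UNICODE_CODE_SPACE = 0x110000   # code points U+0000..U+10FFFF as defined today
FULL_32_BIT_SPACE = 2 ** 32     # the "4 billion" budget assumed in the comment

def count_variants(base_letters, marks, max_marks, ordered=False):
    """Count precomposed forms: each base letter with 0..max_marks distinct marks.

    ordered=False treats a set of marks as one form regardless of order;
    ordered=True counts every ordering separately (true permutations).
    """
    pick = math.perm if ordered else math.comb
    per_letter = sum(pick(marks, k) for k in range(max_marks + 1))
    return base_letters * per_letter

# Hypothetical inputs (see the assumptions stated above the code).
for ordered in (False, True):
    total = count_variants(52, 112, 4, ordered=ordered)
    print(f"ordered={ordered}: {total:,} forms, "
          f"fits in 32 bits: {total <= FULL_32_BIT_SPACE}, "
          f"fits in today's Unicode range: {total <= UNICODE_CODE_SPACE}")
```

The answer swings a lot depending on whether mark order counts and how many marks you allow, so it's worth plugging in your own numbers before deciding it's doable.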
>but means all of your documents require substantially more storage than they used to.
Good point! But that comes down to a trade-off analysis between design simplicity and storage space. High 32-bit code point values are meant to be used, not shied away from.
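To put a number on the space side of that trade-off, here is a small sketch using only Python's standard unicodedata module. It compares the UTF-8 size of one Vietnamese letter in precomposed form (NFC) and in decomposed form (NFD, base letter plus separate marks). The helper name utf8_sizes is hypothetical, and a single character is obviously not a benchmark. Python only accepts code points up to U+10FFFF, so this can't try values from a true 32-bit space; it just shows that the highest code point available today costs 4 bytes in UTF-8.

```python
import unicodedata

def utf8_sizes(text):
    """Return (NFC byte count, NFD byte count) for text encoded as UTF-8."""
    nfc = unicodedata.normalize("NFC", text)
    nfd = unicodedata.normalize("NFD", text)
    return len(nfc.encode("utf-8")), len(nfd.encode("utf-8"))

# U+1EC7: Latin small letter e with circumflex and dot below (used in Vietnamese).
sample = "\u1ec7"
composed, decomposed = utf8_sizes(sample)
print(f"precomposed: {composed} bytes, decomposed: {decomposed} bytes")

# Cost of the highest code point Unicode currently allows.
print(f"U+10FFFF in UTF-8: {len(chr(0x10FFFF).encode('utf-8'))} bytes")
```

So for heavily marked text, a precomposed char in a high range can still come out smaller than a base letter plus several combining marks; for plain ASCII-range text there is nothing to gain either way.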