The problem is there's no single "canonical" binary encoding of a Unicode string. You can create the exact same character in many different ways, from a single scalar value to a composition of multiple scalar values. There are also multiple orderings of the pieces of a composite character that yield the same logical value. Wouldn't we want to use a consistent decomposition and a consistent logical segmentation for the string? It's no longer enough to iterate over a string one byte at a time and derive meaning. Is it right that the Levenshtein distance between two equal characters, "a" and "a", might be 12, simply because the joiners were ordered differently?
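To make this concrete, here's a minimal sketch using Python's stdlib `unicodedata` showing that "é" can be either one scalar value or a base letter plus a combining mark, with different UTF-8 byte lengths, and that normalizing (NFC here) makes them compare equal:

```python
import unicodedata

composed = "\u00e9"     # é as a single scalar value (U+00E9)
decomposed = "e\u0301"  # e followed by a combining acute accent (U+0301)

# Same logical character, but different code point sequences...
print(composed == decomposed)      # False
# ...and different UTF-8 byte encodings (2 bytes vs 3 bytes).
print(composed.encode("utf-8"))    # b'\xc3\xa9'
print(decomposed.encode("utf-8"))  # b'e\xcc\x81'

# Normalizing both to the same form makes them compare equal.
print(unicodedata.normalize("NFC", composed) ==
      unicodedata.normalize("NFC", decomposed))  # True
```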
It seems like segmentation by grapheme cluster and comparison using a consistent normalization would provide the same logical answer as a classic byte-wise Levenshtein distance on an ASCII string. [1]
Or are you suggesting that's too high-level, and that we should just treat this as operating on bit strings that happen to be grouped into bytes, without worrying about the logical implications? In that case we'd just apply a consistent normalization form to both input strings, and accept that the distance for a single-character difference might be 10-15 in a composite character versus 1 in an ASCII character. That sounds totally reasonable too, just different.
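That normalization-only approach might look like this sketch: normalize both inputs to one form, then run a classic Levenshtein over the result (the helper names are mine, not from any library). Note the byte-wise distance between the two spellings of "café" is nonzero, while the normalized distance is 0:

```python
import unicodedata

def levenshtein(a, b):
    """Classic dynamic-programming edit distance over the elements of a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def normalized_distance(a, b, form="NFC"):
    """Levenshtein over code points after normalizing both inputs."""
    return levenshtein(unicodedata.normalize(form, a),
                       unicodedata.normalize(form, b))

composed = "caf\u00e9"     # café with precomposed é
decomposed = "cafe\u0301"  # café with e + combining accent

print(levenshtein(composed.encode("utf-8"),
                  decomposed.encode("utf-8")))  # 3: raw bytes disagree
print(normalized_distance(composed, decomposed))  # 0: same string after NFC
```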
[1] http://unicode.org/reports/tr15/