Greg has specifically said it's not an SSML-parsing text model; he's said it's an end-to-end multimodal model.
FWIW, I would find it very surprising if you could get the low latency expressiveness, singing, harmonizing, sarcasm and interpretation of incoming voice through SSML -- that would be a couple orders of magnitude better than any SSML product I've seen.