The idea is to filter out bias-introducing information with low relevance (like gender, appearance, or accent) and focus on the actual events that took place.
Otherwise the court starts to include elements of theatrics, and objective truth starts to give way to how litigants present their case, such as whether they speak confidently or stutter nervously. While this can be relevant information (e.g. someone who refuses to make eye contact could be lying), there is a multitude of ways it can be deceiving (e.g. that same person could simply find eye contact uncomfortable, as people with anxiety disorders often do).
Presenting both litigants through a VTuber-like interface that re-synthesizes voices, adjusts some patterns of speech (like replacing names with placeholders, or making language gender-neutral), reduces non-verbal signalling, and gives both parties neutral appearances feels like something that could make litigants, judges, and juries all focus on the abstract ideas of what took place, potentially allowing for a clearer and more neutral judgement.
But - of course - it's also perfectly possible that it would fail in some way I haven't foreseen.