You might as well write a tool that uses OCR to extract strings from the video signal and translates them. That would make the solution more universal, and you could even use it to, e.g., suppress ads.
I'm not well-versed in the subject: how does encoding come into play for text displayed on the screen? Did they use a strange way of representing the Japanese text because of technical limitations?