Realtime translation, the art of listening or reading in one language while typing or speaking in another, takes many years to master [citation needed, uneducated guess] and can only begin once someone is already highly skilled and experienced in the languages they're translating between.
Reading subtitles in a language you know well, while trying to associate them with foreign sounds you're hearing in real time, is... theoretically a fun-sounding idea, but IMO it practically requires the skillset that only comes with being an experienced realtime translator; in other words, it requires the realtime analysis capability to already exist. It can't bootstrap that capability, I don't think.
I've tried this once, and immediately wrote it off as something that would be frustrating to seriously attempt, for the reason mentioned above. I might be utterly wrong (I have learning difficulties as a side effect of other issues), so YMMV, grain of salt, etc. But I think that, at the very least, not having subtitles forces the brain to work that much harder to figure out what it's hearing. It also removes the distraction of "hey! look! instantly parseable text right there on the screen!", which the brain has to work incredibly hard to ignore, because that text connects to neural pathways significantly stronger than the fledgling routes being developed for the new language.