I feel like you're conflating quality with fidelity. Video generation models have better fidelity than they did a year ago, but they are no closer to producing any kind of compelling content without a human directing them, and the latter is what you would actually need to make the "infinite entertainment machine" happen.
The fidelity of a video generation model is comparable to an LLMs ability to nail spelling and grammar - it's a start, but there's more to being an author than that.