In your cleanup step, after cleaning obvious junk, I think you should do whatever Firefox's reader mode does to further clean up, and if that fails bail out to the current output. That should reduce the number of tokens you send to the LLM even more
You should also have some way for the LLM to indicate there is no useful output because perhaps the page is supposed to be a SPA. This would force you to execute Javascript to render that particular page though