They can definitely read HTML, but they do better with more structure. In a sibling comment I proposed, for example, that the "reader mode" feature in browsers might be a great LLM-compatibility feature, since it strips out all the HTML token noise. Another option is exposing an HTTP API with an OpenAPI schema, along with a proper sitemap and an RSS feed. Fetching from an RSS feed, for example, can be exposed to the LLM as a "tool" that it can call.
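
To make the RSS-as-a-tool idea concrete, here is a rough sketch using only the Python standard library. The tool name `fetch_rss`, its parameters, and the JSON-schema-style description are all made up for illustration; real function-calling APIs each wrap this kind of schema in their own format, and the feed is assumed to be plain RSS 2.0.

```python
import json
import urllib.request
import xml.etree.ElementTree as ET

# Hypothetical tool description in the JSON-schema style that many
# function-calling APIs accept; the exact wrapper format varies by provider.
FETCH_RSS_TOOL = {
    "name": "fetch_rss",
    "description": "Fetch an RSS 2.0 feed and return its newest items.",
    "parameters": {
        "type": "object",
        "properties": {
            "url": {"type": "string", "description": "Feed URL"},
            "limit": {"type": "integer", "description": "Max items to return"},
        },
        "required": ["url"],
    },
}


def parse_rss(xml_data, limit=5):
    """Extract title, link and description from RSS 2.0 <item> elements.

    Accepts str or bytes. Missing fields become empty strings.
    """
    root = ET.fromstring(xml_data)
    items = []
    for item in root.iter("item"):
        if len(items) >= limit:
            break
        items.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "description": (item.findtext("description") or "").strip(),
        })
    return items


def fetch_rss(url, limit=5):
    """Tool entry point: download the feed and return compact JSON for the model."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        # Pass raw bytes so the XML parser honours the feed's declared encoding.
        return json.dumps(parse_rss(resp.read(), limit))
```

When the model emits a call to `fetch_rss`, the host application would run the function and hand the returned JSON back as the tool result, so the model sees a few titles, links, and summaries instead of a full HTML page.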