Are there any data or benchmarks available that show what kind of text content LLMs understand best? Is it generally understood at this point that they "understand" markdown better than html?
My hunch is that HTML's explicit matching closing tags make it a bit easier for an LLM to understand structure, whereas markdown leans heavily on line breaks. That works great when you’re viewing the text as a two-dimensional field of pixels, but that’s not how LLMs see the world.
But I think the difference is fairly marginal, and my hunch should be taken with a grain of salt. From experience, all I can say is that I’ve seen stripped-down HTML work fine, and I’ve seen markdown work fine. The one place where markdown clearly shines is that it tends to use fewer tokens.
It's clear enough that they can use and consume markdown, but is the suggestion here that they've seen more markdown than xml?
I'd have guessed, possibly naively, that they fed in more straight html, but I'd be interested to know why that's unlikely to be the case.
A trick I’ve found that seems to work well is leaving some kind of id or coordinate marker on each column, and adding that to each cell. You could probably do that while still having valid markdown if you put the metadata in HTML comments, although it’s hard to say how an LLM will do at understanding that format.
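Here is a rough sketch of what I mean, in Python. The `c0`, `c1`, ... ids and the function name `tag_table_cells` are made up for illustration, and I have not measured how well any particular model reads this format.

```python
def tag_table_cells(headers, rows):
    """Render a Markdown table where every cell carries its column id
    inside an HTML comment, so the id travels with the cell."""
    # Hypothetical id scheme: c0, c1, ... one id per column.
    col_ids = [f"c{i}" for i in range(len(headers))]

    def render(cells):
        tagged = (f"<!-- {cid} --> {cell}" for cid, cell in zip(col_ids, cells))
        return "| " + " | ".join(tagged) + " |"

    separator = "| " + " | ".join("---" for _ in headers) + " |"
    return "\n".join([render(headers), separator, *(render(r) for r in rows)])


print(tag_table_cells(["Name", "Age"], [["Ada", "36"]]))
```

The comments are invisible when the table is rendered, but they stay in the raw text the model sees, so a cell can be matched to its column without counting pipes.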
ScrollSets are basically "deconstructed CSVs".
Also, when trying to run the node example from your readme, I had to use `new dom.window.DOMParser()` instead of `dom.window.DOMParser`.
[0]: https://val.town
Thanks for the bug report!
Any specific guidelines for contributing? I see that you're open to contributions.
One request: It would be great if you also had an option for showing the page's schema (which is contained inside the HTML).
Would be really helpful if you opened an issue on GitHub with a specific example, happy to look into that!
And if you pass the optional flag `extractMainContent`, it will use some heuristics to find the main content container if there is no such tag.
A few thoughts:
1. URL Refification[sic] would only save tokens if a link is referred to many times, right? Otherwise it seems best to keep locality of reference. Though to the degree that URLs are opaque to the LLM, I suppose they could be turned into references without any destination in the source at all, and if the LLM refers to a ref link you just look up the real link in the mapping.
2. Several of the suggestions here could be alternate serializations of the AST, but it's not clear to me how abstract the AST is (especially since it's labelled as htmlToMarkdownAST). And now that I look at the source it's kind of abstract but not entirely: https://github.com/romansky/dom-to-semantic-markdown/blob/ma... – when writing code like this I find that keeping the AST fairly abstract also helps with the implementation. (That said, you'll probably still be making something that is Markdown-ish because you'll be preserving only the data Markdown is able to represent.)
3. With a more formal AST you could replace the big switch in https://github.com/romansky/dom-to-semantic-markdown/blob/ma... with a class that can be subclassed to override how particular nodes are serialized.
4. But I can also imagine something where there's a node type like "markdown-literal" and to change the serialization someone could, say, go through and find all the type:"table" nodes and translate them into type:"markdown-literal" and then serialize the result.
5. A more advanced parsing might also turn things like headers into sections, and introduce more of a tree of nodes (I think the AST is flat currently?). I think it's likely that an LLM would follow `<header-name-slug>...</header-name-slug>` better than `# Header Name\n ....` (at least sometimes, as an option).
6. Even fancier: run it with some full renderer (not sure what the options are these days) and start using getComputedStyle() and heuristics based on bounding boxes and stuff like that to infer even more structure.
7. Another use case that could be useful is to be able to "name" pieces of the document so the LLM can refer to them. The result doesn't have to be valid Markdown, really, just a unique identifier put in the right position. (In a sense this is what URL reification can do, but only for URLs?)
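To make point 5 concrete, here is a small Python sketch of turning headings into nested, named sections. It works on Markdown text rather than on the library's AST, and `slugify` and `sectionize` are hypothetical names, not anything from the project. It ignores edge cases such as `#` lines inside code fences or titles that start with a digit.

```python
import re


def slugify(title):
    """Lowercase the title and collapse non-alphanumeric runs into hyphens."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return slug or "section"


def sectionize(markdown):
    """Replace each '#'-style heading with an opening <slug> tag and close it
    when a heading of the same or a shallower level starts (or at the end)."""
    out = []
    stack = []  # (heading level, slug) for every section still open
    for line in markdown.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*\S)\s*$", line)
        if not match:
            out.append(line)
            continue
        level = len(match.group(1))
        # Close every open section that is at this level or deeper.
        while stack and stack[-1][0] >= level:
            out.append(f"</{stack.pop()[1]}>")
        slug = slugify(match.group(2))
        stack.append((level, slug))
        out.append(f"<{slug}>")
    while stack:
        out.append(f"</{stack.pop()[1]}>")
    return "\n".join(out)


print(sectionize("# Intro\nhello\n## Details\nmore\n# Next\nbye"))
```

Whether the tagged form is actually easier for a model to follow than `# Header Name` is still just my guess, so this would make most sense as an optional serialization.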
1. There are some crazy links with lots of arguments and tracking stuff in them, so they get very long. The refification turns them into a numbered "ref[n]" scheme, where you also get a map of ref[n]->url to do the reverse translation. It really saves a lot, in my experience. It's also optional, so you can be mindful about when you want to use this feature.
2. I tried to keep it domain specific (not to reinvent HTML...) so mostly Markdown components and some flexibility to add HTML elements (img, footer etc).
3. Not sure I'm sold on replacing the switch, it's very useful there because of the many fall-through cases. I find it maintainable, but if you point me to some specific issue there it would help.
4. There are some built-in functions to traverse and modify the AST. It is just JSON at the end of the day, so you could leverage the types and write your own logic to parse it. As long as it conforms to the format you can always serialize it, as you mentioned.
5. The AST is recursive, so not flat. Sounds like you want to either write your own AST->Semantic-Markdown implementation or plug into the existing one, so I'll keep this in mind in the future.
6. Sounds cool but out of scope at the moment :)
7. This feature would serve to help with scraping and kind of point the LLM to some element? Then the part I'm missing is how you would code this in advance. There could be some metadata tag you could add, and it would be taken through the pipeline and added on the other side to the generated elements in some way.
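For anyone curious what the ref[n] idea from point 1 looks like in practice, here is a minimal Python sketch. It is not the library's implementation: the regex, the function names `refify` and `derefify`, and the exact label format are assumptions made for illustration.

```python
import re

# Simplified URL pattern (an assumption): stops at whitespace and at the
# closing bracket characters that usually end a Markdown link target.
URL_RE = re.compile(r"https?://[^\s)>\]]+")


def refify(text):
    """Replace each distinct URL with a short ref[n] label.

    Returns the compacted text and a label -> URL mapping."""
    labels = {}  # url -> label, so a repeated URL reuses its label

    def replace(match):
        url = match.group(0)
        if url not in labels:
            labels[url] = f"ref[{len(labels)}]"
        return labels[url]

    compact = URL_RE.sub(replace, text)
    mapping = {label: url for url, label in labels.items()}
    return compact, mapping


def derefify(text, mapping):
    """Expand ref[n] labels (for example in an LLM reply) back into URLs."""
    return re.sub(r"ref\[\d+\]", lambda m: mapping.get(m.group(0), m.group(0)), text)


page = "See [docs](https://example.com/a?utm_source=x&id=1) or https://example.com/b"
compact, mapping = refify(page)
print(compact)
print(derefify(compact, mapping) == page)
```

The saving comes from the long query strings never reaching the model, even when each link only appears once. The simple regex will swallow trailing punctuation after a bare URL, so a real version would need a stricter pattern.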
I've found that LLMs are often bad at understanding "standard" markdown tables (of which there are many different types, see: https://pandoc.org/chunkedhtml-demo/8.9-tables.html). In our case, we found the best results when keeping the HTML tables in HTML, only converting the non-table parts of the HTML doc to markdown. We also strip the table tags of any attributes, with the exception of colspan and rowspan, which are semantically meaningful for more complicated HTML tables. I'd be curious if there are LLM performance differences between the approach the author uses here (seems like it's based on repeating column names for each cell?) and just preserving the original HTML table structure.
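The attribute stripping is simple enough to sketch with Python's standard library `html.parser`. This is an illustration of the idea rather than our production code, and the names `TableAttrStripper` and `strip_table_attrs` are made up here. It drops comments and writes self-closing tags as an open/close pair, which is fine for table markup.

```python
from html import escape
from html.parser import HTMLParser

# The only attributes worth keeping: they change the table's meaning.
KEEP_ATTRS = {"colspan", "rowspan"}


class TableAttrStripper(HTMLParser):
    """Re-emit HTML with every attribute removed except colspan/rowspan."""

    def __init__(self):
        # convert_charrefs=False so entities such as &amp; pass through as-is.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        kept = "".join(
            f' {name}="{escape(value or "", quote=True)}"'
            for name, value in attrs
            if name in KEEP_ATTRS
        )
        self.out.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def strip_table_attrs(table_html):
    parser = TableAttrStripper()
    parser.feed(table_html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.out)


print(strip_table_attrs(
    '<table class="x"><tr><td colspan="2" style="color:red">A &amp; B</td></tr></table>'
))
```

Stripping classes and inline styles usually shrinks the table a lot, which offsets some of the token cost of keeping it as HTML.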
The quick solution was to use Beautiful Soup and readability-lxml to try to get the main article contents and then send them to an LLM.
The results are ok when the markup is semantic. Often it is not. Then you have tables, images, weirdly positioned footnotes, etc.
I believe the best way to extract information the way it was intended to be presented is to screenshot the page and send it to a multimodal LLM for “interpretation”. Has anyone experimented with that approach?
——
The aspirational goal for the tool is to be the Presidential Daily Brief, but for everyone.