Show HN: Convert HTML DOM to semantic markdown for use in LLMs

(github.com)

146 points · leroman · 1y ago · 56 comments

51 comments · 19 top-level

gmaster1440 · 1y ago · 8 in thread

> Semantic Clarity: Converts web content to a format more easily "understandable" for LLMs, enhancing their processing and reasoning capabilities.

Are there any data or benchmarks available that show what kind of text content LLMs understand best? Is it generally understood at this point that they "understand" markdown better than html?

mistercow · 1y ago

I haven’t found any specific research, but I suspect it’s actually the opposite, particularly for models like Claude, which seem to have been specifically trained on XML-like structures.

My hunch is that the fact that HTML has explicit matching closing tags makes it a bit easier for an LLM to understand structure, whereas markdown tends to lean heavily on line breaks. That works great when you’re viewing the text as a two dimensional field of pixels, but that’s not how LLMs see the world.

But I think the difference is fairly marginal, and my hunch should be taken with a grain of salt. From experience, all I can say is that I’ve seen stripped down HTML work fine, and I’ve seen markdown work fine. The one place where markdown clearly shines is that it tends to use fewer tokens.

leroman (OP) · 1y ago

Author here. Benchmarks are a good point (I don't have any), but I think it's well understood that minimizing noise by reducing tokens improves the quality of the answer. And I think by now LLMs are well versed in Markdown, as it's the preferred markup language they use when generating responses.

pseudosavant · 1y ago

My anecdotal experience is that Markdown usually does work better than HTML. I only leave it as HTML if the LLM needs to understand more about it than just the content, like the attributes on the elements (which would typically be a lot of noise, excess token input). I've found this to be especially true when using AI/LLMs in RAG scenarios.

mistercow · 1y ago

I wouldn’t be so sure on reducing tokens. Every token in context is space for the LLM to do more computation. Noise is obviously bad, because the computations will be irrelevant, but as long as your HTML is cleaned up, the extra tokens aren’t noise, but information about semantic structure.

leroman (OP) · 1y ago

Markdown, being a very minimal markup language, has no need for much of the structural and presentational stuff (CSS, structural HTML). HTML has many, many artifacts that are huge bloat and give no semantic value, IMO. The goal here is to capture any markup with semantic value. If you have examples this library might miss, you are welcome to share them and I will look into it!

sigmoid10 · 1y ago

They understand best whatever was used during their training. For OpenAI's GPTs we don't really know since they don't disclose anything anymore, but there are good reasons to assume they used markdown or something closely related.

jddj · 1y ago

Just out of curiosity, what are some of those good reasons?

It's clear enough that they can use and consume markdown, but is the suggestion here that they've seen more markdown than xml?

I'd have guessed, possibly naively, that they fed in more straight HTML, but I'd be interested to know why that's unlikely to be the case.

sigmoid10 · 1y ago

Well, for one, their chat markup language (i.e. what they used for chat/instruction tuning). But they closed the source on that last year, so we don't know what it looks like anymore. I doubt it changed much though. Also, when you work with their models a lot for e.g. document processing, you'll find that markdown tends to work better in the context than, say, html. I've heard similar observations from people at other companies.

mistercow · 1y ago · 6 in thread

This is cool. When dealing with tables, you might want to explore departing from markdown. I’ve found that LLMs tend to struggle with tables that have large numbers of columns containing similar data types. Correlating a row is easy enough, because the data is all together, but connecting a cell back to its column becomes a counting task, which appears to be pretty rough.

A trick I’ve found seems to work well is leaving some kind of id or coordinate marker on each column, and adding that to each cell. You could probably do that while still having valid markdown if you put the metadata in HTML comments, although it’s hard to say how an LLM will do at understanding that format.
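A minimal TypeScript sketch of the column-marker trick described above. The `c1`, `c2`, ... id scheme and the choice to put the marker in an HTML comment inside each cell are assumptions for illustration, and rows are assumed to have the same length as the header. Whether a given LLM actually reads this format better is not tested here.

```typescript
// Render a table as markdown, tagging every cell with its column id inside an
// HTML comment so the result is still a valid markdown table.
// The "c1", "c2", ... naming is made up for this sketch.
function tableWithColumnMarkers(headers: string[], rows: string[][]): string {
  const ids = headers.map((_, i) => `c${i + 1}`);
  // Prefix each cell with the id of the column it belongs to.
  const line = (cells: string[]): string =>
    "| " + cells.map((cell, i) => `<!-- ${ids[i]} --> ${cell}`).join(" | ") + " |";
  const separator = "| " + headers.map(() => "---").join(" | ") + " |";
  return [line(headers), separator].concat(rows.map(line)).join("\n");
}

const markedTable = tableWithColumnMarkers(
  ["Q1", "Q2"],
  [["10", "12"], ["9", "14"]],
);
console.log(markedTable);
```

With this layout a cell such as `<!-- c2 --> 14` can be tied back to its column by matching the id instead of counting pipes.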

michaelmior · 1y ago

SpreadsheetLLM[0] might be worth looking into. It's designed for Excel (and similar) spreadsheets, so I'd imagine you could do something far simpler for the majority of HTML tables.

[0] https://arxiv.org/abs/2407.09025v1

leroman (OP) · 1y ago

This is now supported, see here: https://github.com/romansky/dom-to-semantic-markdown?tab=rea...

msnkarthik · 1y ago

You're spot on about the challenges LLMs face with complex markdown tables, especially when column counts rise and data types are similar. The "counting task" for column correlation is a real pain point: it's like the LLM loses track of where it is in the data grid.

Your ID/coordinate marker idea is clever! It provides explicit context that LLMs seem to crave. Using HTML comments for this metadata is an interesting approach. It keeps the markdown valid for human readability, but I share your uncertainty about how consistently LLMs would parse and utilize it.

Some other avenues worth exploring:

1. Alternative formats: Have you experimented with formats like CSV or JSON for feeding tabular data to LLMs? They might offer a more structured representation that's easier to parse.

2. Pre-processing: Could we pre-process the table to create a more LLM-friendly representation? For example, converting it into a list of dictionaries, where each dictionary represents a row and keys represent column names.

3. Prompt engineering: Perhaps there are specific prompts or instructions that can guide LLMs to better handle large tables within markdown.

It seems like there's room for innovation in how we bridge the gap between human-readable markdown tables and the structured data LLMs thrive on.
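A small TypeScript sketch of the "list of dictionaries" pre-processing idea mentioned above. The function name `rowsToRecords` and the sample data are invented for illustration; it only shows the reshaping step, not how any particular model responds to it.

```typescript
// Turn a header row plus data rows into one object per row, keyed by column
// name, so every value carries its column label with it.
function rowsToRecords(headers: string[], rows: string[][]): Record<string, string>[] {
  return rows.map((row) => {
    const record: Record<string, string> = {};
    headers.forEach((header, i) => {
      // Cells missing from a short row become empty strings.
      record[header] = i < row.length ? row[i] : "";
    });
    return record;
  });
}

const records = rowsToRecords(
  ["name", "price"],
  [["apple", "1.20"], ["pear", "0.90"]],
);
console.log(JSON.stringify(records, null, 2));
```

The trade-off is token count: repeating the column names in every row costs more tokens than a markdown table, in exchange for not having to count columns.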

mattding · 1y ago

Do you have any numbers re: markdown performance, or is this anecdotal? I'm running a similar experiment right now and would love to hear anything else you've tried.

breck · 1y ago

ScrollSets work really well for using LLMs to generate tables: https://sets.scroll.pub/

ScrollSets are basically "deconstructed CSVs".

leroman (OP) · 1y ago

Thanks for sharing, will look into adding this as a flag in the options!

throwthrowuknow · 1y ago · 4 in thread

Thank you! I’m always looking for new options to use for archiving and ingesting web pages and this looks great! Even better that it’s an npm package!

leroman (OP) · 1y ago

You might find this useful: I just added code and instructions on how to make it a global CLI utility. https://github.com/romansky/dom-to-semantic-markdown/blob/ma...

jejeyyy77 · 1y ago

hah, out of curiosity, what are you archiving and ingesting webpages for?

throwthrowuknow · 1y ago

Mostly for integration with my Obsidian vault so I don’t have to leave the app and can add notes and links and avoid linkrot.

zaSmilingIdiot · 1y ago

For personal use, and on the topic of Obsidian, I rolled my own form of this... It's quick and dirty, but generally works for my use case. I tend to push a page through turndown [0] to generate the markdown, then write this into Obsidian (also storing things like a copy of the rendered page, a link to the source, etc.).

[0] https://github.com/mixmark-io/turndown

nbbaier · 1y ago · 3 in thread

This is really cool! Any plans to add Deno support? This would be a great fit for environments like val.town[0], but they are based on a Deno runtime and I don't think this will work out of the box.

Also, when trying to run the node example from your readme, I had to use `new dom.window.DOMParser()` instead of `dom.window.DOMParser`

[0]: https://val.town

leroman (OP) · 1y ago

Afraid to say that other than bumping into a talk about Deno, I haven’t played around with it yet. So thanks for the heads up, will look into it.

Thanks for the bug report!

nbbaier · 1y ago

Happy to also take a swing at it, but it would take me a bit because I've never added such compatibility to a library before.

Any specific guidelines for contributing? I see that you're open to contributions.

leroman (OP) · 1y ago

By all means, you can be the first contributor :) You are welcome to either open an issue and brainstorm together on possible approaches, or send me a pull request with what you came up with and we can start from there.

richardreeze · 1y ago · 2 in thread

This is really cool. I've already implemented it in one of my tools (I found it to work better than the Turndown/Readability combination I was previously using).

One request: It would be great if you also had an option for showing the page's schema (which is contained inside the HTML).

leroman (OP) · 1y ago

Thanks for sharing!!

It would be really helpful if you opened an issue on GitHub with a specific example, happy to look into that!

richardreeze · 1y ago

Done!

https://github.com/romansky/dom-to-semantic-markdown/issues/...

Zetaphor · 1y ago · 2 in thread

A browser demo would be a nice addition to this readme

leroman (OP) · 1y ago

Please see here: https://github.com/romansky/dom-to-semantic-markdown/blob/ma...

leroman (OP) · 1y ago

Ah, I suppose you mean a web page one could visit to see a demo :) Added to the backlog!

DevX101 · 1y ago · 2 in thread

Problem is, with modern websites, everything is a div and you can't necessarily infer semantic meaning from the DOM elements.

leroman (OP) · 1y ago

After removing the noise you can distill the semantic stuff wherever possible, like metadata from images, buttons, etc., and see some structures emerge, like footers, nav, and body. And many times, for the sake of SEO and accessibility, websites do adopt quite a bit of semantic HTML elements and annotations in the respective tags.

goatlover · 1y ago

What happened to using the semantic elements? Did that fall out of favor, or did the push for it get abandoned because popular frameworks just generate divs with semantic classes (hopefully)?

la_fayette · 1y ago · 1 in thread

The scoring approach seems like an interesting way to extract the main content of web pages. I am aware of the decades of research on that subject, including sophisticated image- or NLP-based approaches. Since this extraction is critical to the quality of the LLM response, it would be good to know how well it performs. E.g., you could test it against a benchmark dataset (https://github.com/scrapinghub/article-extraction-benchmark). Also, you could provide an option to plug in another extraction algorithm, since there are other implementations available... just some ideas for improvement.

leroman (OP) · 1y ago

This totally makes sense, I will look into adding support for additional ways to detect the main content, super interesting!

gradientDissent · 1y ago · 1 in thread

Nice work. Main content extraction based on the <main> tag won’t work with most of the web pages these days. Arc90 could help.

leroman (OP) · 1y ago

Thank you! This is exactly why there's support for this specific use case: https://github.com/romansky/dom-to-semantic-markdown/blob/ma... (see `findContentByScoring`)

And if you pass the optional flag `extractMainContent`, it will use some heuristics to find the main content container if there is no such tag.
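To give a feel for what scoring-based main content detection can look like, here is a generic link-density sketch in TypeScript. It is not the library's `findContentByScoring`; the `SketchNode` type, the "non-link text" score, and the 0.8 dominance threshold are all assumptions made up for this example.

```typescript
// A tiny stand-in for a DOM node; a real implementation would walk actual elements.
interface SketchNode {
  tag: string;
  text: string; // text directly inside this node
  children: SketchNode[];
}

// Total text length under a node, and the part of it that sits inside links.
function textStats(node: SketchNode, insideLink = false): { total: number; linked: number } {
  const linkedHere = insideLink || node.tag === "a";
  let total = node.text.length;
  let linked = linkedHere ? node.text.length : 0;
  for (const child of node.children) {
    const stats = textStats(child, linkedHere);
    total += stats.total;
    linked += stats.linked;
  }
  return { total, linked };
}

// Score = amount of non-link text. Nav bars and footers are mostly links, so they score low.
function score(node: SketchNode): number {
  const { total, linked } = textStats(node);
  return total - linked;
}

// Drill down from the root while a single child holds most of the parent's score.
function findMainContent(root: SketchNode, dominance = 0.8): SketchNode {
  let current = root;
  let descended = true;
  while (descended) {
    descended = false;
    const currentScore = score(current);
    for (const child of current.children) {
      if (currentScore > 0 && score(child) / currentScore >= dominance) {
        current = child;
        descended = true;
        break;
      }
    }
  }
  return current;
}

const leaf = (tag: string, text: string): SketchNode => ({ tag, text, children: [] });
const page: SketchNode = {
  tag: "body",
  text: "",
  children: [
    { tag: "div", text: "", children: [leaf("a", "Home"), leaf("a", "About"), leaf("a", "Contact")] },
    {
      tag: "div",
      text: "",
      children: [
        leaf("h1", "A headline"),
        leaf("p", "First paragraph with a reasonable amount of running text in it."),
        leaf("p", "Second paragraph, also plain prose rather than a list of links."),
      ],
    },
    { tag: "div", text: "", children: [leaf("a", "Privacy"), leaf("a", "Terms")] },
  ],
};

// Picks the middle div: it holds all of the page's non-link text,
// and none of its own children dominates it.
const mainContent = findMainContent(page);
console.log(mainContent.children.map((child) => child.tag));
```

Real pages need more signals than this (class names, paragraph counts, punctuation density), which is the kind of thing Readability-style algorithms add.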

KolenCh · 1y ago · 1 in thread

I am curious how it would compare to using pandoc with readability algorithm for example.

leroman (OP) · 1y ago

Bumped this together with the side-by-side comparison task.. so will look into it :)

explosion-s · 1y ago · 1 in thread

How is this different than any other HTML to markdown library, like Showdown or Turndown? Are there any specific features that make it better for LLMs specifically, instead of just converting HTML to MD?

leroman (OP) · 1y ago

Will add some side-by-side comparisons soon! The goal is not just to translate HTML to markdown 1:1 but to preserve any semantic information, which is generally not the goal for these tools. Some specific features and examples are in the README, like URL minification and optional main section detection and extraction (ignoring footer / header stuff).

ianbicking · 1y ago · 1 in thread

This is a great idea! There's an exceedingly large amount of junk in a typical HTML page that an LLM can't use in any useful way.

A few thoughts:

1. URL Refification[sic] would only save tokens if a link is referred to many times, right? Otherwise it seems best to keep locality of reference. Though to the degree that URLs are opaque to the LLM, I suppose they could be turned into references without any destination in the source at all, and if the LLM refers to a ref link you just look up the real link in the mapping.

2. Several of the suggestions here could be alternate serializations of the AST, but it's not clear to me how abstract the AST is (especially since it's labelled as htmlToMarkdownAST). And now that I look at the source it's kind of abstract but not entirely: https://github.com/romansky/dom-to-semantic-markdown/blob/ma... – when writing code like this I find keeping the AST fairly abstract also helps with the implementation. (That said, you'll probably still be making something that is Markdown-ish because you'll be preserving only the data Markdown is able to represent.)

3. With a more formal AST you could replace the big switch in https://github.com/romansky/dom-to-semantic-markdown/blob/ma... with a class that can be subclassed to override how particular nodes are serialized.

4. But I can also imagine something where there's a node type like "markdown-literal" and to change the serialization someone could, say, go through and find all the type:"table" nodes and translate them into type:"markdown-literal" and then serialize the result.

5. A more advanced parsing might also turn things like headers into sections, and introduce more of a tree of nodes (I think the AST is flat currently?). I think it's likely that an LLM would follow `<header-name-slug>...</header-name-slug>` better than `# Header Name\n ....` (at least sometimes, as an option).

6. Even fancier would be running it with some full renderer (not sure what the options are these days) and starting to use getComputedStyle() and heuristics based on bounding boxes and stuff like that to infer even more structure.

7. Another use case that could be useful is to be able to "name" pieces of the document so the LLM can refer to them. The result doesn't have to be valid Markdown, really, just a unique identifier put in the right position. (In a sense this is what URL reification can do, but only for URLs?)
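Point 4 above can be sketched roughly like this in TypeScript. The `AstNode` shape here is hypothetical and much simpler than the library's real AST types; the CSV-style table rendering is just one arbitrary choice of custom serialization.

```typescript
// Hypothetical node shape for this sketch; the library's real AST types may differ.
type AstNode =
  | { type: "text"; content: string }
  | { type: "table"; rows: string[][] }
  | { type: "markdown-literal"; markdown: string }
  | { type: "section"; children: AstNode[] };

// Custom rendering for one node type: here, tables become CSV-like lines.
function tableToLiteral(rows: string[][]): string {
  return rows.map((row) => row.join(",")).join("\n");
}

// Walk the tree and swap every table node for a pre-rendered literal.
function replaceTables(node: AstNode): AstNode {
  switch (node.type) {
    case "table":
      return { type: "markdown-literal", markdown: tableToLiteral(node.rows) };
    case "section":
      return { type: "section", children: node.children.map((child) => replaceTables(child)) };
    default:
      return node;
  }
}

// The stock serializer only needs one extra case: emit literals verbatim.
function serialize(node: AstNode): string {
  switch (node.type) {
    case "text":
      return node.content;
    case "markdown-literal":
      return node.markdown;
    case "section":
      return node.children.map((child) => serialize(child)).join("\n\n");
    case "table":
      // Default table rendering without a header separator, kept short for the sketch.
      return node.rows.map((row) => "| " + row.join(" | ") + " |").join("\n");
  }
}

const sampleDoc: AstNode = {
  type: "section",
  children: [
    { type: "text", content: "Prices:" },
    { type: "table", rows: [["item", "price"], ["apple", "1.20"]] },
  ],
};
const rendered = serialize(replaceTables(sampleDoc));
console.log(rendered);
```

The appeal of this shape is that callers change serialization by rewriting the tree, without touching or subclassing the serializer itself.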

leroman (OP) · 1y ago

This is some great feedback, thanks!

1. There are some crazy links with lots of arguments and tracking stuff in them, so they get very long. The refification turns them into a numbered "ref[n]" scheme, where you also get a map of ref[n]->url to do reverse translation. It really saves a lot, in my experience. It's also optional, so you can be mindful about when you want to use this feature.

2. I tried to keep it domain specific (not to reinvent HTML...) so mostly Markdown components and some flexibility to add HTML elements (img, footer etc).

3. Not sure I'm sold on replacing the switch, it's very useful there because of the many fall-through cases. I find it maintainable, but if you point me to some specific issue there it would help.

4. There are some built-in functions to traverse and modify the AST. It is just JSON at the end of the day, so you could leverage the types and write your own logic to parse it. As long as it conforms to the format you can always serialize it, as you mentioned.

5. The AST is recursive, so not flat. Sounds like you want to either write your own AST->Semantic-Markdown implementation or plug into the existing one, so I'll keep this in mind in the future.

6. Sounds cool but out of scope at the moment :)

7. This feature would serve to help with scraping and kind of point the LLM to some element? Then the part I'm missing is how you would code this in advance.. There could be some meta-data tag you could add and it would be taken through the pipeline and added on the other side to the generated elements in some way..
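For readers who want to see the ref[n] idea from point 1 in code, here is an illustrative TypeScript sketch. It is not the library's implementation: the function names `refifyLinks` and `restoreLinks`, the `ref0`, `ref1`, ... key format, and the simple inline-link regex are all assumptions for this example.

```typescript
// Replace every inline markdown link target with a short "ref<n>" key and
// return the mapping so refs in the model's answer can be translated back.
// Only handles links of the form [label](url) with no spaces or ")" in the URL.
function refifyLinks(markdown: string): { text: string; refs: Record<string, string> } {
  const urls: string[] = [];
  const text = markdown.replace(/\]\(([^)\s]+)\)/g, (_match: string, url: string) => {
    // Reuse the same ref when a URL appears more than once.
    let index = urls.indexOf(url);
    if (index === -1) {
      urls.push(url);
      index = urls.length - 1;
    }
    return `](ref${index})`;
  });
  const refs: Record<string, string> = {};
  urls.forEach((url, i) => {
    refs[`ref${i}`] = url;
  });
  return { text, refs };
}

// Reverse translation: swap known ref keys back to their URLs.
// Note this also rewrites a bare "ref0" in prose, which a real tool would guard against.
function restoreLinks(text: string, refs: Record<string, string>): string {
  return text.replace(/\bref\d+\b/g, (key: string) =>
    Object.prototype.hasOwnProperty.call(refs, key) ? refs[key] : key,
  );
}

const longUrl = "https://example.com/docs?utm_source=x&utm_campaign=y";
const source = `See [docs](${longUrl}) and [docs again](${longUrl}).`;
const { text: shortText, refs: refMap } = refifyLinks(source);
console.log(shortText);
console.log(restoreLinks(shortText, refMap) === source);
```

The savings come from long, tracking-heavy URLs; for short links the ref key can cost about as many tokens as the URL itself, which is presumably why the feature is optional.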

DeveloperErrata · 1y ago

It's neat to see this getting attention. I've used similar techniques in production RAG systems that query over big collections of HTML docs. In our case the primary motivator was higher token efficiency (i.e., to represent the same semantic content but with a smaller token count).

I've found that LLMs are often bad at understanding "standard" markdown tables (of which there are many different types, see: https://pandoc.org/chunkedhtml-demo/8.9-tables.html). In our case, we found the best results when keeping the HTML tables in HTML, only converting the non-table parts of the HTML doc to markdown. We also strip the table tags of any attributes, with the exception of colspan and rowspan, which are semantically meaningful for more complicated HTML tables. I'd be curious if there are LLM performance differences between the approach the author uses here (seems like it's based on repeating column names for each cell?) and just preserving the original HTML table structure.
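A rough TypeScript sketch of the attribute stripping described above, keeping only `colspan` and `rowspan` on table-related tags. It is regex-based and only an approximation (quoted attribute values only, no `>` inside values); the function name `stripTableAttributes` is invented here, and a production pipeline would more likely do this on a parsed DOM.

```typescript
// Attributes worth keeping because they change the table's structure.
const KEPT_ATTRIBUTES = ["colspan", "rowspan"];

function stripTableAttributes(html: string): string {
  // Match opening tags of table-related elements together with their attribute text.
  return html.replace(
    /<(table|thead|tbody|tfoot|tr|th|td|caption|colgroup|col)\b([^>]*)>/gi,
    (_match: string, tag: string, attrText: string) => {
      const kept: string[] = [];
      // Only name="value" / name='value' pairs are recognized in this sketch.
      const attrPattern = /([a-zA-Z-]+)\s*=\s*("[^"]*"|'[^']*')/g;
      let attr: RegExpExecArray | null;
      while ((attr = attrPattern.exec(attrText)) !== null) {
        if (KEPT_ATTRIBUTES.indexOf(attr[1].toLowerCase()) !== -1) {
          kept.push(`${attr[1]}=${attr[2]}`);
        }
      }
      return `<${tag}${kept.length > 0 ? " " + kept.join(" ") : ""}>`;
    },
  );
}

const noisyTable =
  '<table class="grid" style="width:100%"><tr><th colspan="2" class="h">Totals</th></tr>' +
  '<tr><td data-x="1">a</td><td>b</td></tr></table>';
const cleanTable = stripTableAttributes(noisyTable);
console.log(cleanTable);
```

Closing tags and non-table elements pass through untouched, so this can run over a whole document before the non-table parts are converted to markdown.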

kartoolOz · 1y ago

WebArena does this really well, called the "accessibility_tree" https://github.com/web-arena-x/webarena/blob/main/browser_en...

nvartolomei · 1y ago

While I was writing a tool for myself to summarise daily the top N posts from HN, Google Trends, and RSS feed subscriptions I had the same problem.

The quick solution was to use beautiful soup and readability-lxml to try and get the main article contents and then send it to an LLM.

The results are ok when the markup is semantic. Often it is not. Then you have tables, images, weirdly positioned footnotes, etc.

I believe the best way to extract information the way it was intended to be presented is to screenshot the page and send it to a multimodal LLM for “interpretation”. Anyone experimented with that approach?

——

The aspirational goal for the tool is to be the Presidential Daily Brief, but for everyone.

alexliu518 · 1y ago

Converting web pages to Markdown is a common requirement. I have found that turndown does a good job, but it cannot handle all dynamic web page content. As far as I know, processing dynamic web pages requires targeted adaptation, for example Google extensions such as Web2Markdown.

KolenCh · 1y ago

Has anyone compared the performance of HTML input against other formats? I did an informal comparison, and from a few tests it seems HTML input is better. I thought markdown input would be more efficient too, but I’d like to see a more systematic comparison to see whether that is the case.

brightvegetable · 1y ago

This is great, I was just in need of something like this. Thanks!

Layvier · 1y ago

Nice, we have this exact use case for data extraction from scraped webpages. We've been using html-to-md, how does this compare to it?
