Reader-LM: Small Language Models for Cleaning and Converting HTML to Markdown (opens in new tab)

(jina.ai)

199 pointsmatteogauthier1y ago43 comments

43 comments

38 comments · 19 top-level

sippeangelo1y ago· 5 in thread

For as much as I would love for this to work, I'm not getting great results trying out the 1.5b model in their example notebook on Colab.

It is impressively fast, but testing it on an arxiv.org page (specifically https://arxiv.org/abs/2306.03872) only gives me a short markdown file containing the abstract, the "View PDF" link and the submission history. It completely leaves out the title (!), authors and other links, which are definitely present in the HTML in multiple places!

I'd argue that Arxiv.org is a reasonable example in the age of webapps, so what gives?

faangguyindia1y ago

Question is why even use these small models?

When you've Google Flash which is lightening fast and cheap.

My brother implemented it in option-k : https://github.com/zerocorebeta/Option-K

It's near instant. So why waste time on small models? It's going to cost more than Google flash.

dartos1y ago

Sometimes you don’t want to share all your data with the largest corporations on the planet.

oezi1y ago

What is Google Flash? Do you mean Gemini Flash? If so, then the article talks about that general purpose LLMs are worse than this specialized LLM for Markdown conversion.

1 more reply

randomdata1y ago

Small models often do a much better job when you have a well-defined task.

FL33TW00D1y ago

Privacy, Cost, Latency, Connectivity.

choeger1y ago· 4 in thread

Maybe I am missing something here, but why would you run "AI" on that task when you go from formal language to formal language?

I don't get the usage of "regex/heuristics" either. Why can that task not be completely handled by a classical algorithm?

Is it about the removal of non-content parts?

baq1y ago

There’s html and then there’s… html.

A nicely formatted subset of html is very different from a dom tag soup that is more or less the default nowadays.

JimDabell1y ago

Tag soup hasn’t been a problem for years. The HTML 5 specification goes into a lot more detail than previous specifications when it comes to parsing malformed markup and browsers follow it. So no matter the quality of the markup, if you throw it at any HTML 5 implementation, you will get the same consistent, unambiguous DOM structure.

1 more reply

faangguyindia1y ago

That's why the best strategy is to feed the whole page into LLM. (After removing html tags) and just ask LLM to give you the date you need in the format you need.

If there is lots of javascript dom manipulation happening after pageload. Then just render in webdriver and screenshot, ocr and feed the result into LLM and ask it the right questions.

1 more reply

nickpsecurity1y ago

It’s informal language that has formal language mixed in. The informal parts determine how the final document should look. So, a simple formal-to-formal translation won’t meet their needs.

MantisShrimp901y ago· 2 in thread

I never really understand this reasoning of "regex is hard to reason about, so we just use an LLM we custom made instead!" I get it's trendy but reasoning about LLMs is impossible for many devs the idea that this makes it more maintainable is pretty hilarious.

nickpsecurity1y ago

Regex’s require you to understand what the obscure-looking patterns do character by character in a pile of text. Then, across different piles of text. Then, juggling different regex’s.

For a LLM, you can just tune it to produce the right output using examples. Your brain doesn’t have to understand the tedious things it’s doing.

This also replaces a boring, tedious job with one (LLM’s) that’s more interesting. Programmers enjoy those opportunities.

generalizations1y ago

In either case you end up with an inscrutable black box into which you pass your html...honestly I'd prefer the black box that runs more efficiently and is intelligible to at least some people (or most, with the help of a big LLM).

1 more reply

vladde1y ago· 2 in thread

Unfortunately not getting any good results for RFC 3339 (https://www.rfc-editor.org/rfc/rfc3339), such a page where I think it would be great to convert text into readable Markdown.

The end result is just like the original site but with without any headings and the a lot of whitespace still remaining (but with some non-working links inserted) :/

Using thei API link, this is what it looks like: https://r.jina.ai/https://www.rfc-editor.org/rfc/rfc3339

bberenberg1y ago

Tested it using the model in Google Colab and it did ok, but the output is truncated at the following line:

> [Appendix B](#appendix-B). Day

So not sure if it's the length of the page, or something else, but in the end, it doesn't really work?

lelandfe1y ago

That's their existing API (which I also tried, with... less than desirable results). This post is about a new model, `reader-lm`, which isn't in production yet.

alexdoesstuff1y ago· 2 in thread

Feels surprising that there isn't a modern best-in-class non-LLM alternative for this task. Even in the post, they described that they used a hodgepodge of headless Chrome, readability, lots of regex to create content-only HTML.

Best I can tell, everyone is doing something similar, only differing in the amount of custom situation regex being used.

monacobolid1y ago

How could it possibly be (a better solution) when there are X different ways to do any single thing in html(/css/js)? If you have a website that uses a canvas to showcase the content (think presentation or something like that), where would you even start? People are still discussing whether the semantic web is important; not every page is utf8 encoded, etc. IMHO small LLMS (trained specifically for this) combined with some other (more predictable) techniques are the best solution we are going to get.

alexdoesstuff1y ago

Fully agree on the premise: there are X different ways to do anything on the web. But - prior to this - the solution seemed to be: everyone starts from scratch with some ad-hoc Regex, and plays a game of whackamole to cover the first n of the x different ways to do things.

Best of my knowledge there isn't anything more modern than Mozilla's readability and that's essentially a tool from the early 2010s.

fsndz1y ago· 1 in thread

I can say that enough. Small Language Models are the future. https://www.lycee.ai/blog/why-small-language-models-are-the-...

Diti1y ago

An aligned future, for sure. Current commercial LLMs refuse to talk about “keeping secrets” (protection of identity) or pornographic topics (which, in the communities I frequent – made of individuals who have been oppressed partly because of their sexuality –, is an important subject). And uncensored AIs are not really a solution either.

igorzij1y ago· 1 in thread

Why Claude 3.5 Sonnet is missing from the benchmark? Even if the real reason is different and completely legitimate, or perhaps purely random, it comes across as "claude does better than our new model so we omitted it because we wanted the tallest bars on the chart to be ours". And as soon as the reader thinks that, they may start to question everything else in your work, which is genuinely awesome!

faangguyindia1y ago

It's damn slow and overkill for such task.

valstu1y ago· 1 in thread

So regex version still beats the LLM solution. There's also the risk of hallucinations. I wonder if they tried to make SML which would rewrite or update the existing regex solution instead of generating the whole content again? This would mean less output tokens, faster inference and output wouldn't contain hallucinations. Although, not sure if small language models are capabable to write regex

rockstarflo1y ago

I think regex can beat SLM for a specific use case. But for the general case, there is no chance you come up with a pattern that works for all sites.

foul1y ago· 1 in thread

When does this SML perform better than hxdelete (or xmlstarlet or whatever) + rdrview + pandoc?

fsiefken1y ago

The answer is in the OP's Reader-LM report:

About their readability-markdown pipeline: "Some users found it too detailed, while others felt it wasn’t detailed enough. There were also reports that the Readability filter removed the wrong content or that Turndown struggled to convert certain parts of the HTML into markdown. Fortunately, many of these issues were successfully resolved by patching the existing pipeline with new regex patterns or heuristics."

To answer their question about the potention of a SML doing this, they see 'room for improvement' - but as their benchmark shows, it's not up to their classic pipeline.

You echo their research question: "instead of patching it with more heuristics and regex (which becomes increasingly difficult to maintain and isn’t multilingual friendly), can we solve this problem end-to-end with a language model?"

tatsuya41y ago

In real-world use cases, it seems more appropriate to use advanced models to generate suitable rule trees or regular expressions for processing HTML → Markdown, rather than directly using a smaller model to handle each HTML instance. The reasons for this approach include:

1. The quality of HTML → Markdown conversion results is easier to evaluate.

2. The HTML → Markdown process is essentially a more sophisticated form of copy-and-paste, where AI generates specific symbols (such as ##, *) rather than content.

3. Rule-based systems are significantly more cost-effective and faster than running an LLM, making them applicable to a wider range of scenarios.

These are just my assumptions and judgments. If you have practical experience, I'd welcome your insights.

faangguyindia1y ago

My brother made: https://github.com/zerocorebeta/Option-K

Basically, it's utility which completes commandline for you

While playing with it, we thought about creating a custom small model for this.

But it was really limiting! If we use small model trained on MAN pages, bash scripts, stack overflow and forums etc...

We miss the key component, using a larger model like flash is more effective as this model knows lot more about other things.

For example, I can ask this model to simply generate a command that lets me download audio from a youtube url.

smusamashah1y ago

As per reddit their API that converts html to markdown can be used by appending url to https://r.jina.ai like https://r.jina.ai/https://news.ycombinator.com/item?id=41515...

I don't know if its using their new model or their engine

rwl41y ago

Not sure about the quality of the model's output. But I really appreciate this little mini-paper they produced. It gives a nice concise description of their goals, benchmarks, dataset preparation, model sizes, challenges and conclusion. And the whole thing is about a 5-10 minute read.

siscia1y ago

The more I think about the less I am completely against this approach.

Instead of applying an obscure set of heuristic by hand, let the LM figure out the best way starting from a lot of data.

The model is bound to be less debuggable and much more difficult to update, for experts.

But in the general case it will work well enough.

WesolyKubeczek1y ago

What ever happened to parsing HTML with regexes that you need a beefy GPU/CPU/NPU now to convert HTML to Markdown?

Onavo1y ago

We need one that operates on the visual output

Dowwie1y ago

I'm curious about the dataset. What scenarios need to be covered during training?

coreypreston1y ago

Pandoc does this very well.

denidoman1y ago

next step: websites add irrelevant text and prompt injections into hidden dom nodes, tags attributes, etc. to prevent llm-based scraping.

j / k navigate · click thread line to collapse

43 comments

38 comments · 19 top-level

sippeangelo1y ago· 5 in thread

For as much as I would love for this to work, I'm not getting great results trying out the 1.5b model in their example notebook on Colab.

I'd argue that Arxiv.org is a reasonable example in the age of webapps, so what gives?

faangguyindia1y ago

Question is why even use these small models?

When you've Google Flash which is lightening fast and cheap.

My brother implemented it in option-k : https://github.com/zerocorebeta/Option-K

It's near instant. So why waste time on small models? It's going to cost more than Google flash.

dartos1y ago

Sometimes you don’t want to share all your data with the largest corporations on the planet.

oezi1y ago

What is Google Flash? Do you mean Gemini Flash? If so, then the article talks about that general purpose LLMs are worse than this specialized LLM for Markdown conversion.

1 more reply

randomdata1y ago

Small models often do a much better job when you have a well-defined task.

FL33TW00D1y ago

Privacy, Cost, Latency, Connectivity.

choeger1y ago· 4 in thread

Maybe I am missing something here, but why would you run "AI" on that task when you go from formal language to formal language?

I don't get the usage of "regex/heuristics" either. Why can that task not be completely handled by a classical algorithm?

Is it about the removal of non-content parts?

baq1y ago

There’s html and then there’s… html.

A nicely formatted subset of html is very different from a dom tag soup that is more or less the default nowadays.

JimDabell1y ago

1 more reply

faangguyindia1y ago

That's why the best strategy is to feed the whole page into LLM. (After removing html tags) and just ask LLM to give you the date you need in the format you need.

If there is lots of javascript dom manipulation happening after pageload. Then just render in webdriver and screenshot, ocr and feed the result into LLM and ask it the right questions.

1 more reply

nickpsecurity1y ago

It’s informal language that has formal language mixed in. The informal parts determine how the final document should look. So, a simple formal-to-formal translation won’t meet their needs.

MantisShrimp901y ago· 2 in thread

nickpsecurity1y ago

Regex’s require you to understand what the obscure-looking patterns do character by character in a pile of text. Then, across different piles of text. Then, juggling different regex’s.

For a LLM, you can just tune it to produce the right output using examples. Your brain doesn’t have to understand the tedious things it’s doing.

This also replaces a boring, tedious job with one (LLM’s) that’s more interesting. Programmers enjoy those opportunities.

generalizations1y ago

1 more reply

vladde1y ago· 2 in thread

Unfortunately not getting any good results for RFC 3339 (https://www.rfc-editor.org/rfc/rfc3339), such a page where I think it would be great to convert text into readable Markdown.

The end result is just like the original site but with without any headings and the a lot of whitespace still remaining (but with some non-working links inserted) :/

Using thei API link, this is what it looks like: https://r.jina.ai/https://www.rfc-editor.org/rfc/rfc3339

bberenberg1y ago

Tested it using the model in Google Colab and it did ok, but the output is truncated at the following line:

> [Appendix B](#appendix-B). Day

So not sure if it's the length of the page, or something else, but in the end, it doesn't really work?

lelandfe1y ago

That's their existing API (which I also tried, with... less than desirable results). This post is about a new model, `reader-lm`, which isn't in production yet.

alexdoesstuff1y ago· 2 in thread

Best I can tell, everyone is doing something similar, only differing in the amount of custom situation regex being used.

monacobolid1y ago

alexdoesstuff1y ago

Best of my knowledge there isn't anything more modern than Mozilla's readability and that's essentially a tool from the early 2010s.

fsndz1y ago· 1 in thread

I can say that enough. Small Language Models are the future. https://www.lycee.ai/blog/why-small-language-models-are-the-...

Diti1y ago

igorzij1y ago· 1 in thread

faangguyindia1y ago

It's damn slow and overkill for such task.

valstu1y ago· 1 in thread

rockstarflo1y ago

I think regex can beat SLM for a specific use case. But for the general case, there is no chance you come up with a pattern that works for all sites.

foul1y ago· 1 in thread

When does this SML perform better than hxdelete (or xmlstarlet or whatever) + rdrview + pandoc?

fsiefken1y ago

The answer is in the OP's Reader-LM report:

To answer their question about the potention of a SML doing this, they see 'room for improvement' - but as their benchmark shows, it's not up to their classic pipeline.

tatsuya41y ago

1. The quality of HTML → Markdown conversion results is easier to evaluate.

2. The HTML → Markdown process is essentially a more sophisticated form of copy-and-paste, where AI generates specific symbols (such as ##, *) rather than content.

3. Rule-based systems are significantly more cost-effective and faster than running an LLM, making them applicable to a wider range of scenarios.

These are just my assumptions and judgments. If you have practical experience, I'd welcome your insights.

faangguyindia1y ago

My brother made: https://github.com/zerocorebeta/Option-K

Basically, it's utility which completes commandline for you

While playing with it, we thought about creating a custom small model for this.

But it was really limiting! If we use small model trained on MAN pages, bash scripts, stack overflow and forums etc...

We miss the key component, using a larger model like flash is more effective as this model knows lot more about other things.

For example, I can ask this model to simply generate a command that lets me download audio from a youtube url.

smusamashah1y ago

As per reddit their API that converts html to markdown can be used by appending url to https://r.jina.ai like https://r.jina.ai/https://news.ycombinator.com/item?id=41515...

I don't know if its using their new model or their engine

rwl41y ago

siscia1y ago

The more I think about the less I am completely against this approach.

Instead of applying an obscure set of heuristic by hand, let the LM figure out the best way starting from a lot of data.

The model is bound to be less debuggable and much more difficult to update, for experts.

But in the general case it will work well enough.

WesolyKubeczek1y ago

What ever happened to parsing HTML with regexes that you need a beefy GPU/CPU/NPU now to convert HTML to Markdown?

Onavo1y ago

We need one that operates on the visual output

Dowwie1y ago

I'm curious about the dataset. What scenarios need to be covered during training?

coreypreston1y ago

Pandoc does this very well.

denidoman1y ago

next step: websites add irrelevant text and prompt injections into hidden dom nodes, tags attributes, etc. to prevent llm-based scraping.

j / k navigate · click thread line to collapse