How they build that database, and which models they use for text tokenization, embedding generation, and ranking at "internet" scale, is the secret sauce that has enabled them to raise more than $165M to date.
For sure this is where internet search will be in a couple of years, which is why Google got really concerned when the original ChatGPT was released. That said, don't assume Google isn't already working on something similar. In fact, the main theme of their Google Next conference was LLMs and RAG.
The spirit of the article is that this can be achieved in a decentralized way, without search engines: just your LLM plus the embedding databases the article proposes each website would publish.
A problem with this is that you still need to keep local copies of the databases you get from crawling the web, and train your LLM to use them.
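The client side of that idea can be sketched in a few lines. This is purely hypothetical: the file layout, the URLs, and the toy `embed()` function (a tiny bag-of-words stand-in for a real embedding model both sides would have to agree on) are all assumptions, not anything the article specifies.

```python
import numpy as np

# Toy stand-in for a real embedding model: bag-of-words over a tiny fixed
# vocabulary, L2-normalised so a dot product equals cosine similarity.
VOCAB = ["pricing", "subscription", "api", "developers", "company", "history", "cost"]

def embed(text: str) -> np.ndarray:
    words = text.lower().split()
    vec = np.array([float(words.count(w)) for w in VOCAB])
    n = np.linalg.norm(vec)
    return vec / n if n else vec

# Imagine this index was downloaded from a (hypothetical) file the site
# publishes, e.g. https://example.com/embeddings.json, and cached locally.
site_index = {
    "/about": embed("about our company history"),
    "/pricing": embed("pricing plans and subscription tiers"),
    "/docs/api": embed("api reference for developers"),
}

def search(query: str) -> str:
    # Return the URL whose published embedding is closest to the query.
    q = embed(query)
    return max(site_index, key=lambda url: float(np.dot(q, site_index[url])))

print(search("how much does a subscription cost"))  # → "/pricing"
```

The local-copy problem from the comment above is visible even here: `site_index` has to live on your machine and be refreshed whenever the site changes, multiplied across every site you crawl.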
Before you invent a new protocol, look at the Semantic Web (RDF et al.), Microformats, and...
edit: from the article, "Doing this for a few urls is easy but doing it for billions of urls starts to get tricky and expensive (although not completely out of reach)" - indeed so, but we have now generated embeddings for about half of those ~8 billion pages and are using them for mojeek.com.
We have an API with many features, including, uniquely, authority and ranking scores. Embeddings could be added.
https://www.mojeek.com/services/search/web-search-api/ is used by Kagi, Meta and others. Self-disclosure: Mojeek team member.
^ For If You Didn't Read It
Here's their blog article for it: https://help.kagi.com/kagi/ai/quick-answer.html You have to fire up your bullshit detector when looking at the results, but I find it saves a good 3-4 clicks on average.
Llama 3 was trained on 15 trillion tokens, but I can download a version of that model that's just 4GB in size.
No matter how "big" your model is, there is still scope for techniques like RAG if you want it to return answers grounded in actual text, as opposed to often-correct hallucinations spun up from the giant matrices of numbers in the model weights.
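The grounding step is mechanically simple; the point is that the model answers from retrieved text rather than from its weights alone. A minimal sketch, with a made-up three-document corpus and naive word-overlap retrieval standing in for a real embedding search:

```python
# Hypothetical mini-corpus; a real system would retrieve from an index
# built over crawled pages, using embeddings rather than word overlap.
corpus = [
    "Mojeek is an independent search engine with its own crawler and index.",
    "Llama 3 was trained on 15 trillion tokens of text.",
    "RAG retrieves documents at query time and feeds them to the model.",
]

def retrieve(question: str) -> str:
    # Pick the document sharing the most words with the question.
    q = set(question.lower().split())
    return max(corpus, key=lambda doc: len(q & set(doc.lower().split())))

def build_prompt(question: str) -> str:
    # Prepend the retrieved passage so the LLM's answer is grounded in
    # actual text instead of whatever its weights happen to recall.
    context = retrieve(question)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

print(build_prompt("How many tokens was Llama 3 trained on?"))
```

The prompt that comes out is what gets sent to the model; the model's size doesn't change, only what it is asked to condition on.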