Indexing a billion pages

blog.mwmbl.org

122 points by daoudc 2y ago | 35 comments

35 comments

xnx 2y ago

How does the homepage of https://mwmbl.org/ not have a single sentence explaining what it is or even an "About" link?

From Github: "Mwmbl is a non-profit, ad-free, free-libre and free-lunch search engine with a focus on useability and speed."

CharlesW 2y ago

Here's its mission: https://blog.mwmbl.org/articles/non-profit-search-engine/

renegat0x0 2y ago

I suggest also providing og:title and og:image fields for social media.

Kiro 2y ago

Not everything is a product that needs to be sold.

xnx 2y ago

Totally agree. I was just trying to figure out what it is. Even something as small as a subheading, like Wikipedia's "The Free Encyclopedia", would be very helpful.

hawski 2y ago

Everything is something. It is helpful to know what this particular something is, regardless of whether it is sold or not.

jetrink 2y ago

> We’ve indexed over 100 million pages

> [W]e’re crawling up to a million pages a day, as you can see on our stats page.

> Given that Mwmbl is still relatively unknown, it seems plausible that we can reach our target of crawling three billion pages a day, to refresh the entire index in one month.

I think this is supposed to read "it seems plausible that we can reach our target of crawling three million pages a day."

hinkley 2y ago

> to refresh the entire index [of 1 billion] in one month.

Seems you are correct.

halfdan 2y ago

Except then it needs to be 30M pages/day. 3M * 30 days just nets you 90M
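To make the arithmetic in this subthread explicit, here is a small Python sketch. It assumes the target index is the one billion pages from the post title and that "one month" means 30 days; the function names are made up for illustration.

```python
# Assumptions (not stated as exact figures in the post): a 1-billion-page
# index and a 30-day month. Function names are illustrative only.
INDEX_SIZE = 1_000_000_000
REFRESH_DAYS = 30


def required_pages_per_day(index_size: int, days: int) -> float:
    """Daily crawl rate needed to revisit every page once within `days`."""
    return index_size / days


def days_to_refresh(index_size: int, pages_per_day: int) -> float:
    """Days needed to revisit every page once at a fixed daily rate."""
    return index_size / pages_per_day


if __name__ == "__main__":
    rate = required_pages_per_day(INDEX_SIZE, REFRESH_DAYS)
    print(f"needed rate: {rate:,.0f} pages/day")
    print(f"3M/day covers {3_000_000 * REFRESH_DAYS:,} pages in {REFRESH_DAYS} days")
    slow = days_to_refresh(INDEX_SIZE, 3_000_000)
    print(f"at 3M/day a full refresh takes about {slow:,.0f} days")
```

Under those assumptions the required rate is roughly 33 million pages a day, so 30M is the right order of magnitude, while 3M a day would need about 333 days for one full pass.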

bdcravens 2y ago

Most impressive part:

> Our estimated annual budget is $752.36 and we have spent $174.49.

_w1tm 2y ago

If you value engineering time at zero you can spend months / years / decades / eons optimizing your solution for cost.

mdaniel 2y ago

I recalled seeing this before because of its Welsh name. As is often the case, some of the earlier submissions link to their domain and some to the GitHub repo; the ones with over 100 comments are:

https://news.ycombinator.com/item?id=37561155

https://news.ycombinator.com/item?id=29690877

dang 2y ago

Thanks! Macroexpanded:

We are entering a new era of web search - https://news.ycombinator.com/item?id=38465864 - Nov 2023 (2 comments)

Mwmbl: Free, open-source and non-profit search engine - https://news.ycombinator.com/item?id=37561155 - Sept 2023 (122 comments)

The Book of Mwmbl: a free, non-profit search engine - https://news.ycombinator.com/item?id=33828087 - Dec 2022 (8 comments)

Show HN: An open source web crawler for the Mwmbl non-profit search engine - https://news.ycombinator.com/item?id=31765015 - June 2022 (4 comments)

Show HN: I'm building a non-profit search engine - https://news.ycombinator.com/item?id=29690877 - Dec 2021 (199 comments)

marginalia_nu 2y ago

I'll race you there ;-)

Alifatisk 2y ago

I remember reading about a project whose sole purpose is to provide a large index of the open web for free; anyone could download it. I forgot the name of the project.

Why can’t mwmbl download their index?

Also, is mwmbl planning on providing their crawled index for free? Like, can I also download it later?

If that is the case, I’d happily download their FF extension.

Mayzie 2y ago

What you’re likely referring to is Common Crawl: https://commoncrawl.org

Alifatisk 2y ago

Probably, but I don’t remember the website looking like that. It had more colors to it, like red, yellow, blue, etc.

Update: Yes, it's Common Crawl, but with an updated design. https://archive.is/4dz8G

KomoD 2y ago

I installed the extension because why not, then I noticed it was only crawling spam pages and redirect links that were being abused for spam... I guess it's kind of expected, but I'm not sure how I feel about it.

Though, it's still a very cool idea, maybe an option to crawl sites I visit would be nice?

mdaniel 2y ago

> Though, it's still a very cool idea, maybe an option to crawl sites I visit would be nice?

I had an extension installed for a while that submitted pages to the Internet Archive automatically, but it was a constant battle to remember to denylist any sites that were personal (bank, doctor, whatever) before visiting them. That was before I was heavy into the Firefox containers setup, so if I were to try that again I'd try to find a way to disable it for those containers (which, come to think of it, may be yet another container feature request)

Having thought a little further about your suggestion, I could imagine an extension that merely submitted the window.location.origin to the search engine and let it index the site, as a heuristic for "this site is popular enough to have received a visit in the past hour/day/whatever" but with Mwmbl specifically that'd put things back in a loop since it would send the site back to your browser to index it for them
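A rough sketch of that origin-only idea, written in Python rather than as real extension code. Everything here is hypothetical: the `DENYLIST` contents, the `origin_to_submit` name, and the choice to drop default ports to approximate what `window.location.origin` returns. It only shows that the path, query string, fragment, and any credentials in the URL can be discarded before anything leaves the browser.

```python
from typing import FrozenSet, Optional
from urllib.parse import urlsplit

# Hypothetical denylist of hosts that should never be reported.
DENYLIST: FrozenSet[str] = frozenset({"mail.google.com"})

# Ports that browsers omit from an origin because they are the scheme default.
DEFAULT_PORTS = {"http": 80, "https": 443}


def origin_to_submit(url: str, denylist: FrozenSet[str] = DENYLIST) -> Optional[str]:
    """Return scheme://host[:port] for `url`, or None if it should not be sent."""
    parts = urlsplit(url)
    # Only web pages are of interest; about:, file:, etc. are ignored.
    if parts.scheme not in DEFAULT_PORTS or not parts.hostname:
        return None
    if parts.hostname in denylist:
        return None
    origin = f"{parts.scheme}://{parts.hostname}"
    # Keep a port only when it differs from the scheme's default.
    if parts.port is not None and parts.port != DEFAULT_PORTS[parts.scheme]:
        origin += f":{parts.port}"
    return origin
```

For example, `origin_to_submit("https://example.com/a/b?q=1#frag")` yields `"https://example.com"`, and any URL on a denylisted host yields `None`.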

I sure do hope Mwmbl's extension is not using the full browser context to make requests, otherwise any request to index mail.google.com would be no bueno

Alifatisk 2y ago

Oh, so the extension is not crawling the website you visit but crawls the open web in the background?

How much CPU / RAM does it consume?

mdaniel 2y ago

> The biggest expense was purchasing a PyCharm professional license at $116.58

I mean, it's awesome that they value good tooling enough to spend on it, but https://www.jetbrains.com/community/opensource/ almost certainly means they qualify for a complimentary license

Alifatisk 2y ago

How do I identify my hash among the users in the stats https://mwmbl.org/stats ?

Alifatisk 2y ago

What's the consequence of installing the crawler in FF? Can the ISP / Cloudflare / any other party start blacklisting you?

hinkley 2y ago

Well, the internet could know the balance of your bank account at that boutique credit union, or which sex toys you’ve been looking at buying.

We do web hosting for businesses. We are careful to identify bots because we don’t want personalized recommendations in search engines, and don’t want PII showing up. Not that we put a lot of that in the site, but think about a Home Depot or a Best Buy situation. If you’re a human near the Minneapolis store, you want availability for that store, not the Detroit one. But corporate wants a bot to know that this microwave is something you in theory carry, but might have to ship from a warehouse. If Home Depot wants that, and our customers also want that, then probably a lot of businesses feel the same.

Alifatisk 2y ago

But the extension is not crawling my stuff, it takes batches of urls from the host server. How is the internet going to get my bank account?
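As a rough illustration of that batch model (not Mwmbl's actual code; `crawl_batch` and the injected `fetch` callable are hypothetical names), the loop below only ever touches URLs handed to it by the server. Nothing in it reads browsing history, and whether cookies are sent depends entirely on the `fetch` function supplied.

```python
from typing import Callable, Dict, Iterable


def crawl_batch(urls: Iterable[str], fetch: Callable[[str], str]) -> Dict[str, str]:
    """Fetch each server-assigned URL and return a url -> body mapping.

    `fetch` is injected so the caller decides how requests are made,
    e.g. a plain HTTP client that carries no cookies or logged-in session.
    """
    results: Dict[str, str] = {}
    for url in urls:
        try:
            results[url] = fetch(url)
        except Exception:
            # A page that fails to load is skipped rather than aborting the batch.
            continue
    return results
```

A caller would pass in the batch received from the server plus a credential-free fetcher, then upload `results` back to the server.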

KomoD 2y ago

> Can the ISP / Cloudflare / any other party start blacklisting you?

Yes, wouldn't surprise me if you were to get blocked from tons of sites and flagged as a bot on Cloudflare and other WAFs.

Alifatisk 2y ago

Yeah, getting blocked from Cloudflare's network is not something I want to experience; it's hell.

hcfman 2y ago

Quite curious. What indexing and retrieval software is this using? I couldn’t find a reference to it.

Does it index phrases?

bdcravens 2y ago

"... who crawl the web using the Firefox extension and command line script"

https://addons.mozilla.org/en-GB/firefox/addon/mwmbl-web-cra...

https://github.com/mwmbl/crawler-script

mdaniel 2y ago

I believe this is closer to the thing you were asking about, and the simple answer appears to be "a home grown one in python" https://github.com/mwmbl/mwmbl/blob/e544d45c374c13cdc1a5048d...

jmclnx 2y ago

Very interesting and was quick for me. Nice work!

foreigner 2y ago

Do they use Common Crawl?

urbandw311er 2y ago

I think the saddest part of this is that, owing to the total enshittification of the web due to SEO, at least 50% of what they index will be absolute garbage.

KomoD 2y ago

Higher. Run the extension and look at the feed; it's like 90% garbage.
