From GitHub: "Mwmbl is a non-profit, ad-free, free-libre and free-lunch search engine with a focus on useability and speed."
> [W]e’re crawling up to a million pages a day, as you can see on our stats page.
> Given that Mwmbl is still relatively unknown, it seems plausible that we can reach our target of crawling three billion pages a day, to refresh the entire index in one month.
I think this is supposed to read "it seems plausible that we can reach our target of crawling three million pages a day."
> Our estimated annual budget is $752.36 and we have spent $174.49.
We are entering a new era of web search - https://news.ycombinator.com/item?id=38465864 - Nov 2023 (2 comments)
Mwmbl: Free, open-source and non-profit search engine - https://news.ycombinator.com/item?id=37561155 - Sept 2023 (122 comments)
The Book of Mwmbl: a free, non-profit search engine - https://news.ycombinator.com/item?id=33828087 - Dec 2022 (8 comments)
Show HN: An open source web crawler for the Mwmbl non-profit search engine - https://news.ycombinator.com/item?id=31765015 - June 2022 (4 comments)
Show HN: I'm building a non-profit search engine - https://news.ycombinator.com/item?id=29690877 - Dec 2021 (199 comments)
Why can’t mwmbl download their index?
Also, is mwmbl planning on providing their crawled index for free? Like, can I also download it later?
If that is the case, I'd happily download their FF extension.
Update: Yes, it's commoncrawl but with updated design. https://archive.is/4dz8G
Though, it's still a very cool idea, maybe an option to crawl sites I visit would be nice?
I had an extension installed for a while that submitted pages to the Internet Archive automatically, but it was a constant battle to remember to denylist any sites that were personal (bank, doctor, whatever) before visiting them. That was before I was heavy into the Firefox containers setup, so if I were to try that again I'd try to find a way to disable it for those containers (which, come to think of it, may be yet another container feature request)
Having thought a little further about your suggestion, I could imagine an extension that merely submitted the window.location.origin to the search engine and let it index the site, as a heuristic for "this site is popular enough to have received a visit in the past hour/day/whatever" but with Mwmbl specifically that'd put things back in a loop since it would send the site back to your browser to index it for them
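For what it's worth, the origin-only idea can be sketched in a few lines. This is only an illustration, written in Python rather than as a real browser extension: `origin_of`, `OriginReporter`, the one-hour window and the denylist contents are all hypothetical names and values, and nothing here reflects how Mwmbl's extension actually works.

```python
from typing import Optional
from urllib.parse import urlsplit

# Default ports are omitted from an origin, as window.location.origin does.
DEFAULT_PORTS = {"http": 80, "https": 443}


def origin_of(url: str) -> Optional[str]:
    """Reduce a URL to scheme://host[:port]; return None for non-web URLs."""
    parts = urlsplit(url)
    if parts.scheme not in DEFAULT_PORTS or parts.hostname is None:
        return None
    host = parts.hostname
    if parts.port is not None and parts.port != DEFAULT_PORTS[parts.scheme]:
        host = f"{host}:{parts.port}"
    return f"{parts.scheme}://{host}"


class OriginReporter:
    """Hypothetical helper: decides which origins would be reported, and when."""

    def __init__(self, window_seconds: int, denylist: set):
        self.window_seconds = window_seconds
        self.denylist = denylist   # hosts that must never be reported
        self.last_sent = {}        # origin -> time of last report

    def next_submission(self, url: str, now: float) -> Optional[str]:
        """Return the origin to report for this visit, or None to stay silent."""
        origin = origin_of(url)
        if origin is None:
            return None
        if urlsplit(origin).hostname in self.denylist:
            return None
        last = self.last_sent.get(origin)
        if last is not None and now - last < self.window_seconds:
            return None            # already reported within the window
        self.last_sent[origin] = now
        return origin


reporter = OriginReporter(window_seconds=3600, denylist={"mail.google.com"})
print(reporter.next_submission("https://example.com/some/page?q=1", now=0))
print(reporter.next_submission("https://mail.google.com/mail/u/0", now=0))
```

Only the origin ever leaves the function, so paths and query strings stay local, and the denylist check happens before anything is recorded.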
I sure do hope Mwmbl's extension is not using the full browser context to make requests, otherwise any request to index mail.google.com would be no bueno
How much CPU / RAM does it consume?
I mean, awesome that they value good tooling to spend on it but https://www.jetbrains.com/community/opensource/ almost certainly means they qualify for a complimentary license
We do web hosting for businesses. We are careful to identify bots because we don’t want personalized recommendations in search engines, and don’t want PII showing up. Not that we put a lot of that in the site, but think about a Home Depot or a Best Buy situation. If you’re a human near the Minneapolis store, you want availability for that store, not the Detroit one. But corporate wants a bot to know that this microwave is something you in theory carry, but might have to ship from a warehouse. If Home Depot wants that, and our customers also want that, then probably a lot of businesses feel the same.
Yes, wouldn't surprise me if you were to get blocked from tons of sites and flagged as a bot on Cloudflare and other WAFs.
Does it index phrases?
https://addons.mozilla.org/en-GB/firefox/addon/mwmbl-web-cra...