> it could be good for a particular use case
Namely?
> or for learning from.
The author admitted in the GitHub README that the code quality is rather bad. I also don't see a link to the search index, which is the only valuable component of this project.
There is also a free API in beta test right now. It will probably be ready for official release next week.
But if they can, I think a big part of it would be separating the crawl index from the UI, prioritisation, and so on. Different people can work on those two ends of the problem and apply different philosophies.
Search only forums? Reject porn using some XYZ method? Great! But they can all use the same database, or pick from a common pool of community databases.
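The split suggested above can be sketched in a few lines. This is purely illustrative, not anything from the project: a shared crawl index as a dumb store, with different front-ends layering their own filtering philosophies on top. All names here (`CrawlIndex`, `forums_only`, etc.) are hypothetical.

```python
# Hypothetical sketch: one shared crawl index, multiple front-end philosophies.
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    is_forum: bool
    text: str


class CrawlIndex:
    """The shared component: a plain store of crawled pages."""

    def __init__(self, pages):
        self.pages = pages

    def search(self, term):
        # Naive full-text match; a real index would use an inverted index.
        return [p for p in self.pages if term in p.text]


def forums_only(index, term):
    """One front-end philosophy: only surface forum pages."""
    return [p.url for p in index.search(term) if p.is_forum]


def everything(index, term):
    """Another philosophy: no filtering at all, same underlying index."""
    return [p.url for p in index.search(term)]
```

The point is that both front-ends query the exact same `CrawlIndex`; only the ranking/filtering layer differs, so different people can maintain each piece independently.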
Have you also published the ranking mechanism? That way people might contribute to improving it.
2 servers are used for crawling, index building, and raw-data storage. Quad-core, 32 GB RAM, 4 TB HDD, and a 1 Gbit/s internet connection on each of these. They are rented and sit in a big data center. Crawling uses "only" about 200-250 Mbit/s of bandwidth.
2 servers for the webserver and queries. Quad-core, 32 GB RAM. One with 2x512 GB SSD, the other with only 1x512 GB SSD. These servers are here at home. I have cable internet with 200 Mbit/s down, 20 Mbit/s up. Static IPs, obviously.
A full crawl currently takes about 3 months.
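A quick back-of-envelope check on what those numbers imply: if the quoted 200-250 Mbit/s were sustained for the full 3-month crawl (an assumption; real crawls have idle time), the raw transfer volume comes out in the low hundreds of terabytes. Decimal units assumed throughout.

```python
# Rough estimate of total data transferred during a crawl,
# assuming a constant sustained bandwidth (decimal units: 1 TB = 1e12 B).
def crawl_volume_tb(mbit_per_s: float, days: float) -> float:
    bytes_total = mbit_per_s * 1e6 / 8 * days * 86400  # bits/s -> bytes over the period
    return bytes_total / 1e12


low = crawl_volume_tb(200, 90)   # ~194 TB at the low end
high = crawl_volume_tb(250, 90)  # ~243 TB at the high end
```

So the 4 TB HDDs per crawl server only make sense if most of what is fetched is parsed, indexed, and discarded rather than stored raw, which is plausible for a search crawler.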