Show HN: Ichido, search engine that tags sites using Google and Cloudflare

(ichi.do)

122 points · anthonyhn · 3y ago · 71 comments

Hello HN,

In my spare time I work on an experimental search engine named Ichido. Search is fascinating: there are so many features you can add to a search engine, but I find that the existing search engines are a bit limited in the features they have to offer. So I decided to work on my own search engine to test out different features, search algorithms, and front ends in order to improve my (and hopefully others') searching experience.

Ichido includes a tagging system that provides more info on search results. For example, if a site links to Google services or uses Cloudflare, a tag is shown with the search result that lets the user know about that site's use of those services. Ichido also includes links to RSS feeds in search results, making it much easier to find RSS feeds.

This search engine is free to use, but if you like the service and want to support continued development, please consider making a donation (Ichido currently supports donations through Liberapay).

54 comments · 16 top-level

TekMol · 3y ago · 10 in thread

In your about page, I see you are using Bing's API. I didn't even know Bing has a search API that everyone can use!

How much do you have to pay them for this?

anthonyhn (OP) · 3y ago

It's $4/1000 queries, but the rate is increasing in May to $18/1000 queries. The Bing API is available through Azure.

esperent · 3y ago

> $4/1000

Is it just me or does this seem insanely expensive already?

danuker · 3y ago

I hope the author knows this, and won't be surprised by a bill more than 4x larger.
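For a sense of scale, here is a minimal sketch of the arithmetic behind the "more than 4x" remark. The two rates are the ones quoted above; the monthly query volume is a made-up figure used only for illustration.

```python
# Rates quoted in the thread, in USD per 1,000 queries.
OLD_RATE_PER_1000 = 4.00
NEW_RATE_PER_1000 = 18.00


def monthly_cost(queries: int, rate_per_1000: float) -> float:
    """Return the cost in USD for a given number of queries."""
    return queries / 1000 * rate_per_1000


# Hypothetical volume, chosen only for illustration.
queries_per_month = 50_000

old = monthly_cost(queries_per_month, OLD_RATE_PER_1000)
new = monthly_cost(queries_per_month, NEW_RATE_PER_1000)
print(f"old: ${old:.2f}, new: ${new:.2f}, ratio: {new / old:.1f}x")
# prints "old: $200.00, new: $900.00, ratio: 4.5x"
```

The ratio does not depend on the volume: 18 / 4 is 4.5 at any query count.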

flas9sd · 3y ago

quite a price hike. Bad for the sustainability of the other portals I use that are backed by the bing index. Will either increase their pricing or efforts on monetization.

Is it easy for you to rely on more search index providers, what are your options?

gdcbe · 3y ago

Pretty sure that only Google and Microsoft have the money and resources to crawl the entire internet. Or perhaps they are the only ones that can AND are willing to.

Correct me if I'm wrong though, but I'm pretty certain that all other search engines in the same category use one of these as their backend. E.g., I'm pretty certain that goes for DuckDuckGo as well.

quectophoton · 3y ago

> Pretty sure that only Google and Microsoft have the money and resources to crawl the entire internet. Or perhaps the only that can AND are willing to.

Money and resources and a dominant-enough position so that your crawlers are not blocked by websites.

Unfortunately.

antonok · 3y ago

Brave Search has its own independent index too - https://brave.com/brave-search-beta/

Santosh83 · 3y ago

I think Yandex have their own index and some others too like Marginalia, but the latter couldn't be called "in the same category" as the other three.

schemescape · 3y ago

What about Common Crawl?

https://commoncrawl.org/

daoudc · 3y ago

Mwmbl has its own index but it's orders of magnitude smaller than commercial search engines.

corobo · 3y ago · 7 in thread

Search engines will do literally anything except the option "never show results from this domain again"

Is there something obvious I'm missing that makes it infeasible, or maybe is it just something only I want?

As for this site there's too many tags for them to be useful imo. Give it 2 weeks of using the search engine and I bet you could hide silly fake tags in there and I'd never notice. Lots of tags = no tags.

I was picturing maybe a little pillbox type thing you might find appended to Google search results.

For instance when a result is a PDF: https://img.imgy.org/-7lq.jpg

prometheon1 · 3y ago

Ability to block/boost domains as seen in the following link:

https://blog.kagi.com/kagi-features

I only know about it because it pops up often on hn. Haven't tried it because at this point I don't want to pay $10 per month for search.

m3kw9 · 3y ago

Eventually someone will block enough big domains and render the search unusable and they may forget they actually did that. And there goes a user

factsaresacred · 3y ago

Including something like "7 results from sources you blocked are hidden. Click to show" would solve this nicely.

swyx · 3y ago

yea but you can just apply the filter on a per user basis. a literal bootcamp grad could write this
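A minimal sketch of the per-user filter discussed in this subthread, including the "N results hidden" notice suggested above. The function name, the sample URLs, and the blocklist are illustrative assumptions, not anything Ichido or another engine is known to use.

```python
from urllib.parse import urlsplit


def filter_blocked(results, blocked_domains):
    """Split result URLs into (visible, hidden_count) for one user's blocklist."""
    blocked = {d.lower() for d in blocked_domains}
    visible = []
    hidden = 0
    for url in results:
        host = (urlsplit(url).hostname or "").lower()
        # Match the blocked domain itself and any of its subdomains.
        if any(host == d or host.endswith("." + d) for d in blocked):
            hidden += 1
        else:
            visible.append(url)
    return visible, hidden


# Hypothetical results and blocklist, for illustration only.
results = [
    "https://example.com/a",
    "https://spam.example.net/b",
    "https://example.org/c",
]
visible, hidden = filter_blocked(results, {"example.net"})
if hidden:
    noun = "result" if hidden == 1 else "results"
    print(f"{hidden} {noun} from sources you blocked are hidden. Click to show")
```

Keeping the hidden count alongside the visible results addresses the worry that a user forgets what they blocked: the notice appears whenever the blocklist removed something.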

joshruby123 · 3y ago

We're actually going to do that!!

eipi10_hn · 3y ago

Yes, there are search engines that let you do that

corobo · 3y ago

Such as..? I've seen Kagi as linked in a sibling comment which I'll give a go.

simultsop · 3y ago · 6 in thread

What would be the issue with being hosted on CF? I believe it is a better option than the rest of the shared hosting industry. If it's nothing critical, what's the intention of the tagging?

danuker · 3y ago

http://crimeflare.eu.org

CloudFlare is a MitMaaS. Traffic is seen by them because they are in control of the HTTPS certificates, and you have to take them at their word that they do not log content (and even if they're not lying/under a gag order, just metadata is enough for a lot of evil things).

Tijdreiziger · 3y ago

So are AWS and Azure also MitMaaS?

If yes, what's the endgame? Everyone goes back to managing their own servers?

If no, why is Cloudflare the only hosting provider that gets singled out?

edaemon · 3y ago

Wouldn't this criticism apply to all content delivery networks? They have to terminate TLS in order to know which content to deliver.

corobo · 3y ago

Are you getting confused with a VPN? Those words in that order are a bad thing for a VPN, not so much a CDN.

These are all reasons I use Cloudflare lmao. Yes I need them to decrypt the traffic because they do various rules and caching for me. That DDoS protection would be pretty naff if they couldn't see the traffic! In one case I really wish they did log, I had to write my own Worker to log the info I needed.

If we were talking outbound proxy then fair enough but it's not like Cloudflare have strongarmed me into using them.. it was me that updated the NS records!

A lot of the list from that site just seem to describe what Cloudflare does, they don't seem to say why each thing is actually a bad thing.

Really does feel like someone's got a hate rod on for Cloudflare and tried to crowbar in as many VPN criticisms without understanding the difference between a VPN/proxy and a CDN.

binarymax · 3y ago

CF will force you to recaptcha if you try to remain anonymous

callahad · 3y ago

Is that universally true, or just when domains explicitly opt into specific traffic screening measures? Asking as I'm thinking of moving some stuff to Cloudflare Pages.

mg · 3y ago · 5 in thread

I run this search engine comparison tool:

https://www.gnod.com/search/

Just added Ichido.

Click on "more engines" to activate it.

culi · 3y ago

I made a post with a bunch of suggestions from my list and then my browser extension that limits my time on HN lost my whole comment including all the little explanations I had for each one. So here's my raw list instead haha

  meta
    https://www.gnod.com/search/
    https://github.com/searx/searx
  categories
    independent
      https://www.crawlson.com/
      https://search.marginalia.nu/
      https://wiby.me/
      https://searchmysite.net/
    international
      https://bonzamate.com.au/ australia
      https://www.baidu.com/ china
      https://yandex.com/ russia
    code
      https://searchcode.com/
      https://codesearch.ai/
      http://symbolhound.com/
      https://publicwww.com/
      https://search.feep.dev/
      http://codesearch.debian.net/
      https://codesearch.isocpp.org/
      https://www.programcreek.com/python/
      https://livegrep.com/search/linux
      https://grep.app/
    ai
      https://consensus.app/ scientific consensus
      https://github.com/jokenox/Goopt procedurally generated
      https://same.energy/ image similarity
    products
      https://www.looria.com/
      https://knifist.com/ knives
      https://attic.city/ home and fashion from indie stores
    topical
      https://biztoc.com/search business news
    premium
      https://kagi.com/
    other
      https://metager.org/ privacy centric engine that combines results of several engines
      https://thangs.com/ 3d models
      https://filmot.com/ youtube subtitles
  lists
    https://seirdy.one/posts/2021/03/10/search-engines-with-own-indexes/
    https://web.archive.org/web/20200710091019/http://www.jaruzel.com/textfiles/Old%20Web%20Info/Internet%20Search%20Engines%20v2.61.txt

Hope it's still useful

marban · 3y ago

Can you add https://biztoc.com/search ? (Real-time business/finance News) Zero Tracking/Cookies.

daoudc · 3y ago

Nice! Please consider adding https://mwmbl.org

Thanks!

FireInsight · 3y ago

You could add other AI search such as perplexity.ai and phind.com

anthonyhn (OP) · 3y ago

>Just added Ichido.

Thanks, much appreciated

partyguy · 3y ago · 4 in thread

Nice project! However, when trying to search for my site (https://spacehey.com), it shows multiple tags, with most of them being false (Cloudflare, UTM Tracking, WEBP Images). I used Cloudflare at one point in the past, but don't anymore. Additionally, there has never been UTM tracking or anything like that nor WEBP images... Where do you get such data from?

Apart from that, awesome project!

anthonyhn (OP) · 3y ago

Since spacehey includes user-submitted content, it's possible that:

* Someone uploaded a WEBP image to the site.

* Someone pasted a link with a utm_* param.

* The page was crawled when cloudflare was used.

Will look into it and see if I can find the pages that generated the tags. Search results are generally tagged by domain name (necessary since not all pages can be crawled, and even if the page the user connects to doesn't have, for example google trackers, a user would likely want to know if the site is using trackers elsewhere).
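As a rough illustration of the second bullet and of tagging by domain, the sketch below flags a domain as soon as any crawled page on it contains a link with a utm_* query parameter. The thread does not say how Ichido's crawler actually works; the function names and the input shape (page URL mapped to the links found on that page) are assumptions made for this example.

```python
from urllib.parse import parse_qsl, urlsplit


def has_utm_params(url: str) -> bool:
    """Return True if the URL's query string has any utm_* parameter."""
    query = urlsplit(url).query
    pairs = parse_qsl(query, keep_blank_values=True)
    return any(key.lower().startswith("utm_") for key, _ in pairs)


def tag_domains(pages: dict) -> dict:
    """Map each crawled domain to the set of tags found on any of its pages.

    `pages` maps a page URL to the list of link URLs found on that page.
    """
    tags = {}
    for page_url, links in pages.items():
        domain = urlsplit(page_url).hostname or ""
        if any(has_utm_params(link) for link in links):
            tags.setdefault(domain, set()).add("UTM Tracking")
    return tags


# Hypothetical crawl data: one user profile page with a pasted tracking link.
pages = {
    "https://example.com/profile/1": ["https://shop.example.org/?utm_source=mail"],
    "https://example.com/about": ["https://example.org/"],
}
print(tag_domains(pages))
```

Because the tag is attached to the domain rather than the page, a single user-submitted link is enough to tag the whole site, which matches the explanation given for spacehey.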

Also love the spacehey project, really captures the feel of Myspace!

anthonyhn (OP) · 3y ago

EDIT: I found some of the pages with links that include UTM tracking params. Let me know if you want me to send you the pages with those links, can send them through email (my email is on the contact page of the site).

partyguy · 3y ago

Oh, I see - that makes perfect sense! Thank you for the clarification!

Glad you like SpaceHey :)

Keep up the great work!

return_to_monke · 3y ago

what's wrong with webp?

bastawhiz · 3y ago · 2 in thread

What's the use case for this? If I don't want Google scripts, I block them. I'll use a user agent that doesn't download or run them. If I don't want cookies, I'll instruct my browser not to save cookies. What situation would I be in where knowing whether a site uses these things is a search result I want to visit?

brucethemoose2 · 3y ago

I find the extra information useful, as I don't have to visit the site to find out.

bastawhiz · 3y ago

But what does it matter? If you're blocking it anyway, what difference does it make whether the site has it or not? I genuinely don't know why knowing this in advance is helpful and want to know what I'm missing

jesprenj · 3y ago · 1 in thread

An interesting search proxy is also SearX. Written in Python, it supports many backend engines and can be self hosted.

And here's a lightweight frontend/proxy I wrote in C for using Google search on low-end phones that can't render bloated HTML (SearX was too complicated to install):

http://searc.4a.si:7327/search?q=news

It's also nice that the structured, never-changing HTML it produces makes it ideal for programmatically querying Google. You still run into captchas, though, which it cannot solve, if queries get too suspicious.

b1ue64 · 3y ago

Also see SearXNG https://github.com/searxng/searxng/

ocdtrekkie · 3y ago · 1 in thread

This looks great, I am really glad to see things making it more obvious how pervasive malicious Google scripts are.

I find the webp flag interesting, as I don't think webp itself is inherently harmful, except for being an image spec that solely exists because Google NIHs everything and wants to write their own everything. (Long live JPEG-XL!)

I'm curious why you chose to tag it explicitly though.

brucethemoose2 · 3y ago

I love that tag, as (to me) it indicates a site is trying to be bandwidth efficient instead of just defaulting to JPEG.

JXL is pretty much dead thanks to Google... and avif is still mostly suited to thumbnails.

danuker · 3y ago · 1 in thread

Thank you! I think any competition is welcome for search engines, with Google going down the monetization path.

A piece of feedback: When I select "Remove top ...." and click Submit, then click Next, the popularity filter is gone.

Edit: looks like the file type filter is dropped as well. Do add the arguments to the pagination links.

anthonyhn (OP) · 3y ago

>Edit: looks like the file type filter is dropped as well. Do add the arguments to the pagination links.

Thank you, great feedback! You're right, I forgot to include some of the params in the pagination, will have to include those in the next update.
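One way to avoid this class of bug is to derive the "Next" link from the current URL, so that every existing query parameter is carried over automatically. This is only a sketch: the parameter names (`page`, `filetype`, `remove_top`) are hypothetical and are not taken from Ichido's actual URLs.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def next_page_url(current_url: str, page_param: str = "page") -> str:
    """Build the next-page URL while preserving every other query parameter."""
    parts = urlsplit(current_url)
    current_page = 1  # a URL without a page parameter is treated as page 1
    kept = []
    for key, value in parse_qsl(parts.query, keep_blank_values=True):
        if key == page_param:
            if value.isdigit():
                current_page = int(value)
        else:
            kept.append((key, value))
    kept.append((page_param, str(current_page + 1)))
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Hypothetical search URL with a file type filter and a popularity filter.
url = "https://example.com/search?q=rss&filetype=pdf&remove_top=1000"
print(next_page_url(url))
# prints "https://example.com/search?q=rss&filetype=pdf&remove_top=1000&page=2"
```

Filters added later are then included in pagination links without touching the pagination code.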

flas9sd · 3y ago · 1 in thread

I see you offer an opensearch.xml already - if you embed it as link node with the appropriate type it will be straightforward to add it to the browser as (default) search engine: https://developer.mozilla.org/en-US/docs/Web/OpenSearch#auto...

also: happy to give this a try, more knobs for power users

anthonyhn (OP) · 3y ago

> I see you offer an opensearch.xml already - if you embed it as link node with the appropriate type it will be straightforward to add it to the browser as (default) search engine

Thanks for the heads up, I used to have a <link rel="search"> to the opensearch in a prior iteration of the site, must have removed it by mistake. Will add in the link in the next release.
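For reference, the autodiscovery element is a single tag in the page's head. The sketch below renders it from Python; the title and the `/opensearch.xml` path are assumptions, since the thread only mentions that an opensearch.xml file exists.

```python
from html import escape


def opensearch_link(title: str, href: str) -> str:
    """Render the <link> element browsers use for OpenSearch autodiscovery."""
    return (
        '<link rel="search" '
        'type="application/opensearchdescription+xml" '
        f'title="{escape(title, quote=True)}" '
        f'href="{escape(href, quote=True)}">'
    )


# The href is a guess at where the description file might live.
print(opensearch_link("Ichido", "/opensearch.xml"))
# prints '<link rel="search" type="application/opensearchdescription+xml" title="Ichido" href="/opensearch.xml">'
```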

superasn · 3y ago

I think the tags can be grouped like Extreme trackers, Moderate trackers, etc., and clicking on them expands the full list.

Also one really useful tag would be "Affiliate links", if there is a way to identify that a page contains affiliate links like Amazon affiliate, etc. Those pages are almost always crap.

Also a tag for "Modal popups": those are too often just marketing-related websites, and I'd definitely want to skip them if I knew before visiting.

coolspot · 3y ago

I would prefer more logical tags like “top 1k”, “aggregator”, “user-generated content” than technical like “utm” and “obfuscated scripts”. Also, I would prefer tags grouped together into expandable lists and not shown all by default. Every site uses javascript, I don’t want to see it over and over again unless specifically queried for that.

1vuio0pswjnm7 · 3y ago

The pagination keeps increasing past the point where Bing provides no more results. Testing a popular search term, for which there are no doubt millions of results, it was only possible to get new results up to page 45. Yet the website will keep incrementing the page number and result numbers as if new results are being returned.

Then tried same search with popularity set to 500000 and could not even get a single full page of 10 results. It's laughable to assume from this "search" that only, say, 500004 out of the millions of websites in existence include this term. Not that I want to browse a full list, but at least I want to know how many hits I got. Then I can add more terms and try to reduce that number.
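A front end could detect the end of the result set by tracking which URLs it has already shown and hiding the "Next" link once a page contributes nothing new. This is a sketch under that assumption; it says nothing about how the Bing API or Ichido actually signal exhaustion, and the names are made up for the example.

```python
def has_new_results(seen_urls: set, page_urls: list) -> bool:
    """Record this page's URLs and report whether any were not seen before."""
    new = [url for url in page_urls if url not in seen_urls]
    seen_urls.update(new)
    return bool(new)


# Hypothetical session: the third page only repeats earlier results.
seen = set()
first = has_new_results(seen, ["https://example.com/1", "https://example.com/2"])
second = has_new_results(seen, ["https://example.com/3"])
third = has_new_results(seen, ["https://example.com/2", "https://example.com/3"])
print(first, second, third)
# prints "True True False"
```

When the function returns False, the page counter can stop advancing instead of implying that more results exist.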

daoudc · 3y ago

This is really cool! Please consider joining forces with us at mwmbl.org, would love to incorporate some of these ideas.

jacooper · 3y ago

Brave Goggles also do something similar, allowing you to filter search the way you want.

KomoD · 3y ago

Too many tags, and if a site has something, like scripts, why do you say "may"?

If a site has scripts then it's not "This site may be using Javascript", it's for sure that the site uses it...?

And popularity filter doesn't work, the results are empty and if you try going to any of the other pages it removes the filter
