0 points · brazzledazzle · 11y ago · 0 comments

I know "good enough" is probably not just a good idea at a startup; it's possibly mandatory, since there's only so much time and money. But as a user/consumer/customer/target demographic, I can't begin to describe how much I hate knowing that something exists on a site but being unable to find it through search, particularly when I know the exact title. Reddit's search was quite bad several years ago and left a sour taste in my mouth.

1 comment · 1 top-level

rspeer · 11y ago

I'm already cringing about people in this thread talking about "language detection" and "stemming" as if there are good, easy solutions to them.

Take your favorite language detector, like cld2. Apply it to some real-world language, like random posts on Twitter. Did it detect the languages correctly? Welp, there goes that idea.

(Tweets are too short, you say? Tough. Search queries are shorter. You probably aren't lucky enough for your domain's text to be complete articles from the Wall Street Journal, which is what the typical NLP algorithm was trained on.)
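
To make the short-text problem concrete, here is a toy sketch. It is not cld2 or any real detector: the stopword lists and the `guess_language` helper are made up for illustration, and real detectors use much richer signals. The point it shows still holds, though: a two-word query often carries no evidence at all, or evidence that fits several languages equally well.

```python
# Toy stopword-overlap language guesser. Everything here is hypothetical
# and exists only to illustrate why very short inputs are hard.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "in", "is"},
    "es": {"el", "la", "de", "y", "en", "es"},
    "fr": {"le", "la", "de", "et", "en", "est"},
}


def guess_language(text):
    """Return a language code, or None when the evidence is absent or tied."""
    tokens = text.lower().split()
    # Count how many tokens are known stopwords in each language.
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores.values())
    if best == 0:
        return None  # no stopwords at all: nothing to go on
    winners = [lang for lang, score in scores.items() if score == best]
    # Only commit to an answer when exactly one language wins.
    return winners[0] if len(winners) == 1 else None


print(guess_language("the cat is in the garden"))  # full sentence, clear signal
print(guess_language("pizza hawaii"))              # search-style query, no signal
print(guess_language("la casa de papel"))          # "la" and "de" fit es and fr
```

With these made-up lists, only the full sentence gets an answer; the two query-like inputs come back as `None`, one for lack of evidence and one because of a tie.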

Stemming will always be difficult and subtle. It's useful but it isn't even linguistically well-defined, so you'll have to tweak it a lot. If stemming seems easy, you haven't looked at where it goes wrong for your use case.
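
As a minimal sketch of "where it goes wrong", assume a deliberately naive suffix stripper. The `SUFFIXES` list and `naive_stem` function below are hypothetical and not taken from any real stemmer; real ones have many more rules, but they make the same two kinds of mistake: merging words that should stay apart, and failing to merge words that belong together.

```python
# A deliberately naive suffix-stripping stemmer (hypothetical, for illustration).
SUFFIXES = ("ing", "es", "ed", "s", "e")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Over-stemming: unrelated words collapse to the same stem.
print(naive_stem("news"), naive_stem("new"))

# Under-stemming: forms of the same verb end up with different stems.
print(naive_stem("running"), naive_stem("runs"))
```

Here a search for "news" would also match "new", while "running" and "runs" would not match each other. Patching either case with another rule tends to break some other word, which is the tweaking the comment is talking about.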
