Although, am I the only one who finds that searching HN through Google gives better results than searching it through the Algolia-powered HN search?
For example, search "ml" in both: the Algolia results are years old and don't seem that relevant, whereas Google picks up more recent threads.
That's probably also why Algolia shines as a search engine for a single website rather than for the whole internet like Google. The former has a small scope (looking for video games on Twitch, a polo on Lacoste), while the latter must be personalized for the user (python the snake vs. Python the programming language).
https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...
https://www.google.com/search?q=news.ycombinator.com%3A+ml&o...
Some people say it's popularity-based, but if I change it to date-based, it's broken? https://imgur.com/a/HGzL6fO
For example, the poster below showed the following link for the "ml" query: https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que....
If you look at the results, the first 2 results only matched part of the poster's username and many of the top results here are empty threads which are not that useful since HN is mostly for discussions.
To be clear, I'm not saying it's all bad, just pointing out that there is low-hanging fruit that could improve results quite a bit.
Also, Algolia seems misspelled in your query which is why it fails when reordering by date: https://imgur.com/a/pO9fBWr, it works otherwise. Although, it says the date is 1 day old when it's just 2 hours old.
The customization options are very helpful -- search by user, by type, order stories by popularity or date, etc.
I too built a search engine:
https://demo.insideropinion.com/meta_profiles?utf8=%E2%9C%93...
I find it works 100x better for me, when I weight results by my expertise. Basically, returning results based on what I already am familiar with.
It just seems to me that HN search does simple keyword matching and reorders based on points, without any real ranking (someone please correct me if I'm wrong).
So freshness doesn't seem to have any value, but the reality is that, at least in tech, very little content is evergreen. Also, when you try to reorder by date, for example, you end up with a lot of empty threads, which again are not that useful. I'm not sure if there's a way of eliminating empty threads from search results.
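To make the idea concrete, here is a rough sketch of what "points plus freshness, minus empty threads" could look like. This is not how HN search or Algolia actually ranks anything; the `Story` fields, the one-year half-life, and the `min_comments` cutoff are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Story:
    # Hypothetical fields; a real index would expose its own attribute names.
    title: str
    points: int
    num_comments: int
    age_days: float


def freshness_score(story, half_life_days=365.0):
    """Points discounted by exponential decay: the score halves every half-life."""
    decay = 0.5 ** (story.age_days / half_life_days)
    return story.points * decay


def rank(stories, min_comments=1, half_life_days=365.0):
    """Drop threads with too few comments, then sort by decayed points."""
    kept = [s for s in stories if s.num_comments >= min_comments]
    return sorted(kept, key=lambda s: freshness_score(s, half_life_days), reverse=True)


stories = [
    Story("Old popular ML thread", points=400, num_comments=120, age_days=1460),
    Story("Recent ML thread", points=150, num_comments=60, age_days=30),
    Story("Empty ML submission", points=3, num_comments=0, age_days=1),
]
ranked = rank(stories)
for s in ranked:
    print(f"{freshness_score(s):7.1f}  {s.title}")
```

With these made-up numbers the recent thread outranks the four-year-old one despite having fewer points, and the empty submission never shows up.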
Paying per use certainly doesn’t make sense, because it has to be qualified by the accuracy you get per use.
And there’s no serious way to understand the accuracy you get per use (on your specific unusual distribution of queries) without employing the expensive ML / stats engineers you probably thought you could avoid hiring by outsourcing to Algolia / Rekognition in the first place.
But once you need to hire them anyway, you might as well utilize them to build this type of thing in-house in ways that are much more tailored and optimized around your in-house data models and data integration tools.
To put it in perspective, I’ve worked at several companies (from small start-ups to large ecommerce sites) that have a variety of search needs, spanning plug-and-play Lucene all the way to highly customized joint-embedding neural-network-based nearest neighbor search, and tons in between.
The distribution of text in e.g. the support center search use case was totally different from the product search use case or the document store use case, where highly unique word distributions, special words, frequent required updates to the search index, asymmetric costs of surfacing bad or deleted content items, etc., were the norm.
Every search use case was different and needed care to develop unique annotated result sets to measure mean reciprocal rank, NDCG, etc., as well as simple stakeholder subjective opinion of quality.
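For anyone who hasn't met these metrics, a minimal sketch of mean reciprocal rank and NDCG over an annotated result set follows. The function names and input shapes are assumptions for illustration. The NDCG here uses linear gains and computes the ideal ordering only from the results actually returned, which is a simplification of what a full evaluation harness would do.

```python
import math


def mean_reciprocal_rank(ranked_relevance):
    """ranked_relevance: one list per query of 0/1 flags, in ranked order."""
    if not ranked_relevance:
        return 0.0
    total = 0.0
    for flags in ranked_relevance:
        for position, relevant in enumerate(flags, start=1):
            if relevant:
                total += 1.0 / position  # only the first relevant hit counts
                break
    return total / len(ranked_relevance)


def dcg(gains):
    """Discounted cumulative gain with a log2 position discount."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))


def ndcg(gains, k=None):
    """DCG of the ranking divided by the DCG of the best possible ordering."""
    k = len(gains) if k is None else k
    ideal = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / ideal if ideal > 0 else 0.0


# Three queries: first relevant hit at rank 2, at rank 1, and never.
print(mean_reciprocal_rank([[0, 1, 0], [1, 0, 0], [0, 0, 0]]))
# Graded judgments (0-3) for one query, in the order the engine returned them.
print(ndcg([1, 3, 0, 2]))
```

The expensive part is not this code but producing the per-use-case relevance judgments that feed it.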
Short of basically hiring Algolia to be a gigantic consultant on all these things, I don’t see how it could actually be valuable.
I suspect it’s just an easy sell to CTO types that don’t really understand. They want “search” to be one problem with one little component to drop in to solve it, but it’s just not real.
Fundamentally, there is value to most businesses in being able to just buy a decent solution to a non core competency.
That’s where Algolia and AWS and basically all service companies come in... a medium scale clothing manufacturer with a booming e-commerce site may well know they have no clue how to do search, and no clue how to assess and hire individuals who could implement it, and no clue how to find and hire a CIO who could put together a team from scratch who could do this on a reasonable timeline.
I have one client in particular that is a stark indicator of this trend - a 50+ year old company, and their second floor, where they used to have 30+ developers and sysadmins and a server room downstairs, has now been remodelled into a break room and new offices for their new team of 5 (all awesome, replacing a ton of mediocre people who didn't get much done for a decade).
They're doing better, their products are more popular, they don't have to worry about recruiting developers + sysadmins, their current IT staff get paid better and they're saving money.
I find Algolia interesting in that they've managed to capture something that Elastic didn't - and it could be because of a prevailing wisdom similar to that of the grandparent comment.
Bingo. Exactly that. My core competency isn’t (nor do I want it to be) implementing and maintaining a search system. Same reason I use Twilio, SendGrid, and Heroku.
To even know if you’re buying a decent solution from Algolia or not, you’d already have to hire pretty much all the same staff you’d have to hire to more cost-effectively build it in-house.
I think the fundamental myth, just like with Rekognition, is that if you ship off your data and the third party trains some model (most likely fine-tuning a base model), then you’re done, problem solved.
Even for businesses where search is not a core part of their direct value proposition to customers this is flagrantly untrue.
I signed up for AWS specifically to use Rekognition. I use it to screen alerts from my security cameras. In short, Blue Iris detects motion, a Node-RED flow grabs an image and uses Rekognition to see what's in it, if there's a person detected the Node-RED flow notifies me via PushOver. This significantly reduces the false-positives that inevitably happen on windy days - I've already done a lot of work in Blue Iris on this, but passing alerts through Rekognition makes it almost perfect. Based on my testing, this reduces false-positives to zero and hasn't yet produced a false-negative.
Based on my usage I expect my costs to be ~$5/mo once I'm no longer in the free tier. This is cheaper than the person detection service that Blue Iris natively integrates with and is significantly less effort to get up and running compared to, for example, TensorFlow. I also assume that Amazon will periodically update their detection models to make it better, which is one less thing for me to worry about.
For me, all of these benefits are worth the ~$5/mo.
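For readers wondering what the screening step amounts to, here is a rough Python stand-in. It is not the Node-RED flow described above. It assumes `boto3` is installed with AWS credentials configured, and the helper names and the 80% confidence threshold are invented for illustration. `detect_labels` is the Rekognition label-detection call, and its response carries a `Labels` list whose entries have `Name` and `Confidence`.

```python
def person_detected(response, min_confidence=80.0):
    """Return True if a detect_labels response contains a confident 'Person' label."""
    for label in response.get("Labels", []):
        if label.get("Name") == "Person" and label.get("Confidence", 0.0) >= min_confidence:
            return True
    return False


def screen_image(image_bytes, min_confidence=80.0):
    """Send one camera snapshot to Rekognition and decide whether to alert.

    Hypothetical helper: needs boto3 and AWS credentials, so it is not called below.
    """
    import boto3  # imported lazily so person_detected() works without AWS set up

    client = boto3.client("rekognition")
    response = client.detect_labels(
        Image={"Bytes": image_bytes},
        MinConfidence=min_confidence,
    )
    return person_detected(response, min_confidence)


# Hand-written responses in the documented shape, standing in for real API output.
windy_day = {"Labels": [{"Name": "Tree", "Confidence": 98.2}]}
visitor = {"Labels": [{"Name": "Person", "Confidence": 97.5}]}
print(person_detected(windy_day), person_detected(visitor))
```

The notification step (PushOver in the setup above) would hang off the boolean that `screen_image` returns.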
But you’re just proving my point. It wouldn’t make sense to use Rekognition unless you had someone with skills to assess the classifier accuracy in the context of your specific problem. For example, it seems like your loss function places an asymmetrically higher cost on false negatives. (Incidentally, it’s interesting you claim it hasn’t produced a false negative ... did you watch every frame of video and make sure?)
Now replace your simple one-man operation, with its simple loss function and an amount of data you can evaluate manually, with a complex computer vision workflow: say, one where face or person detection has legal consequences for a company that sells or licenses stock photography, or an image or video search tool trying to avoid surfacing porn or pirated content. Rekognition then stops being useful, because you’ll need not just one person doing a cursory evaluation of false negatives, but a team of people building out a benchmark-like battery of automated evaluations, probably with IoU metrics in addition to classifier metrics. That team will also need to figure out how many errors it can tolerate within some cost budget, combined with the normal cost of Rekognition usage.
Basically, for some tiny hobbyist use case, I guess it’s fine (though really you could literally just load some Keras model pre-trained on ImageNet or some off-the-shelf version of YOLO and save yourself $5/mo), but the value proposition falls apart as soon as the cost function becomes a complicated business one.
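For reference, the IoU (intersection over union) metric mentioned above is simple to compute for axis-aligned bounding boxes. A minimal sketch follows, assuming boxes are given as `(x_min, y_min, x_max, y_max)` tuples. That box format and the function name are choices made here, not anything Rekognition-specific.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap rectangle, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


predicted = (0, 0, 2, 2)
annotated = (1, 1, 3, 3)
print(iou(predicted, annotated))
```

A benchmark would typically count a detection as correct only when its IoU against a hand-annotated box clears some threshold, which is where the annotation and evaluation cost comes from.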
This, according to an old friend who worked there, allowed them to really drill down into why particular search results are shown, and hence pay-per-use does actually make sense.
Many search tasks really do need machine learning, especially variations on collaborative filtering and matrix factorization. Mixed modality search often truly does need deep learning and wasn’t even really possible at a level of fidelity suitable for real use cases until maybe 10 years ago.
If Algolia was categorically omitting a whole class of possible solutions, that would be a big red flag, certainly not a reason to think they can drill down to understand search results better.
I worked once on a large ecommerce search engine that had been built with Solr, and the sort order involved crazy hand-tuned boosting scores applied to ngrams of different sizes. None of it was reproducible, nobody knew where the magic boost weights came from, and as the quality of results started to plummet, there was no way to fix it. Everyone was too afraid to modify the magic constants because even slight perturbations created stark visual errors. And this was just for a super simple non-normalized term frequency matrix with boosts. “Not using machine learning” is not at all a signal that your solution won’t end up as a black box with no interpretability.
You may simply not be able to do this at all. You might not know how to tell good ML/Stats people from bad. You might not be able to pay them competitively.
You might simply want something that's better than nothing, with "nothing" being your realistic alternative. "Expert in-house ML team" is not an alternative many companies can get, and even for the ones that could, it'll take a while. What are you going to do in the meantime?
But as an unabashed Algolia devotee, I think the value prop of InstantSearch is a no-brainer. It's worthwhile looking at the product itself, as an almost textbook example of how to package services to enterprise customers.
Just take a look at the actual business performance of Elastic.
$271m in sales for the last fiscal year. A $101 million operating loss. Pretty bad, although not extremely unusual for SaaS companies in high growth mode. So there must be great growth going on, right? No.
They added a mere $9m in sales last quarter. $89m in sales with a $42m operating loss (whoa). They added an additional $10m in operating loss and gained a mere $9m in sales.
So if they can keep up that rate of growth, they might generously add $45-$50 million in sales this year. Maybe 16%-18% growth for a company bleeding red ink that isn't particularly large in terms of sales yet (i.e. they're struggling to generate fast growth at a smallish scale). And all that needs to support trading for 20 times sales on a business that is a decade old and has never produced a profit.
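A quick check of the arithmetic above, using only the figures quoted in this comment (all in millions of USD). The variable names are made up, and the "straight-line" case simply assumes four more quarters at the same $9m increment.

```python
# Figures quoted in the comment above, in millions of USD.
annual_sales = 271.0
quarterly_sales = 89.0
quarterly_operating_loss = 42.0
quarterly_sales_added = 9.0


def pct(numerator, denominator):
    """Express numerator as a percentage of denominator."""
    return 100.0 * numerator / denominator


# Assumption: the same $9m is added in each of the next four quarters.
straight_line_added = 4 * quarterly_sales_added
growth_straight_line = pct(straight_line_added, annual_sales)

# The "generous" $45m-$50m range from the comment.
growth_low = pct(45.0, annual_sales)
growth_high = pct(50.0, annual_sales)

# Operating margin for the latest quarter (negative means a loss).
quarterly_operating_margin = -pct(quarterly_operating_loss, quarterly_sales)

print(f"straight-line: +${straight_line_added:.0f}m ({growth_straight_line:.1f}%)")
print(f"generous range: {growth_low:.1f}% to {growth_high:.1f}%")
print(f"quarterly operating margin: {quarterly_operating_margin:.1f}%")
```

So the 16%-18% range does follow from $45m-$50m on a $271m base, while a flat $9m per quarter would come in closer to 13%.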
Either they find a lot of growth soon or in the next market down cycle Elastic is worth 1/3 to 1/2 of what they're trading for now. The same will probably go for their lesser peers. The clock is starting to tick hard on these extreme valuations (hello WeWork, Uber, Lyft).
Algolia, because it doesn't have an open source offering, doesn't have that issue.
but I think mission-critical search is a niche technical problem that isn’t applicable to many enterprises.
change sort by popularity to sort by date -> ~30k results
I don't really understand how sorting can affect the number of results. Btw, YouTube search does this too.
- End-to-end search:
Algolia's offerings span both front-end (InstantSearch drop-in widgets) and back-end (actual search API). Simple applications can be built without ever talking to the Algolia API directly, because their widgets do it for you.
Atlas is all back-end - just a DB service with FTS on the side; left to you to integrate front-end.
- Configuration:
Algolia's dashboard GUI is where a lot of the configuration is done. Some configurations are not available at all via APIs. It's relatively simple.
Atlas requires more JSON-type configuration entries, and some knowledge of Lucene internals.
- Text analysis:
Algolia's text tokenization pipeline is mostly a black box but works fine most of the time. It exposes only a few settings, like ASCII folding. It's fine for normal dictionary words, but has problems with domain-specific text (for example, people/place names, scientific terms, etc.).
Atlas exposes many aspects of Lucene's analysis pipeline, but it does require knowledge of Lucene.
- Multilingual support:
Algolia supports all its features for ~70 languages.
Atlas analysis has to be configured separately for each language.
- Query syntax:
Algolia defaults to simple queries but the API supports a more complex query syntax with boolean operators and such.
Atlas has its own JSON query DSL that's related to Lucene's query syntax capabilities.
- Faceting:
Algolia faceting configuration and API are far simpler than Atlas's DSL.
I can’t think of any other SaaS business that has invested in and executed such a smooth onboarding and retention ecosystem.
I’ve used them for small sites and large enterprise clients (10B+) and I’ve always felt like I’ve got way more than I’ve paid for.
PSA: Algolia has basically hidden a category making/taking over strategy in plain sight.
Now if Firebase could only buy them out and add decent search to their suite of products that would be swell. Mind boggling that Firebase - part of Google - still lacks a decent search solution.
https://stackshare.io/posts/how-algolia-built-their-realtime...
The homepage is great from a developer perspective, select your backend on the left, frontend on the right, and you get an idea right away of what a basic implementation looks like. Very clever.