We’ve found that most technical searches fall into a few categories: ad-hoc how-tos, understanding an API, recalling forgotten details, research, or troubleshooting. Google is too broad and too shallow a search tool to be good at these. Even after sifting through the deluge of spammy, irrelevant sites pumped full of SEO, you still have to manually find your answer through discussion boards or documentation. Its “featured snippet” approach works for simple factoid queries but quickly falls apart if a question requires reasoning about information across multiple webpages.
Our approach is narrow and deep: we retrieve detailed information on topics relevant to developers. When you submit a query, we pull raw site data from Bing, rerank it, and extract understanding and code snippets with our proprietary large language models. We use seq-to-seq transformer models to generate a final explanation from all of this input.
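To make the retrieve, rerank, extract, generate flow concrete, here is a minimal sketch. It is not Hello's implementation: plain word overlap stands in for the learned reranker, joining sentences stands in for the seq-to-seq generation step, and every name in it (`Page`, `rerank`, `extract_passages`, `answer`) is hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Page:
    """Raw text fetched for one search result (hypothetical container)."""
    url: str
    text: str


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rerank(query: str, pages: list[Page]) -> list[Page]:
    """Order pages by word overlap with the query.

    Assumption: a toy lexical score standing in for a learned reranker.
    """
    q = tokenize(query)
    return sorted(pages, key=lambda p: len(q & tokenize(p.text)), reverse=True)


def extract_passages(query: str, page: Page, limit: int = 1) -> list[str]:
    """Keep the sentences of a page that share the most words with the query."""
    q = tokenize(query)
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", page.text) if s.strip()]
    sentences.sort(key=lambda s: len(q & tokenize(s)), reverse=True)
    return sentences[:limit]


def answer(query: str, pages: list[Page], top_k: int = 2) -> str:
    """Rerank, extract, then compose.

    A real system would pass the passages to a generative model here;
    this sketch simply joins them.
    """
    passages: list[str] = []
    for page in rerank(query, pages)[:top_k]:
        passages.extend(extract_passages(query, page))
    return " ".join(passages)
```

The point of the shape is that each stage narrows the input for the next one, so the final generation step only sees a handful of passages rather than whole pages.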
For our honors theses at UT Austin, we researched prototypes of large generative language models that can answer complex questions by combining information from multiple sources. We found that GPT-3, GPT-Neo/J/X, and similar autoregressive language models that predict text from left to right are prone to “hallucinating” and generating text inconsistent with the “ground truth” document. Training a sequence-to-sequence language model (T5 derivative) on our custom dataset designed for factual generation yielded much better results with less hallucination.
After creating this prototype, we started actively developing Hello with the idea that searching should be just like talking to a smart friend. We want to build an engine that explains complex topics clearly and concisely, and lets users ask follow-up questions using the context of their previous searches.
For example, when asked “what type of semaphore can function as a mutex?”, Hello pulls in the raw text from all five search results linked on the search page to generate: “A binary semaphore can be used as a mutex. Mutexes and semaphores are two different types of synchronization mechanisms. A mutex is a lock that prevents two threads from accessing the same resource at the same time. A semaphore is used to signal that a resource has become available.” We're biased, of course, but we think that the ability to reason abstractly about information from multiple web pages is a cool thing in a search engine!
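The claim in that generated answer, that a binary semaphore can be used as a mutex, can be illustrated with a short Python sketch. This is an illustration only, not something Hello produced; the names `lock`, `counter`, and `add_many` are made up for the example.

```python
import threading

# A semaphore initialised to 1 admits at most one holder at a time,
# which is the mutual-exclusion behaviour expected from a mutex.
lock = threading.Semaphore(1)
counter = 0


def add_many(n: int) -> None:
    """Increment the shared counter n times, one thread at a time."""
    global counter
    for _ in range(n):
        with lock:        # acquire() on entry, release() on exit
            counter += 1  # critical section


threads = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 40000: every increment ran under the semaphore
```

One difference worth keeping in mind: a semaphore has no owner, so any thread may release it, whereas an owned lock such as `threading.RLock` can only be released by the thread that acquired it.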
We use BERT-based models to extract and rank code snippets if relevant to the query. Our search engine currently does well at answering applicable how-to questions such as “Sort a list of tuples by the second element”, “Set a response cookie in FastAPI”, “Get value of input in React”, “How to implement Dijkstra's algorithm.” Exclusively using our own models has also freed us from dependence on OpenAI.
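For reference, the first of those how-to queries has a short, well-known answer in Python. The snippet below only illustrates what a good result for that query looks like; it is not output captured from Hello, and the sample data is made up.

```python
pairs = [("pear", 3), ("apple", 1), ("fig", 2)]

# sorted() returns a new list; the key function picks the second element.
by_second = sorted(pairs, key=lambda pair: pair[1])
print(by_second)  # [('apple', 1), ('fig', 2), ('pear', 3)]

# list.sort() does the same in place; reverse=True gives descending order.
pairs.sort(key=lambda pair: pair[1], reverse=True)
print(pairs)  # [('pear', 3), ('fig', 2), ('apple', 1)]
```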
Hello is and will always be free for individual devs. We haven’t rolled out any paid plans yet, but we’re planning to charge teams per user/month to use on internal data scattered around in wikis, documentation, slack, and emails.
We started Hello Cognition to scratch our own itch, but now we hope to improve the state of information retrieval for the greater developer community. If you'd like to be part of our product feedback and iteration process, we'd love to have you—please contact us at founders@sayhello.so.
We're looking forward to hearing your ideas, feedback, comments, and what would be helpful for you when navigating technical problems!
* I won't use a different search engine for programmers stuff vs everything else. So while this might be targeted toward software developers I can't see myself using it unless it can handle normal searches.
* UI/UX - I hate the progress bar, I'm not sure at all what it's telling me as there are results shown while it's still completing. The results are way too spaced out. On my 27" 2K screen I can only see 3 results, the search bar takes up way too much space and there is way too much padding on the results. Don't move the DOM on me, removing the progress bar is jarring as is "Was this answer helpful?" popping in, I'm here for results, not to train your ML.
* Trackers - Using the default installs of Privacy Badger and uBlock Origin meant no results ever loaded. I'm not sure what was being blocked that caused the issue but cookies from bing [0] and a request to cloudflareinsights.com [1] should not hamper showing results.
Search is a tool and one that I need to be quick, simple, and informationally dense. This checks almost none of those boxes. I'm even open to using a different search engine (I semi-recently switched from Google to Ecosia and it's been near-seamless), but I don't see any "pro" to using this engine and I see a ton of "cons".
[0] https://cs.joshstrange.com/V5uiyM
[1] https://cs.joshstrange.com/GkPuap
EDIT: I did a few more searches because I realized I wasn't getting the "info box"/ML results on my first few searches and I wanted to be fair. Sorry but that made me dislike this even more. I really, really hate content moving out from under me. My eyes start reading one of the 3 results that was shown then they got pushed down for another overly-padded box that tried to "answer" my question. The results were worse than "grab the selected answer from the first SO that matches this query". Maybe it would be better if that info was shown off to the side and didn't move the results when it loaded in but again, I didn't find it useful in the queries where it showed up.
EDIT2: I posted a follow up comment about what, specifically, I think should be changed: https://news.ycombinator.com/item?id=32005841
I'm more than willing to open another tab to not have a search result page full of YouTube videos.
Also I've never seen a "search result page full of YouTube videos" no matter what the query was. Sometimes there will be 3 or a carousel of them near the top but I can easily ignore that (assuming they aren't relevant or useful to me). I can't remember the last time I got a video for a code/technical query on google, I just did some testing and only a few queries showed videos, always 3, always partway down the page so that the search results at the top answered what I needed before I even got to the videos.
To the SayHello team: kudos for being faster than Google to release a Q&A+search system. I was expecting something like this for a couple of years wondering why Google was sleeping on its mountain of papers and not doing it.
Search was the first step in finding information, Q&A is the next logical step. Language models+search such as DeepMind RETRO have shown this approach to be very efficient: 25x reduction in model size for the same perplexity and verifiable correct answers with source document references.
In the future I expect search to become more like an assistant with context and language abilities. Retrieving a bunch of web pages is so 2000's. Q&A is especially relevant for mobile use with speech interface (hello Siri and Google Assistant).
From the creators:
> We're building a better search engine for software developers.
Also no you can't "talk to it", I'm not sure where you got that idea. It has a "Ask a follow up" but that performs a new search with none of the context of your previous search (also this UI of sliding a modal up from the bottom and layering the results is terrible).
> Search was the first step in finding information, Q&A is the next logical step.
And we are clearly not there. Not only does this not allow you to ask follow-ups to refine but it doesn't give good results in my testing.
I'd use this as a ddg bang[1]. I don't use them often, as ddg is a great search engine, but some search engines handle certain queries better and ddg lets you route queries efficiently.
I'd love to hear more about what you mean by "informationally dense" -- some search engines simply show more information on the results page, but that doesn't make results inherently better in my opinion because it frequently simply increases noise relative to signal.
Our current approach is to provide only the most relevant answers/code snippets and nothing else (high signal with low noise) as opposed to cramming in every Stack Overflow answer we can find. We realize we still have a long way to go to make it magical for every search, but we're working on it.
My suggestions:
* Kill the padding/margins, it's pretty for demos or certain cases but I want to be able to see more information, heavy padding/margins have no place in search results.
* Shrink the search bar to the upper left like every other search engine. Keeping it centered with tons of padding wastes space. Take your logo and put it to the left of the search field, take the buttons and put them to the right. On my screen you are burning a little over 500px of vertical space with things that don't matter, the results matter.
* Shrink your "regular" search results to be half the width of the screen (on desktop, something like a max of 700, Google uses ~640 as does Ecosia). Use the space to the right to show your AI/ML results. This means no content will jump around and people can more easily read the results, full-width is very hard to read. Also shorten the "description" under the links. 2 lines max (at 640px width).
* Either don't ask "Was this answer helpful?" (use hints like: Did they click the link? Did they leave the site after seeing the results?) OR don't make it move the content (hold the space empty if you must animate it in, just don't let the content shift multiple times after doing a search).
Here is your default result for "this is a test" search query: https://cs.joshstrange.com/oKbz6G
Here it is with a bunch of padding/margins removed: https://cs.joshstrange.com/VEVXGh
Yes, I removed the logo/buttons because that was faster than moving them to the left/right of the search but the end result is the same. In my tightened up version you can fit 8+ result links where the initial version could only show 3, also all the results are easier to read.
But if the "automatic" answer fails and I need to skim results, as I'll often need to do, you put 3 result previews in a space DDG and Google fit 5. They also apply reasonable defaults for the max line length - a basic typography thing that improves quick readability a lot.
* juice
* v8 engine
* juice
* v8 engine
* juice
so definitely some non-programming searches showing up, unfortunately none of the documentation sources for v8.
All that said, the UI/UX is too frustrating to use (as-is) even if they don't promote programming content over non-programming content.
* we also needed to build a strong "everything else" search engine and then
* have great results for coding with specific search apps like StackOverflow, AI code complete, ++
* be very fast (we messed that up when we first launched)
* have great scores on Privacy Badger, be compatible with uBlock, etc.
Last week we started opening up our platform to collaborate on results with outside developers and have gotten a lot of interest: https://about.you.com/developers/
Maybe we can collaborate also with you guys at sayhello. Ping me at hey@you.com if you want to compare notes.
I searched for the following on sayhello.so:
"Service worker fails on request for audio file"
I got back a couple of results related to general service worker use, but none that got close to discussing the core problem that led to the solution.
The same query in Google returns several results that together pointed me to the solution (it was around range headers in requests for media data types).
This is just one example though. I think the problem you are trying to fix is worth the effort. I just wonder if this is where humans are still stronger than computers - gathering unstructured data to use in problem solving.
Then again maybe that's just me.
It would be amazing if this could be used for internal documentation however. Like we have so much documentation on our wiki which is just disorganised.
Also, stack overflow's search has always sucked. The way to find stuff on stack overflow has mostly been to use google.
Is that legal?
Isn't there copyright on those?
So I only see one of two outcomes:
1. Courts rule copilot is fair use in which case your search engine becomes largely superfluous
2. Courts rule copilot is infringement in which case all of these types of applications cannot be used commercially
1. Copilot itself infringing licenses (MS copying and sharing copyrighted code)
2. Developer infringing licenses (Allowing code from MS into own codebase).
Case 2 is avoided by Hello, because it provides a link to the original, allowing the developer to find and respect the license. Therefore Hello is net superior (with respect to people using the service at least).
> Hello pulls in the raw text from all five search results linked on the search page to generate...
Not to be negative but I think I'll stick to the sites and people that made the results and not a middleman that intends to charge for other people's work.
Dark mode is not a core value proposition.
(my guess is that) The logo and search bar take up a lot of space because they are mimicking the design of the Google.com landing page.
It seems like the bulk of their work has been on the search itself, so I would forgive them on logo and branding. It’s an early product so logo and branding can change.
For now, they just need constructive feedback on workflow and usability.
Well, everybody is different. I just hate dark mode.
When I come to a website that defaults to dark mode and I can't see a way to change it, I leave immediately.
It may be a weird suggestion, but if the query to general topics returns something like this https://unzip.dev/archive (check how compact it is and delivers almost all you need to know about the subject to get you going), it would be perfect.
Query: how to base64 encode a string in ruby
Response: I'm not sure what you mean by "base64 encode a string in ruby" - that's a bit of a misnomer. Base64 encoding is a way of storing data in a form that can be decoded by a human. It's not a secure way to store data, but it's useful if you want to send a message to someone who doesn't understand the language you're using.
The right answer is in the third link provided but it's not exactly correct.
Google gives back the Ruby Module Base64 docs as the first hit.
- stackoverflow's UI actually serves well to provide a sort of "ambient" information that rapidly indicates not just the best answers, but the best most-recent answers. Oftentimes, and especially in rapidly-evolving dev languages/frameworks, what was the best answer a few months ago may no longer be the best answer, and the ability to rapidly scan the comments that would indicate this is valuable.
- in addition, those stackoverflow comments and the links within them can point to additional info that can save the dev time (potentially pointing to the dev misidentifying the problem: "don't do this, this is the real issue <link>").
I think with the traditional google->stackoverflow or google->[some documentation site, forum, etc] user flow you actually get layers of ambient cues as to relevance, recency and quality that we've grown accustomed to. Even if your product ultimately serves better answers, I'd worry that lacking these cues would make a user like me feel as though I'm blindly trusting an answer that seems to have come from the ether (sort of like github copilot).
As low-hanging fruit maybe adding level-meters beside each result that indicates these dimensions could help (like npmjs.com does with npm pkg results in their ui).
I love the product idea and it looks like a strong start! Good luck!
One feature request at first glance: please default to the system font stack for code snippets. I see you're currently using Consolas, a Microsoft typeface, which is not pleasant to see as a mac user.
You can use this to default to the system font on every platform:
font-family: "SF Mono", "Monaco", "Inconsolata", "Fira Mono", "Droid Sans Mono", "Source Code Pro", monospace;
Let's say I'm searching for front-end frameworks. Each article has the word "best" in the title, yet doesn't link to resources like State of JS, Stack Overflow Survey or other similar sites. So, in this context "best" is subjective. I can't be bothered with subjective results when I'm trying to find out what is actually considered "best" or in this case popular.
https://beta.sayhello.so/search?q=Java+aot+compile
Does not seem to mention graal anywhere. (It's just a random test query that popped into my mind)
Asking a full question for a code snippet seems to work: https://beta.sayhello.so/search?q=How+do+I+sort+a+map+in+Jav...
How do you deal with licensing for these snippets though. Is that up to the user to verify?
It is currently up to the user to verify licensing for the snippets, but we try to make it easy (using the See Reference button) to go to the original source.
"meta programming python" does not give as good results as
https://beta.sayhello.so/search?q=meta+programming+python
"how to implement a meta class in python"
https://beta.sayhello.so/search?q=how+to+implement+a+meta+cl...
https://beta.sayhello.so/search?q=hello+world+in+brainfuck
Nice idea for the project though. Good luck with it
The term hallucinating is brilliant for how these AI systems seem to generate output.
Your product is very interesting and seems to work nicely on easy queries such as "how do I sort an array of objects in JavaScript", but it was quite confusing for complex queries.
The UI doesn't work too well on mobile, but it's a beta and software is written on the desktop.
I also think making this a specific search engine for companies internal messy data would be a very useful tool as well.
I wonder what you think about that. Maybe one could submit a code snippet, or mark something as an error, or ask for a refactor of some code. But then again, this gets close to what copilot is doing.
TypeError: N.at is not a function. (In 'N.at(-1)', 'N.at' is undefined)
My co-founder and I were building the same product as you are some time ago [1]. We managed to scale it to around 5k WAU before we decided to pivot for various reasons.
If you think there might be any useful information and experience we could share with you, please shoot me an email - vasek@usedevbook.com. I'd love to help in any way I can to help you guys succeed.
I've played around just a bit and clicked some of the preset examples and like what I'm seeing so far. I bookmarked it and will try it out more as I code over the next few days.
Main initial feedback: I'd really like to see version/last-updated-at info accompanying all results. One of the biggest problems with Google for code stuff is finding outdated examples and docs. Even better would be a dropdown that lets me see results depending on the version of the language/framework/tools I'm using.
How do you see navigating this space when this can be considered a nice to have versus a strict need?
Maybe half the time I know what I want (eg: the order of values in the animation CSS property), and from who (eg: MDN), so I just go to the relevant docs page via google with something like “MDN animation css”.
The other 50% of the time, I’m searching an exact error string, probably in quotes, on google. For that I also don’t really want a knowledge graph answer, I’d much rather see a GitHub issue or stack overflow post and I’ll derive the context I need.
We currently do well at answering "ad hoc how-to" questions and are working to improve our answers in the other categories. This could either look like augmenting our existing natural language answer, or building a separate view specializing in official documentation or errors.
[0] https://elisehe.in/i/googledrivendevelopment.pdf
[1] https://bootcamp.uxdesign.cc/the-hidden-insights-in-develope...
"fhir appointment spec
I'm not sure what you're asking about, but I'll try to answer it as best I can. __ Appointment is a FHIR data type. It's a way to describe a time slot for a patient to be seen by a healthcare provider. Appointments can be booked, cancelled, rescheduled, or canceled and rebooked. It can also be used to describe the location of the appointment. "
Pretty impressive summary given that it doesn't exist in any one specific page.
It would be a good idea to preserve whitespace, or arguably better, integrate optional syntax formatting
Overall, this search engine looks promising
The animations and page jumpiness are a bit off-putting and slow, but it is a beta!
I applaud you for trying to make a new search engine - it's not something sane people would try to do because of a certain behemoth eating everyone's lunch. It's going to take extraordinary insight and out of the box thinking to get something really good.
Here's a rather trivial example: Q: "Who founded Y Combinator?" A: "Paul Graham founded Y Combinator with Jessica Livingston and Trevor Blackwell."
If you scroll to the bottom of the answer page to ask a follow-up question: Q: "How old is he?" A: "Paul Graham is 57 years old. He founded Y Combinator in 2005."
This is an amazing product, btw. Let me know if you're looking for people to hire :)
And thank you :) it's comments like this that really fire us up
Also, I noticed in your palindrome reference example, it didn't choose the accepted answer from Stack Overflow. How did it choose the example? Also, the 2nd 2 reference panes, I can't tell what value they are adding. They seem like a list of random outputs of the ispalindrome script.
Showing an answer written by a human as a part of the code snippet is also a good idea.
The same is with any input, even with predefined ones; the progress bar gets to the end (slowly) and nothing happens.
Firefox works.
When I compared the output between Hello and Bing, the filtering worked pretty well. It removed most of the StackOverflow results, which are 99% of the time not insightful.
Great job on this search engine and congrats on the release.
That's strange. I searched the same term (as well as let archive.is do it[1][2]) and results are similar. A workshop paper available on GitHub, a Springer Link reference entry, and a GitHub project. Indeed Hello has a single blog post whereas Google has lecture notes but there isn't a difference in content since the blog post is written in same style (essentially being a tl;dr version of the paper that introduces the methods).
[1]: https://archive.ph/6SkWZ (Hello)
[2]: https://archive.ph/90jY4 (Google)
I know the first search term was unrelated, but I tried it for some time as a regular search engine, with a bunch of random keywords outside what it was meant for as well as some that could be legitimate questions.
I am liking the product quite a bit however. Good stuff.
https://beta.sayhello.so/search?q=Immediately+Invoked+Functi...
If I specifically list Python/Javascript the first couple results are not even in that language, 3rd/4th are. And you have to click link/see reference to even see the language
You would think if your language is included in the query it should be heavily prioritised
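One way to picture that kind of prioritisation is a rerank step that boosts results whose language matches a language named in the query. This is a sketch under assumptions, not how Hello ranks results: the `Result` type, its `language` tag, the `KNOWN_LANGUAGES` set, and the `boost` factor are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: a small fixed vocabulary of language names to look for.
KNOWN_LANGUAGES = {"python", "javascript", "java", "ruby", "c++"}


@dataclass
class Result:
    title: str
    language: str  # hypothetical tag attached to each result
    score: float   # base relevance from the underlying search


def language_in_query(query: str) -> Optional[str]:
    """Return the first known language name mentioned in the query, if any."""
    for word in query.lower().split():
        word = word.strip("?!.,")
        if word in KNOWN_LANGUAGES:
            return word
    return None


def prioritise(query: str, results: list[Result], boost: float = 2.0) -> list[Result]:
    """Sort by relevance, multiplying the score of language-matching results."""
    lang = language_in_query(query)

    def adjusted(result: Result) -> float:
        if lang is not None and result.language == lang:
            return result.score * boost
        return result.score

    return sorted(results, key=adjusted, reverse=True)
```

With `boost=2.0`, a JavaScript result scored 0.6 would outrank a Go result scored 0.9 for a query that mentions "javascript", while a query that names no language leaves the base order untouched.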
Could you elaborate more on this or point to a paper/benchmark results?
Some of my results with code examples looked awfully similar to GitHub CoPilot output.
Is that being used to generate results sometimes?
We actually do have a code generation model similar to CoPilot but it's not active yet on the backend. All of the code snippets you see are pulled from other websites.
Overall, our goal is to have the highest signal-to-noise ratio of any search engine when it comes to developer searches.
Once in a blue moon I need to remember what the syntax for C++ explicit template instantiation is. All I need is a short snippet showing me an example of the syntax, but usually this means asking google and then trawling through several tangential SO questions ("why would one use this feature?") or scrolling through cppreference until I am reasonably confident I am looking at a valid example. This, to me, sounds exactly like the use case you are targeting.
Here is the kind of output that would have been meaningful/useful to me (although I was only looking for the second part, since I already knew the theory just not the syntax):
An explicit template instantiation definition (usually placed in a source file) makes the compiler instantiate the template for the given arguments. An explicit template instantiation declaration (usually placed in the header) tells the compiler that the template will already be instantiated elsewhere, so implicit instantiation can be skipped.
template <class T>
class Foo {};
// Explicit template instantiation declaration:
extern template class Foo<int>;
// Explicit template instantiation definition:
template class Foo<int>;
I tried six different queries of the form "C++ explicit template instantiation [declaration|definition] [syntax]". In all cases, the synthesized explanation was either gibberish or flat out wrong. For example:
> Explicit template instantiation is a feature of C++11 that allows you to declare a template as a class, rather than a function. This means that if you want to use the template in a program, you don't have to declare it in the program itself, but rather in the template file. [Entirely nonsensical]
> Explicit template instantiation definitions can be put into header files, but they can't be put in source files. [Aside from not actually explaining the feature, this is more or less exactly backwards!]
The best it managed to output was a mediocre explanation of what a template is. I mean, it is about what I would expect a language model to interpolate from e.g. the StackOverflow corpus of questions tagged "C++" and "template", but it is a very far cry from being useful.
The quality of the code snippets was better, but still not at a usable level. Among outputs like `mytemplate.cpp`, `extern template` and compiler error messages, some snippets did correctly employ the syntax. It's clear that the model is selecting query-related code from query-related questions/tutorials, but it's still very hit and miss and not very focused. In my case, it certainly didn't "understand" the declaration/definition distinction, and even for most "good" snippets you first had to picture a bunch of surrounding code to make sense of it.
I'd say from a technical level the code snippet output is certainly impressive. But at this point I would have no reason to use your search engine over others for this narrow task of recalling the correct syntax for a language feature (or at least this one in particular - maybe it is too deep), because it involves just as much comparing snippets from various sources as opening the top three google results would - except without having any of the context. And if I didn't already know the topic quite well, none of the outputs (or even all 18 of them together) would have made me confident enough to say "ok, got it, this is the syntax I need to use".
Hello primarily does well for "how-to" questions at the moment; it's still early and we're working to improve results for the queries you've described.
[0] https://static.googleusercontent.com/media/research.google.c...
[1] https://bootcamp.uxdesign.cc/the-hidden-insights-in-develope...
>The searches we anonymously log be used to improve our product.
>Your data will not be shared with any third party unless we are required to respond to subpoenas, court orders, or legal process, to establish or exercise our legal rights or defend against legal claims.
>We will never sell your data to any third party.
The first sentence is at odds with the second two.
If you say you are only collecting query and query response data, but then assure me you aren't selling my data, I can't help but wonder which is true:
1) The site is actually collecting data, but is not selling it.
2) The site is not collecting data, and the privacy page is outdated/incorrect/inaccurate.
The same question obviously applies to court orders and the like. Which is it? Do you store data that would be material to me if the site was presented by court order, and that is why you have given me a disclaimer? Or do you not store data, and so the disclaimer is meaningless?
1. In my work (also at UT actually: Hook 'em), we've found that the hallucination problem is, in part, lessened by over-parametrizing the model. Places that have the budget to do this have noticed that the performance of ml4code transformers increases linearly for every 1e3 increase in the number of parameters (with no drop off in sight). Love to hear your thoughts on this.
2. I'm concerned that finding code snippets from a short-form query underspecifies the problem too much and may not be the best user-interaction model. Let's compare your system to something like GitHub Copilot. I pass a query:
> how to normalize the rows of a tensor pytorch
With GitHub Copilot, I can demonstrate intent in the development environment itself with an IO example / comment / both and interact more efficiently. If I see errors in the synthesized snippet, I can change the query in under a second, and so on. This is hard with a search engine style interactive environment. For this query, I had to navigate to the website, type in the query, check the results (which were wrong for me, btw; y'all might need to check the correctness of the snippets), copy back the result, maybe go to the relevant thread and parse it more closely, etc. A good question to keep in mind here would be how to make this process more interactive.
3. Finally, I just want to say that the website is phenomenal, even on mobile. Kudos on the frontend/backend/architecture side of things.
Also, don't let my or anyone else's comments take away from the awesome work y'all have done!!! I pulled out that example from a paper I read recently called TF-coder. They have a dataset of these examples as part of their supplement material. All the best!
Do you have any idea of how you're going to go the "enterprise integration" route without hiring an army of implementation consultants?
best of luck, I'm sure there are many teams on Confluence that wish they had a functioning search without moving everything off Confluence at once
>I'm not sure what you mean by "load image in black and white ttf c++", but I can give you an example of how to do it. First, you'll need to convert the image to grayscale. You can do this by using the cvtColor function. Then, you can apply a binary threshold to the grayscale image. For example, if you have a color image with a value of 255 and a black value of 0, then you can do something like this: __
First impressions - really impressive! I did have to add "c++" to get any meaningful results though.
What would be useful is if you could present information from different docs sites in the same format and combine them.
Maybe that is a different startup idea completely but it would be cool to have MDN, React and Firebase docs in one place, brutalist style where I can quickly get to what I want.
I searched for "how to set up preact with vite", and I got a passage that sounded incredibly condescending, and half-way through it ignored the p in preact and started talking about react instead- but I was impressed that I couldn't work out if it was directly lifted from a stackoverflow reply, or it was generated in the style of one.
Congrats on the launch! I can vouch that this is indeed a problem devs face, especially new devs. The number of times I've had to steer my students at Edyst away from SEO-optimized articles!
Congrats on the Launch!
Edit: Privacy badger blocks api.bing.microsoft.com which breaks the search
Is this just forwarding the query to bing? How is it different to duck duck go?
IMO we need a code search engine that knows when to use or not use a lexeme index for each word or phrase.
This is something that will make me go back to Google.
Clicked on the result.
Then a blank empty page was shown.
Anyway, just want to comment the product niche. Vertical search engine? Interesting. Will we see another vertical for a search engine product?
#myDiv{ margin:0px auto; }