I think this is false, but if you're reading quickly, it sounds cogent enough. As Sarah Constantin observed about GPT2 [0]:
> if you skim text, you miss obvious absurdities. The point is OpenAI HAS achieved the ability to pass the Turing test against humans on autopilot...The mental motion of “I didn’t really parse that paragraph, but sure, whatever, I’ll take the author’s word for it” is, in my introspective experience, absolutely identical to “I didn’t really parse that paragraph because it was bot-generated and didn’t make any sense so I couldn’t possibly have parsed it”, except that in the first case, I assume that the error lies with me rather than the text. This is not a safe assumption in a post-GPT2 world. Instead of “default to humility” (assume that when you don’t understand a passage, the passage is true and you’re just missing something) the ideal mental action in a world full of bots is “default to null” (if you don’t understand a passage, assume you’re in the same epistemic state as if you’d never read it at all.)
[0] https://www.skynettoday.com/editorials/humans-not-concentrat...
As a programmer, I find R hard to use and not very well designed, so I can see why you'd call that nonsense.
But when I was a math student, I found that in some ways R does have "a more mature package universe". There were many math algorithms that I could find packages for in R and not in Python, even as a mere grad student.
Why? It doesn't sound too arrogant to me.
My experience with ChatGPT thus far is that it is as intelligent as a very broadly read person who just doesn't reeeally get the complex or nuanced aspects of the content it reads, much like many real humans.
"After eighteen years of being a professor, I’ve graded many student essays. And while I usually try to teach a deep structure of concepts, what the median student actually learns seems to mostly be a set of low order correlations. They know what words to use, which words tend to go together, which combinations tend to have positive associations, and so on. But if you ask an exam question where the deep structure answer differs from the answer you’d guess looking at low order correlations, most students usually give the wrong answer."
If he wanted it to replicate that classic HN feel he would either have to extend the prompt with additional examples or, better yet, use finetuning.
I guess he could also just randomly sprinkle in some terms like 'stochastic parrot' and find a way to shoehorn Tesla FSD into every conversation about AI.
First sentence of the first paragraph on OP's page
EDIT: it's a bit misleading, further down they describe what looks like a semantic-search approach
https://sonnet.io/posts/emotive-conjugation/#:~:text=I%27m%2...
My conclusion was that you can use LLMs to automate and scale attribution bias.
We did it guys!
> iii. Put top matching comments into a prompt and ask GPT-3 to answer the question using the context
It depends on the prompt used to ask GPT the question. A prompt that instructs GPT to write like an HN poster should fix that.
The thing people don't realize is that right now there is a very large gap between the capabilities of a few models, including OpenAI's most recent ones, and most other LLMs. So there are several options for training or fine-tuning open models, but none of them have language understanding and generation capabilities at the level of those new OpenAI models.
As far as I know.
It's very difficult to otherwise fine-tune existing LLMs. GPT itself is closed-source and doesn't allow fine-tuning (except via an opaque API and with limited amounts of data). Other open models are either very difficult to load in memory or simply not as expressive as GPT.
Military budgets and manpower would make sock puppets interesting. https://www.theguardian.com/technology/2011/mar/17/us-spy-op...
Does anyone find it curious that Edward Snowden is mute on the military's use of this technology?
How many people would be surprised to learn they are communicating with a program on websites like this?
Question: what are the biggest fallacies of people that post on hackernews?
Answer: One of the biggest fallacies of people that post on Hacker News is the nirvana fallacy, which is when someone jumps from "thing x is imperfect" to "thing x is worthless" without considering any of the benefits. People also tend to be familiar with certain technologies and will discount the flaws of existing technology vs. the flaws of new technology. Another fallacy is the assumption that there is one best tool for doing things - the one that supports the largest amount of use cases. Additionally, people may use appeals to authority, invent their own solutions rather than use existing ones, and assume that the approaches they first learned are most correct. Finally, people may assume open source software is a gift and that dependencies don't need to be tested.
Is that really a problem? It's only natural to get accustomed to old problems caused by old tech (where workarounds probably already exist or limitations understood and accepted) while it makes sense to hold new tech to a higher standard since it has the benefit of learning from past mistakes and should avoid introducing obvious problems for people in the future to deal with. Our future selves will thank us for being more critical of flawed new tech than we are of old tech.
(edited for clarity)
See a demo on the huggingface transformers documentation: https://huggingface.co/spaces/jerpint/buster
The only other project that I've seen that's doing something close to this is this one: https://github.com/getbuff/Buff
It's a bit more similar to the OP's bot (it's a Discord bot).
Cool to see momentum in this space!
HN data is indexed with embeddings for semantic search. When queried, it finds the closest article and top comments, then summarizes them with GPT-3.
GPT-3 serves as a rendering tool for compressed comments.
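A minimal sketch of that retrieve-then-summarize flow. Everything here is illustrative: the comment corpus is made up, and the bag-of-words `embed` is a toy stand-in for a real embedding model (the actual bot reportedly used OpenAI embeddings and Pinecone).

```python
import math
from collections import Counter

# Toy stand-in for a real embedding model: a bag-of-words vector.
# The real pipeline would call an embedding API instead.
def embed(text):
    return Counter(text.lower().replace("?", "").split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical mini-corpus of HN comments, pre-embedded into an index.
comments = [
    "Use pandas for data wrangling in Python",
    "R has a more mature package universe for statistics",
    "Haskell is the correct answer to every language question",
]
index = [(c, embed(c)) for c in comments]

def top_matches(question, k=2):
    """Semantic search: rank indexed comments by similarity to the question."""
    q = embed(question)
    return [c for c, v in sorted(index, key=lambda cv: -cosine(q, cv[1]))[:k]]

# The matched comments become the context GPT-3 "renders" an answer from.
context = top_matches("Should I learn Python or R for statistics?")
prompt = "Answer like an HN commenter, using this context:\n" + "\n".join(context)
# Final step (not shown): send `prompt` to a completion model such as GPT-3.
```

The completion model never sees the whole corpus; it only paraphrases the handful of comments the search step surfaces, which is why it behaves like a "rendering tool".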
> “AskHN” is a GPT-3 bot I trained on a corpus of over 6.5 million Hacker News comments to represent the collective wisdom of the HN community in a single bot.
I'm assuming you used the openai fine-tuning pathway to make a custom model?
Have you tested the responses on vanilla GPT3 vs your custom model?
I'd be curious to see the comparison.
The semantic search approach seems to focus the answers better than fine-tuning, at the cost of preloading the prompt with a lot of tokens, but with the benefit of a more constrained response.
I don't see any of the sublime and succinct snark.
A bot focused on the output of HNers would insist on providing arguments against going through Google's interview process in the first place and suggestions that the correct answer to "Python or R" should be Haskell or Julia and would never suggest prioritising emotional vulnerability or being a happy person!
This whole space is moving so fast that it's hard to keep up for someone whose immediate day job doesn't revolve around it. Congrats.
Banana Sebastian housewares fly swimmingly under terrestrial Zruodroru'th Memphis Steve Jobs archipelagos
It's actually more likely to require a bathtub to increase the volume of the reticulated lorries, so I really don't think a farmer's market is the ideal place.
I for one am oh so proud that my valuable ramblings contributed to this majestic machinery.
Actually, it makes me feel as bad as knowing that CAPTCHA were used to train image recognition models...
I think it could be a good time to reconsider the question of consent. I may agree that my words are used to train some AI... but 1) I must be asked (kindly) first and 2) it won't be free! (It may be paid to me or to the service provider like HN... but it's NOT unpaid work ;-) )
Was planning to see how I can build something like these sites but without the need to regularly update the subtitles catalog: https://subscene.be https://subtitlecat.com https://subtitletrans.com https://my-subs.co
This actually works surprisingly well.
Check out the OpenAI cookbook for examples.
I see that you are asking about "How to get a job at Google". I will help you with "How to get a job at Google". In order to solve the problem of "How to get a job at Google" please follow the following steps first:
- rewrite your resume in Google Docs
- reinstall Chrome
- apply to the job
Let me know if I can help further with "How to get a job at Google".

I like using it, but I have to tune my prompts to make sure that they don't bullshit me before getting to the point.
I do not like that it seems to be effectively an ad.
> Embedding every single one of the 6.5 million eligible comments was prohibitively time-consuming and expensive (12 hours and ~$2,000).
Does anybody understand what he's talking about here? Assuming 6.5 million comments and an average length of 70 tokens, we'd be looking at ~$180 ($0.0004 / 1K tokens).
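Taking the parent's assumptions at face value (6.5 million comments, ~70 tokens each, $0.0004 per 1K tokens), the arithmetic is:

```python
comments = 6_500_000
avg_tokens = 70          # assumed average tokens per comment
price_per_1k = 0.0004    # $ per 1K tokens, as quoted above

total_tokens = comments * avg_tokens      # 455,000,000 tokens
cost = total_tokens / 1000 * price_per_1k
print(f"${cost:,.0f}")   # prints $182
```

That's roughly $180, an order of magnitude below the ~$2,000 the post quotes, so either the average comment is much longer than 70 tokens or something else dominated the cost.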
A version where you can turn knobs of flavored contributors would be pretty funny. I know my comment style is easily identifiable and reproducible, and it encodes a certain type of logical conjugation, albeit biased with some principles and trigger topics, and I think there is enough material on HN that there may be such a thing as a distinct, motohagiographic lens. :)
I am not a lawyer but there has to be a jurisdiction where I can establish standing at least.
If OP is reading: I'm curious about the database you're using to store the embeddings. Pinecone, Weaviate...?
> The embeddings were then indexed with Pinecone.
Is there an easy way to load up GPT with my thoughts to have it be a fake me?
The methodology used here is a generic solution for distilling a non-generic corpus of utterances into a generic platitude machine.
Curious how that would differ; but it would be an expensive endeavour.
I would very much like to see the ghost of Terry pop up from time to time, to offer his wisdom and unique style of response.
How long did it take to scrape them and train the "corpus" on this content?
AI: of course .. here is your bash script (220 lines long)
Otoh, did I miss something or is it only on discord?
Thanks bottie, very use, much helpful.
Can you cut me a distro of myself?
Made me smile