
0 points · andai · 1y ago · 0 comments

It seems to come down to keyword expansion, though I'd be curious if there's more to it than just asking "please generate relevant keywords".


sdesol · 1y ago

Something that I'm working on is making it easy to fix spelling and grammatical errors in documents that can affect BM25 and embeddings. So in addition to generating keywords/metadata with an LLM, you could also ask it to clean the document; however, based on what I've learned so far, fixing spelling and grammatical errors should involve humans in the process, so you really can't automate this.
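To illustrate why a spelling error matters for BM25, here is a minimal sketch of Okapi BM25 scoring in Python. The function name `bm25_scores`, the whitespace tokenizer, and the default `k1` and `b` values are assumptions for illustration, not anything from the tool described above. Since BM25 matches exact tokens, a document containing a misspelled term simply gets no credit for that query term.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a plain Okapi BM25 formula.

    Tokenization is naive lowercase whitespace splitting (an assumption made
    to keep the sketch short).
    """
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(t for d in tokenized for t in set(d))

    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # exact-match only: a typo contributes nothing
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "clone the repository from github",
    "clone the repository from githbu",  # same sentence with a typo
]
print(bm25_scores("github", docs))
```

The second document is identical apart from the typo, yet it scores zero for the query "github", which is the kind of silent retrieval miss that cleaning the document is meant to prevent.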

andai (OP) · 1y ago

Fascinating. I think the process could be automated, though I don't know if it's been invented yet. You would want to use the existing autocomplete tech (probabilistic models based on Levenshtein distance and letter proximity on the keyboard?) in combination with actually understanding the context of the article and using that to select the right correction. Actually, it sounds fairly trivial to slap those two together, and the second half sounds like something a humble BERT could handle? (I've heard of people getting great results with BERTs in current year, though they usually fine-tune them on their particular domain.)
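A rough Python sketch of that two-stage idea, under stated assumptions: candidates come from an edit distance in which substituting a neighboring QWERTY key costs less than any other substitution, and the final choice is delegated to a `context_score` callable. That callable is a hypothetical stand-in for a BERT-style model, and the adjacency map, cost values, and function names are all illustrative.

```python
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]


def build_adjacency(rows):
    """Map each letter to the letters next to it on a simplified QWERTY grid."""
    adjacent = {}
    for r, row in enumerate(rows):
        for c, ch in enumerate(row):
            near = set()
            for dr in (-1, 0, 1):
                rr = r + dr
                if not 0 <= rr < len(rows):
                    continue
                for dc in (-1, 0, 1):
                    cc = c + dc
                    if (dr, dc) != (0, 0) and 0 <= cc < len(rows[rr]):
                        near.add(rows[rr][cc])
            adjacent[ch] = near
    return adjacent


ADJACENT = build_adjacency(QWERTY_ROWS)


def weighted_edit_distance(a, b, near_cost=0.5):
    """Levenshtein distance where hitting a neighboring key is a cheaper substitution."""
    prev = [float(j) for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        cur = [float(i)]
        for j, cb in enumerate(b, 1):
            if ca == cb:
                sub = 0.0
            elif cb in ADJACENT.get(ca, ()):
                sub = near_cost
            else:
                sub = 1.0
            cur.append(min(prev[j] + 1.0,        # delete from a
                           cur[j - 1] + 1.0,     # insert into a
                           prev[j - 1] + sub))   # substitute
        prev = cur
    return prev[-1]


def correct(word, vocabulary, context_score, max_distance=1.0):
    """Pick the candidate the context scorer prefers; ties go to the closer word."""
    candidates = [w for w in vocabulary
                  if weighted_edit_distance(word, w) <= max_distance]
    if not candidates:
        return word  # nothing plausible: leave the word untouched
    return max(candidates,
               key=lambda w: (context_score(w), -weighted_edit_distance(word, w)))


vocab = ["form", "farm", "from"]
# With no contextual signal, the keyboard-aware distance decides.
print(correct("fprm", vocab, lambda w: 0.0))
# A (hypothetical) context model that favors "farm" overrides the distance.
print(correct("fprm", vocab, lambda w: 1.0 if w == "farm" else 0.0))
```

The point of the split is that the distance model only proposes plausible corrections, while whatever understands the surrounding text gets the final say.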

I actually think even BERT could be overkill here -- I have a half-baked prototype of a keyword expansion system that should do the trick. The idea is to construct a data structure of keywords ahead of time (e.g. by data-mining some portion of Common Crawl), where each keyword has "neighbors" -- words that often appear together and (sometimes, but not always) signal relatedness. I haven't taken the concept very far yet, but I give it better than even odds! (Especially if the resulting data structure is pruned by a half-decent LLM -- my initial attempts resulted in a lot of questionable "neighbors" -- though I had a fairly small dataset so it's likely I was largely looking at noise.)
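One way such a "neighbors" structure might look, as a minimal Python sketch rather than the prototype itself: count document-level co-occurrence, keep each term's most frequent partners, and use them to expand a query. The names `build_neighbors` and `expand_query`, the whitespace tokenizer, and the `min_count` and `top_k` thresholds are assumptions for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_neighbors(documents, min_count=2, top_k=5, stopwords=frozenset()):
    """Map each term to the terms it most often shares a document with."""
    pair_counts = defaultdict(Counter)
    for doc in documents:
        # Use a set so a pair is counted at most once per document.
        terms = sorted({t for t in doc.lower().split() if t not in stopwords})
        for a, b in combinations(terms, 2):
            pair_counts[a][b] += 1
            pair_counts[b][a] += 1
    return {
        term: [n for n, c in counts.most_common(top_k) if c >= min_count]
        for term, counts in pair_counts.items()
    }


def expand_query(query, neighbors):
    """Append each query term's neighbors, preserving order and skipping repeats."""
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        for n in neighbors.get(term, []):
            if n not in expanded:
                expanded.append(n)
    return expanded


docs = [
    "bm25 ranking search",
    "bm25 search index",
    "embedding vector search",
]
neighbors = build_neighbors(docs)
print(expand_query("bm25", neighbors))
```

Raw co-occurrence like this tends to produce exactly the questionable neighbors described above, especially on a small corpus. The `min_count` cutoff is only a crude filter, and it is where an LLM-based pruning pass could slot in.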

sdesol · 1y ago

> I think the process could be automated

It can definitely be automated in my opinion, if you go with a supermajority workflow. Something that I've noticed with LLMs is that it's very unlikely for all high-quality models to be wrong at the same time. So if you go by a supermajority, the changes are almost certainly valid.
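A minimal sketch of what a supermajority check could look like in Python, assuming each model's proposed correction has already been collected as a string. The function name `supermajority` and the two-thirds default threshold are illustrative assumptions, not a description of an existing workflow.

```python
from collections import Counter


def supermajority(proposals, threshold=2 / 3):
    """Return the proposal backed by at least `threshold` of the models, else None."""
    if not proposals:
        return None
    winner, votes = Counter(proposals).most_common(1)[0]
    return winner if votes / len(proposals) >= threshold else None


# Three of four (hypothetical) models agree, so the change is accepted.
print(supermajority(["GitHub", "GitHub", "GitHub", "Github"]))
# No clear agreement: return None and leave the text for a human to review.
print(supermajority(["GitHub", "Github", "Git Hub"]))
```

Returning `None` instead of the plurality answer keeps the automated path conservative: anything short of a supermajority falls back to manual review.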

Having said all of that, I still believe we are not addressing the root cause of bad searches, which is "garbage in, garbage out". I strongly believe the true calling for LLMs will be to help us curate and manage data at scale.

firejake308 · 1y ago

> fixing spelling and grammatical errors should involve humans in the process, so you really can't automate this

This is an interesting observation to me. I would have expected that, since LLMs evolved from autocomplete/autocorrect algorithms, correcting spelling mistakes would be one of their strong suits. Do you have examples of cases where they fail?

sdesol · 1y ago

If you look at my post history, you can see an example of how Claude and OpenAI cannot tell that GitHub is spelled correctly. The end result won't make a difference, but it raises questions about how else they can misinterpret things.

At this moment I would not trust AI to automatically make changes.
