Would de-normalizing a string into obfuscated Unicode help prevent AI from matching the text in a prompt? For example, changing "The quick brown fox" to "𝓣𝓱𝓮 𝓺𝓾𝓲𝓬𝓴 𝓫𝓻𝓸𝔀𝓷 𝓯𝓸𝔁", or "apple" to "àρρlé". Since the obfuscated strings tokenize differently, they wouldn't match the original text in a prompt, correct? And although normalizing the strings back is possible, would it be practical (or impractical) to do at the scale of an LLM training corpus?
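For reference, here is a quick sketch (using Python's standard `unicodedata` module) showing that the styled-letter trick is already undone by ordinary NFKC normalization, while cross-script homoglyphs like Greek rho survive it and would instead need a confusables mapping; the example strings are my own:

```python
import unicodedata

# "Mathematical bold script" letters are Unicode compatibility characters,
# so NFKC normalization folds them straight back to plain ASCII.
styled = "𝓣𝓱𝓮 𝓺𝓾𝓲𝓬𝓴 𝓫𝓻𝓸𝔀𝓷 𝓯𝓸𝔁"
print(unicodedata.normalize("NFKC", styled))  # -> The quick brown fox

# Homoglyphs borrowed from other scripts (here Greek rho for "p") are NOT
# compatibility equivalents, so NFKC leaves them unchanged; undoing them
# requires a confusables table such as the one in Unicode TS #39.
homoglyph = "àρρlé"
print(unicodedata.normalize("NFKC", homoglyph))  # unchanged: àρρlé
```

So at least for the compatibility-character style of obfuscation, normalization is a one-line preprocessing step, which bears on whether it could scale across a corpus.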
Note that I'm not suggesting an AI couldn't produce obfuscated Unicode itself; it can. This question is only about preventing one's own text from being useful in a training corpus.