- Commercial bias (content compared to the source, which it learns about)
- Insincere motives
- Bloat (how many words it takes to say how little to penalize SEO bloat)
I would assume that using LLMs, we can get a pretty good idea of what is SEO bloat and who the bad actors are by this point, and just penalize those results.
Your ideas for metrics are good, but LLMs seem to be quite terrible at any of these. A simple set of heuristics and maybe a tiny language model for named entity detection and "vibe checking" would serve you much better.
Also, a lot of the worst offenders seem to use the same Q&A +- conclusion structure, which Viktor from marginalia.nu wrote a simple heuristic for, which I recall he said did wonders for pruning it. Solving SEO spam is easy when you aren't the one being optimized against. What's left is scaling and information retrieval.
As of right now, LLMs are prolific but unreliable, which makes them extremely well suited for generating spam, but unsuited to detecting it without a large number of false positives and negatives.
For reranking to detect commercial bias, insincerity, or bloat, you could use LLMs, but IIRC you train a multiclass classifier for each, then combine the probabilities from each head (calibrated too?) into a score and use it as a weight in your ranking?
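A rough sketch of what that combination step could look like, assuming each classifier head already emits a calibrated probability in [0, 1]. The `Result` fields, the `WEIGHTS` values, and the multiplicative penalty are all made up for illustration, not anything a real search engine is known to use:

```python
from dataclasses import dataclass

@dataclass
class Result:
    url: str
    relevance: float      # base retrieval score in [0, 1]
    p_commercial: float   # hypothetical classifier outputs, assumed calibrated
    p_insincere: float
    p_bloat: float

# Hypothetical per-head weights; they sum to 1 so the penalty stays in [0, 1].
WEIGHTS = {"p_commercial": 0.4, "p_insincere": 0.4, "p_bloat": 0.2}

def quality_penalty(r: Result) -> float:
    """Weighted average of the per-head probabilities."""
    return sum(w * getattr(r, name) for name, w in WEIGHTS.items())

def rerank(results):
    """Sort by relevance discounted by the quality penalty, best first."""
    return sorted(
        results,
        key=lambda r: r.relevance * (1.0 - quality_penalty(r)),
        reverse=True,
    )
```

With this shape, a highly relevant page that every head flags as spammy can still lose to a moderately relevant page that no head flags. Whether the weights should be hand-set or learned is a separate question.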
Case in point: I asked an LLM what the last non-cellular Windows Mobile Classic PDA was (I knew the answer), and it routinely got it wrong.
This is what LLMs should be useful for. If I can't audit the results or verify how it came to the conclusion, the answer is useless.
LLMs are toys at this juncture.
Mainstream web search is probably cooked. Kagi and other niche players might have a chance if the fact that they are not beholden to advertisers lets them introduce features to do things like downrank content with ads. Kagi has “small web” which I think includes this in its weighting.
Open social media is probably cooked too. In the future it’s going to require proof of human identity and will be more heavily moderated.
The future of social is closed forums. Even those are really having to fight bots though.
The other pervasive awful trend of the moment is everything becoming like John Deere tractors: cloud connected, DRMed, with subscriptions and/or planned obsolescence.
Capitalism is supposed to reward people for creating value, but today it seems like it’s far easier and more profitable to just extract rent or scam. I am not sure how to fix this.
We recently released a closed-system, iOS-only app, that has a fairly rudimentary, privacy-first system of user registration. It is currently restricted to the US and Canada.
Each signup request is manually vetted. There is no automatic registration. The app is designed for a specific demographic, and we do our best to ensure that new accounts are real people, that fit the demographic.
This is an iOS-only app (free), restricted to the North American continent, and with no accessible server API. The server is a bespoke server, and has no connections or dependencies that we don't control.
We are flooded with bots, and, most likely, scammers. So far, they have been pretty easy to spot, but that could change.
That's not to say it's a solved problem, even in the real world with thousands of years of battle tested strategies.
A very simple example would be a web browser where I could blacklist chronically-unhelpful sites, and share that metadata among friends.
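A minimal sketch of how that shared blocklist could work, assuming each friend publishes a plain set of hostnames. The function names and the `min_votes` threshold are hypothetical; no existing browser or extension is being described here:

```python
from collections import Counter
from urllib.parse import urlparse

def host_of(url: str) -> str:
    """Lowercased hostname with a leading 'www.' stripped."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host

def merged_blocklist(my_blocks, friends_blocks, min_votes=2):
    """My own blocks, plus any host blocked by at least min_votes friends."""
    votes = Counter()
    for blocks in friends_blocks:
        votes.update(set(blocks))  # one vote per friend per host
    return set(my_blocks) | {h for h, n in votes.items() if n >= min_votes}

def filter_results(urls, blocklist):
    """Drop search results whose host is on the blocklist."""
    return [u for u in urls if host_of(u) not in blocklist]
```

Requiring more than one friend's vote keeps a single person's grudge from hiding a site for everybody, while my own blocks always apply.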
1. People get on the web and make real content with care because they're just excited to share stuff with each other.
2. Advertisers talk them into putting some ads on their pages so they can get some compensation for their work.
3. Shitty people figure out you can just make content where the main incentive is to get people to go to the page and see the ads.
4. Those people then outsource writing the content to the lowest bidder.
5. The lowest bidder becomes an LLM.
6. Search engines cut out the middle-man entirely and just send your search query to an LLM, stuff some ads in, and show the result to the user without ever hitting the web (except to periodically scrape it for model training).
7. Because of 6, people stop putting new content on the web at all. The models get shittier and stupider with regard to current events.
8. To counter that, LLM companies make deals with news organizations and other primary source information providers and pay them to have direct access to content to train their models.
9. Those organizations get such a large fraction of their income from those deals that eventually they get out of the business of giving human readers direct access to it because it's not worth the effort. Newspapers become B2B companies.
10. The only way to get information is via a handful of giant tech companies sitting on top of huge LLMs saying who-knows-what trained on a slurry of actual information and giant piles of ads.
I hope that somewhere in the process people start to get tired of talking to machines all day, hop off the ride entirely, and start calling up their friends and getting information the old-fashioned way.
The only consolation I have is the belief that people have a deep seated desire to connect to actual humans and know the real truth about the world.
I’ve started trying to pay for good journalism, especially good indie journalism. I also Patreon a bunch of podcasts, buy high quality software if the price is reasonable, buy albums of my favorite music, buy films, and so on, while actively avoiding both gratuitous subscription models and the ad web.
Pay for it or it either doesn’t get made or it pays for you. Free is a lie and piracy undermines quality.
Edit:
All the paying for good stuff I outlined above averages out to around $100-$150/month. It’s less than I usually spend on restaurants and coffee shops and far less than groceries for our family. Restaurants in particular feel like a far more frivolous expense.
IMO, the number of engineers and moderators needed to offset one scammer is about to take a huge dive.
[1] Either you pay a SaaS LLM provider or you pay the cost of compute to run the LLM yourself.
I really hope we do not give up the internet's freedom as they suggest (and I doubt this would solve the spam problem).
Just imagine when you buy your SIM card the phone shop asks you: Do you want to limit incoming calls to people who you either called before or who have ever had a permanent residence in your country? 99% of spam and scam calls blocked, just like that.
And just imagine how hilarious it would be if all those Nigerian prince emails had a note that says "actually, the sender of this email has never been to Nigeria"
I am not saying this does not have upsides, but it would be a nightmare to have it imposed on you.
Conversely, what if all those emails actually originate from Nigeria? Would it make them more legit?
Also, as long as search engines do their job, engagement on high quality pieces will always justify having a human write art
> as long as search engines do their job
lolololololol, funniest thing i've read all day.
I think abandoning anonymity is the only way forward and I think it’ll be glorious.
I remember in my 2005 high school political science class proposing fake legislation to require ID to go on the internet.
For the past 20 years I’ve shuddered at how awful an idea that would be and how naive I was as a kid.
But now I am curious how all of the externalities would play out.
* "Psychologists don't want to fix your problems because then you'll stop needing therapy every week."
* "Dating sites don't actually want you to find a long-term partner because then you'll stop using the site."
* "The mechanic's not trying to actually fix your car, just get it running for a few weeks so it breaks down again and you come back."
Etc. etc.
Any time there is information asymmetry and leaving a customer not fully satisfied might lead to future sales, this old canard comes up.
I'm sure in some cases it's true. But, like, people aren't entirely stupid. Consumers generally won't keep repeatedly going back to the same business if the service is kinda sucky. And businesses generally figure out that reputation matters and the most economically viable long-term strategy is just to give people what they want.
This isn't a law of nature. It's the result of particular conditions. Businesses in high-trust and low-trust cultures behave differently, and the descent of the US from a high-trust to a low-trust culture is going to have consequences.
Now think crime and police: without crime, the police would be out of a job.
A consultant has no interest in the project they are consulting on ever being completed.
Of course these examples aren’t to be taken seriously, they merely illustrate some potential conflicts of interest of roles within society.
This is where the theory falls apart. When “long-term strategy” and “short-term quarterly earnings” get into the boardroom together at a public company, it ain’t “long-term strategy” that’s walking out.
Your examples 1 & 3: I have personally witnessed those negative outcomes.
Regarding 3, A/C repair shops (drain system first, then discuss pricing) and transmission shops (disassemble first, then discuss pricing) are kind of notorious for it.
And yet there are mechanics and therapists who have earned my unquestioning trust.
I've not used 2. I may have bias.
We have the internet we pay for.
The root problem there is it's not practical in today's world to pay the small sums individual bits of content are actually worth. The smallest practical transact-able value is about a dollar. With service fees you can't go much below that value before you make no money or lose money. Even then it's really only practical for large scale players.
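To put rough numbers on that floor, here is a quick sketch. The fee schedule (2.9% plus a fixed $0.30 per transaction) is an assumption picked as a typical-looking card processing rate, not a figure from this thread, and real rates vary by processor and country:

```python
def net_revenue(price, fee_rate=0.029, fixed_fee=0.30):
    """What the seller keeps after an assumed percentage + fixed fee."""
    return price - (price * fee_rate + fixed_fee)

def break_even_price(fee_rate=0.029, fixed_fee=0.30):
    """Price at which the seller nets exactly zero under the same assumed fees."""
    return fixed_fee / (1.0 - fee_rate)

for price in (0.05, 0.25, 1.00, 5.00):
    print(f"${price:.2f} -> net ${net_revenue(price):+.2f}")
```

Under those assumed fees the fixed part dominates at small prices: anything below roughly 31 cents loses money outright, and even at a dollar about a third goes to the processor.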
A single news article is not worth a dollar. A tweet or HN comment is worth nowhere near a dollar. Even if I found some HN comment worth money...how is my payment going to get to you and not eaten by HN?
Crypto bullshit is not the answer at all. It's worse for transactions in every way than regular money. It pretends to be a solution to micropayments by ignoring the very real and very onerous transaction fees and the deflationary nature of the currency. There have been efforts to deal with micropayments, but it's a hard problem. Paying individuals is difficult, and transacting in practical (sub-cent) values is extremely difficult.
Ads are an imperfect but working-ish solution to micropayments. They allow the customer to "pay" with attention (though now with intrusive tracking) rather than currency. AdTech has gone bonkers with tracking and targeting and has gleefully participated in facilitating the Dead Internet.
In the age of copyright enforcement and DMCA antics, I don’t understand how this continues year after year.
One day the WHO found them and wholesale copied large chunks of their content verbatim into their online resources without asking or informing. They only found out because THEY went looking on the WHO website for the latest information on something.
At that point I started wondering just how much of big companies' work and content is just plagiarism and gratuitous theft from reputable but less visible or popular sources (also highly dependent on country, language, etc). And that was before I discovered hbomberguy on YouTube.
Thus, you have one document that's identical across dozens or hundreds of different hospital and medical system websites.
A long time ago, I held more weight in results from reputable places like Mayo Clinic, but even their site seems to be the same as all the others now.
problem is, how do you figure out someone's a bot? Ah well maybe someone will make an AI tool for that, we humans are too busy doing groceries and putting round cubes into square holes.
I bet we could get a major university (eg. Stanford) to help fund the initial deployment. I think we should call it a really big silly number, like "Gazillion" to emphasize how much it knows. Obviously this won't have ads, and it should make an explicit point to not be evil....
Hmm. I think I've heard this before.
The business model is very simple and requires 3 ingredients: web design, SEO, and copyediting. The first two are one-time costs. The 3rd one is a COGS and there is a whole market of professional blogwriters who charge something per 1000 words.
To answer your question, once you create a blog that starts printing money, it's in your best interest to just replicate it while changing as little of the "template" as possible, because you don't really know what element made it click.
Ie your "you don't really know what element made it click." above assumed Jodie was the fixed value, no matter what else.
That's what I don't get: I click on lego spam because I like lego, not because Jodie is cool and I love his wordsmithing. That follows on.
I'm probably being thick. Maybe it's because to the recipient Jodie is unique each time? It's "this is the least important part of it" so they don't change it because IT DOESN'T MATTER.
One name is twice as easy to invent as two names.
Because Jodie is everywhere and got your girl back home
https://taskandpurpose.com/military-life/brief-history-jody-...
Remember email spam? It got so bad that we fixed it. I mean, email has its issues, and how, but spam isn't one of them. I built a spam juggernaut in my day (got bills, don't I :)) and I feel like I contributed a tiny bit to our almost-spamless latter days.
Progress! The world is on the march.
Which got me thinking: surely that would work for websites too? Why not let people report low-quality sites directly to the search engine? Kagi lets you ban sites you find unhelpful, but it doesn't downgrade websites similar to the one you've banned. I shan't speak of Google practices as we all know them by now.
[1] https://en.wikipedia.org/wiki/Naive_Bayes_spam_filtering
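For anyone who hasn't seen it, the technique in [1] fits in a few lines. This is a toy sketch of a naive Bayes text classifier with add-one smoothing, applied to page text with user reports as labels. The class name, the labels, and the whitespace tokenizer are illustrative assumptions; a real deployment would need far more data and better features:

```python
import math
from collections import Counter

class NaiveBayes:
    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Record one user-labelled page ('spam' or 'ham')."""
        self.word_counts[label].update(text.lower().split())
        self.doc_counts[label] += 1

    def spam_probability(self, text):
        """P(spam | words), assuming words are independent given the label."""
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label in ("spam", "ham"):
            # Smoothed class prior.
            log_p = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            n_words = sum(self.word_counts[label].values())
            for w in text.lower().split():
                # Add-one smoothing so unseen words never zero out the product.
                count = self.word_counts[label][w]
                log_p += math.log((count + 1) / (n_words + len(vocab) + 1))
            log_scores[label] = log_p
        # Normalize in log space to avoid underflow on long pages.
        m = max(log_scores.values())
        exp = {k: math.exp(v - m) for k, v in log_scores.items()}
        return exp["spam"] / (exp["spam"] + exp["ham"])
```

Pages similar to ones people have reported would then score high without anyone having to report them individually, which is the part per-user ban lists don't give you.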
Sure, there are occasional unwanted marketing emails but those are easily dispatched with unsubscribe and/or inbox filters
I'm going to be that pedant and point out that the Internet is not the same as the Web, and it's the Web that's sick. The Internet is fine.
It's a distinction that matters because the Internet is expensive things like satellites and undersea cables. It's an investment that's too large to just walk away from, so perhaps its future is our future.
The Web is just a bunch of conventions about how to use the Internet, it's not binding in any way. We can write a different protocol without laying new cable, we can make it less profitable for abusers, and then we can abandon the sick version that we're currently using.
You could use the infrastructure for better and many people do. But most of the content on the internet isn’t that.
But it means the platform will soon be flooded with bot-generated spam.
That said, just because they can doesn't mean they will. It's possible that they've grown complacent and underfunded these capabilities (instead relying on community moderators to weed out bad actors). Or it's possible that they're too focused on the short term to see the existential risks. If I recall correctly, they couldn't resist the temptation of selling user content for LLM training, same as Stack Overflow.
But in an internet overrun by spam LLMs, the future is curated, walled-garden communities, and Reddit could be the basis for that.
> The only future of the internet is, sadly, proof-of-person and proof-of-residence on every public network interaction
Yeah, go away. The problem is that Google has monopolized web aggregation. Without it these sites wouldn't be worth making. I've got to show ID to post something online to kick that can down the road?
The way I see it this is a self resolving problem. Ad driven search engines rank ad driven blog spam, people get tired of it and use methods of finding content that don't show that garbage, these guys, along with their multinational trillion dollar benefactor, go out of business. Problem solved.
So how will you find things?
> Ad driven search engines rank ad driven blog spam
We don't have actual concrete proof that Google (or others) rank content with ads higher because of the ads.
As a contrary corollary, generally as society grows, we've seen an increase in "proof-of-person and proof-of-residence on every public X". Want welfare checks or charity, prove yourself. Want to buy cough syrup or booze, prove yourself. Want to drive, want to shoot a gun, want to XXXX.... get an ID.
In early America, men used to vote by everyone going into a big room and shouting for a while (some minor exaggeration). Now you have to register in advance and show ID, and they maintain registries of everyone and their affiliated party.
Showing identity comes with a lack of trust, and volume + anonymity decreases trust as it's slowly abused.
How do I find things. I'm already living life without much google in it. There are a lot of ways to find things. Aggregators like this one have a better signal to noise ratio than google or most places that publish a lot of information. There are search engines that actively blacklist anything with SEO in it. There are community groups that focus on topics of interest. I find that I only use big search engines nowadays to find a git repo for something or find out what time some place closes, that's all they're good for nowadays. I trust people more than faceless services, and I don't care anything about who any of those people are in real life.
https://mtbinsider.com/author/jodie-chiffey/
https://turfandtill.com/author/jodie-chiffey/
https://www.betterwander.com/author/jodie-chiffey/
https://artofgrill.com/author/jodie-chiffey/
https://theathleticfoot.com/author/jodie-chiffey/
https://altprotein.com/team-members/jodie-chiffey/
https://digitalguyde.com/us/
https://total3dprinting.org/author/jodie-chiffey/
Each of which is registered by NameCheap, who can never seem to kick their addiction to bottom feeders. Each of which is behind Cloudflare, the official latrine of planet dysentery.
disclosure: I use too.
There's still information in the noise, you have to become your own editor.
I woo women with my sensuous and godlike trombone playing, I can pilot bicycles up severe inclines with unflagging speed, and I cook Thirty-Minute Brownies in twenty minutes. I am an expert in stucco, a veteran in love, and an outlaw in Peru.
Using only a hoe and a large glass of water, I once single-handedly defended a small village in the Amazon Basin from a horde of ferocious army ants. I play bluegrass cello, I was scouted by the Mets, I am the subject of numerous documentaries. When I’m bored, I build large suspension bridges in my yard. I enjoy urban hang gliding. On Wednesdays, after school, I repair electrical appliances free of charge.
I am an abstract artist, a concrete analyst, and a ruthless bookie. Critics worldwide swoon over my original line of corduroy evening wear. I don’t perspire. I am a private citizen, yet I receive fan mail. I have been caller number nine and have won the weekend passes. Last summer I toured New Jersey with a traveling centrifugal-force demonstration. I bat 400. My deft floral arrangements have earned me fame in international botany circles. Children trust me.
I can hurl tennis rackets at small moving objects with deadly accuracy. I once read Paradise Lost, Moby Dick, and David Copperfield in one day and still had time to refurbish an entire dining room that evening. I know the exact location of every food item in the supermarket. I have performed several covert operations for the CIA. I sleep once a week; when I do sleep, I sleep in a chair. While on vacation in Canada, I successfully negotiated with a group of terrorists who had seized a small bakery. The laws of physics do not apply to me.
I balance, I weave, I dodge, I frolic, and my bills are all paid. On weekends, to let off steam, I participate in full-contact origami. Years ago I discovered the meaning of life but forgot to write it down. I have made extraordinary four course meals using only a mouli and a toaster oven. I breed prizewinning clams. I have won bullfights in San Juan, cliff-diving competitions in Sri Lanka, and spelling bees at the Kremlin. I have played Hamlet, I have performed open-heart surgery, and I have spoken with Elvis.
But I have not yet gone to college.
- well known missive from decades ago
For now, at least we can still put "before:2020" in our search queries.
First, what jumped out at me when I read the article was that I had never heard of Jodie Chiffey before reading the article. I don't see all the trash the author is talking about because I don't spend any time on websites that promote it, or dealing with spam emails from sources of it, etc., etc. I can get along just fine with simply ignoring the existence of all the trash and going to websites (like this one) that I actually want to go to.
And how am I able to go to websites like this one without being overwhelmed by trash? Because (a) this website is managed by actual people who aren't interested in spamming me, and (b) HTTPS means I know that when I go to "news.ycombinator.com" I am going to a site managed by those people. In other words, we already have the actual solution to the "how do I avoid the trash" problem, and have had it for decades. We do not need to build a huge new spyware infrastructure to "fix" the web. We already have, and are using, the tools we need to avoid the trash.
This listed review spam from Dotdash Meredith, which I used as a clue to search for and block all of their sites using uBlacklist.
Edit: Found another article saying the same thing, that only a few big media giants dominate search: https://detailed.com/google-control/
I just want to point out Matt.sh apparently missed that Jodie’s a hardcore gamer and gaming expert as well as a trusted advisor to senior citizens. She also “… doesn’t want to hold back when it comes to getting the word out about affiliate marketing,” and will teach you her tricks. She is an avid toy insider too, loves toys from the 80’s. Not that bad looking either.
> We end up with real time global access from “low trust low income max exploitation because who is going to stop us” wankers to “high trust high income low questioning” societies and everything falls apart.
Obviously, there are "low trust high income max exploitation" folks doing the damage too, as one of the linked articles talks about frauds with wiring large sums (in 100ks of dollars) into Hong Kong, which is itself an expensive city.
Similarly, from the other direction, you've got high-income communities "exploiting" lower-income communities by getting cheaper labour (like getting things produced in China) — paying less than they would locally for the same or larger work effort.
The solution to this is obviously global equalization of salary bands, which is well under way due to an ability to do a lot of highly paid work remotely and globalization in general — but it will take some time (and it's also why China is becoming less appealing in particular: salaries are going up there as well). But that will lead to a new set of problems altogether.