Very soon, the domain of bullshit will extend to actual text. We'll be able to buy HN comments by the thousand -- expertly wordsmithed, lucid AI comments -- and you can get them to say "this GitHub repo is the best", or "this startup is the real deal". Won't that be fun?
You're forgetting the millions of additional comments that will be written by humans to trick the AI into promoting their content.
Even worse, currently if you ask ChatGPT to write you some code, it will make up an API endpoint that doesn't exist and then make up a URL that doesn't exist where you can register for an API key. People are already registering these domains, and parking fake sites on them to scam people. ChatGPT is creating a huge market for creating fake companies to match the fake information it's generating.
The biggest risk may not be people using AI-generated comments to promote their own repos, but rather registering new repos to match the fake ones that the AI is already promoting.
Does ChatGPT consistently generate the same fake data though?
I don't think an arms race for convincing looking bullshit is going to turn out well for our species.
The obvious problem is we don’t have any great alternatives. We have captcha, and we can look at behavior and source data (IP), and of course everyone’s favorite fingerprinting. To make matters worse: abuse, spam and fraud prevention lives in the same security-by-obscurity paradigm that cyber security lived in for decades before “we” collectively gave up on it, and decided that openness is better. People would laugh at you to suggest abuse tech should be open (“you’d just help the spammers”).
I tried to find whether academia has taken a stab at these problems but came up pretty much empty handed. Hopefully I’m just bad at searching. I truly don’t get why people aren’t looking at these issues seriously and systematically.
In the medium term, I’m worried that we’ll not address the systemic threats, and continue to throw ID checks, heuristics and ML at the wall, enjoying the short lived successes when some classifier works for a month before it’s defeated. The reason this is concerning is that we will be neck deep in crap (think SEO blogspam and recipe sites but for everything) which will be disorienting for long enough to erode a lot of trust that we could really use right now.
There's always identity based network of trust. Several other members vouch for new people to be included.
I can see lots of reasons people might oppose the idea but I am not sure why it's not a widely discussed option?
(asking honestly and openly - please don't shout!)
Of course we do. The rise of digital finance services has led to the creation of a number of services that offer the identity verification necessary for KYC. All such services offer APIs, so adding an identity verification requirement to your forum is trivial.
Of course, if it isn't obvious, I'm only half joking.
Of course it's not always easy to say what's AI-generated or not. But if an account is making a habit of it, it still seems possible to tell.
Phone, then ID-based verification is a stop gap, but IDV services will have to spin up to support the mass volume of verifying all humans.
[1] I kind of want to do this from an innocent / artistic perspective myself. Perhaps a bot that responds with a bunch of rhetorical questions or onomatopoeia. Then I'd scale it to the point people start noticing and feeling weirded out by it. "Is this the new Gen Alpha lingo?" Alas, I have too many other AI projects.
I first tried Google; the results are dominated by commercial crap.
Then I tried the "google reddit" trick to try and find some real people's opinions... but look at all the blatantly bullshit comments on this Reddit thread; https://www.reddit.com/r/Thunderbird/comments/ae4cdg/good_ps...
---
(if anyone is wondering, the best option for Windows is to use 'readpst' command via WSL. Comes in the 'pst-utils' package).
"The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor."
I'm hoping to put an AI between me and my e-mail inbox this weekend (I had ChatGPT write most of the code; it's not much); not fully automated, but evaluating and summarising and categorising. I might extend that to e.g. give me an "algorithm" for my Mastodon timeline (despite all of the people insisting on reverse chronological, I'm at a few hundred people I follow and already can't keep up), and a number of other sites I visit. For most of these things latency does not matter, so e.g. putting them through llama.cpp rather than something faster is fine, and precision isn't critical (I won't trust it to automatically reply or automatically reject anything, only to do prioritisation and categorisation where missteps won't have any critical impact).
Sometimes signals are noise we just need to calibrate.
The first time you don’t get a job because of a reference you gave you learn a lesson. If it ever happens again, it’s on you.
Those of us who are careful internet readers have spent years developing good heuristics to use textual clues to tell us about the person behind the text. Are they smart? Are they sincere? Are they honest? Are they commenting in good faith? Those skills will soon be obsolete.
The folks at OpenAI, who are nominally on a mission to make sure AI "benefits all of humanity", have condemned us to a life sentence of fending off high-volume, high-quality bullshit. Bullshit that they are actively working to make harder to detect. And I think the first victims of that will be internet forums where text is the main signal, places like this and Reddit.
Definitely already the case, you really think Rust and SQLite would get more than a couple of upvotes otherwise? :D
Then again, maybe Google had some mandatory HN time for their employees, that would be enough to explain that :D
Self proclaimed GitHub star. But still only 5000 followers and projects max out at 8000 stars.
I don’t know what I had expected but I think it was bigger numbers than that.
I follow him on GitHub, and pay for some of his products. I have been heavily influenced by his coding styles, and the tools he uses. His code just looks so tight and perfect. He writes his stuff so open ended and reusable that he basically writes a method once, and then reuses it across numerous projects.
Look at this tight code: https://github.com/laravel/framework/blob/10.x/src/Illuminat...
I’d say that Adam Wathan is rapidly growing his influence as well, and is probably doing alright too.
It seems odd to title them influencers based on that.
At least that's how the 3rd party recruiter told me he found me. It's possible he was lying and thought it would impress me (it did).
My profile is more active than most, but very far from rockstar.
Had to change my Location (or some similar obvious field) in my GitHub profile to "Recruiters FUCK OFF" before they took the hint. ;)
Thankfully, GitHub introduced some other way to signal if you are/aren't interested in getting a job (toggle switch?) not long after, which seemed to work.
The book presents similar stories.
I tend to check the age difference between the earliest and latest commits because that lets me be sure it's not a project that someone spent a couple weeks coding up, dropped on github, and then forgot about. I'll also check the issues on there. I'm looking for more closed issues than open ones, but I'll also quickly scan over them to get a rough idea of how many are truly meaningful issues. I also get signals from the readme and docs. It's not a hard pass if there's issues with those, but it's certainly helpful to my opinion if they exist and are both clear and detailed.
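For what it's worth, the checks above are easy to sketch in code. This is only a rough illustration, not anyone's actual tooling: the function name `repo_signals`, the 90-day cutoff, and the idea of passing in commit dates and issue counts you have already collected are all assumptions made for the example.

```python
from datetime import date


def repo_signals(commit_dates, open_issues, closed_issues, min_span_days=90):
    """Return rough maintenance signals for a repository.

    commit_dates: iterable of datetime.date, one per commit, in any order.
    min_span_days: hypothetical cutoff separating a sustained project
    from something coded up in a couple of weeks and then abandoned.
    """
    dates = sorted(commit_dates)
    # Age difference between the earliest and latest commits.
    span_days = (dates[-1] - dates[0]).days if dates else 0
    total = open_issues + closed_issues
    # Share of issues that were closed; None when there are no issues at all.
    closed_share = closed_issues / total if total else None
    return {
        "span_days": span_days,
        "sustained": span_days >= min_span_days,
        "closed_share": closed_share,
        "more_closed_than_open": closed_issues > open_issues,
    }


signals = repo_signals([date(2021, 1, 1), date(2023, 1, 1)], open_issues=10, closed_issues=30)
print(signals)
```

The numbers only tell you where to look. Skimming the issues to see which ones are meaningful, and reading the readme and docs, still has to be done by hand.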
So unless you are really well versed in the project and spent some time following it, stars actually might be a better indicator of the project quality and reputation.
God, I hate this. Every time I have an issue with something, look it up on the issue tracker and find the exact issue I'm having autoclosed as "stale" by a fucking bot because the author didn't reply "this is still an issue" once every 24 hours, it instantly makes my blood boil and I avoid using the software in question as much as possible in the future. Nothing screams "I care more about github numbers than my users or the quality of my software" more than this.
If one of the repos has many more stars, I weigh that strongly when choosing. Freshness of commits is definitely important, but for me the fact that many other people starred the repo shows that there are eyeballs and activity.
You're the kind that checks everything. Even if you had something valuable, a scammer wouldn't waste their time with you when there are easier fish to bait.
I really wish GitHub would have some sort of flag for "stale" projects. I use your methods too (issues, dates, etc.), and I'm usually disappointed when search results bring up ghost projects. However, in a few instances, I found a project that was similar to an issue I was working on that went one step beyond where I was, and even though it was a ghost project, it helped. But in general, these projects don't help. I'm also disappointed that I'm thinking, "Hmmm, maybe LLMs can help..."
It's almost like you are thinking of it as an expiration date and the software has spoiled.
Now that I think about it — it is a python wrapper around a boost library and neither of those have made backwards incompatible changes in a long time which is quite suspicious.
I doubt anyone would do this, but commit date can be arbitrarily changed.
As you indicate though, they require more effort to adjudicate. Are issues from core team members? Are commits meaningful? Is community activity meaningful? I wish GitHub would allow us to parse things like this more easily.
My use of star count is generally a binary indicator. 1k+ is probably a legit project and below is probably still early. Beyond that, it's probably too noisy.
1. Indicates the astroturfing without actually specifically calling them out 2. Does so in a way where others can verify their work and use it on other repos 3. Uses their product to do so
Seems pretty brilliant to me.
> we track our own GitHub star count along with that of other projects. So when we spotted some new open-source projects suddenly racking up hundreds of stars a week, we were impressed. In some cases, it looked a bit too good to be true, and the patterns seemed off
If their competitor has fake-looking star counts, I'd expect them to be the ones best equipped and most likely to suspect it.
Really? I honestly just don't believe this... if I were to believe this, I think I'd have to conclude the world is just too broken to bother rescuing.
What did I miss? What's the best answer you've ever heard? How do you evaluate 3rd party dependencies?
I prefer to look at the recent commits, or any recent activity on the repo's issues, but I would like to know what else can be used as an indicator.
It takes a lot more effort to collect multiple metrics along different axes, understand the skew/bias of them and make an informed decision.
Visibility and ease of consumption are the most important aspects of a metric if you want people to use it.
The enterprises I deal with cared almost exclusively about stuff like license choices, support contract options, and "invoice billing" ;P. The vetting process I've dealt with at VCs was intense, having worked both sides of that situation; and I know multiple people who have worked data science jobs at such firms to try to better select investments. As for a "talented professional", I can pretty much guarantee they are going to look at your codebase, not the number of stars it has, while they evaluate any number of more reasonable things to judge an opportunity on (commute, pay, management style, etc.). A key property of competent deciders is that they aren't using trivial metrics.
(The firm X, however, is a more well-known name than my ex-employer was).
A while ago, I listened to a Freakonomics episode where it was discussed that businesses use proxies to both boost their image and to cover up their incompetency. The example was that a lot of businesses chose fancy names starting with A (like, AAA plumbers), so that they get listed first in business directories. These firms were later proven to be very incompetent and/or even fraudulent.
The relevant paper, also cited in the episode, was "A Business by Any Other Name": https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1667550.
I think two great community activity indicators are GitHub issues and Slack/Discord/Discourse comments. One key thing with GitHub issues, in my opinion, is that if the issues are mostly filed by the core team, it is not a great sign. You want a large mix of issues from customers or users and not from the team. This is a good indicator of whether the project is solving a real problem or not. Starring is a very low-threshold action. The same goes for Slack comments: they should have both volume and freshness.
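A minimal sketch of the core-team check, assuming you already have the list of issue authors and know who is on the team. The function name `external_issue_share` and the sample usernames are made up for illustration.

```python
def external_issue_share(issue_authors, core_team):
    """Fraction of issues opened by people outside the core team.

    issue_authors: one username per issue (repeats allowed).
    core_team: usernames of maintainers; compared case-insensitively.
    Returns None when there are no issues to judge.
    """
    core = {name.lower() for name in core_team}
    authors = [author.lower() for author in issue_authors]
    if not authors:
        return None
    external = sum(1 for author in authors if author not in core)
    return external / len(authors)


# Hypothetical data: two of the four issues come from outside the team.
share = external_issue_share(["alice", "bob", "carol", "Alice"], ["alice"])
print(share)
```

A low share suggests the tracker is mostly the team talking to itself. What counts as "low" is a judgment call, and freshness of the issues matters as much as the ratio.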
In a very strange way (but reflective of the economic regime) a startup that fakes stars vs a straight-arrow startup that doesn't is demonstrating a key element for success in business, which seems to require a significant element of bullshitting, and outright deceiving. The mantra has been that "grow grow grow" is the only guideline for success. Inflating your stars is just rookie hour practice for bigger better growth b.s. down the line.
That makes no sense to me. Speaking as someone who has been using YouTube Data API v3 and YouTube Analytics API v2 for many years, estimated minutes watched of a video shouldn’t be public info. So how can you “look at the total time played” on a competitor’s video?
I think you have it backwards, the other video was using fake likes to avoid having to improve their quality to get an equal number of eyeballs.
Overall, it's bad for everyone if someone can create fraudulent views: us, other companies, and most importantly, consumers.
> taking that time to improve your competing product to make it better.
Took less than 3 minutes to do the math and send the report. I'm a fast developer, but I can't improve our product that fast :-)
However, given that sourcehut eschews the use of such "social metrics" (at some level I agree with the principle behind that; on the other hand, I do appreciate the value of being able to give visibility to good projects), I usually mention in my README that "If you like the project and wish to promote it, feel free to star this github page".
I'm sure github probably wouldn't like this use-case, but the stars would certainly be genuine, even if possibly quite dodgy-looking.
It's still getting starred...
clearly you did too good a job on the README
Having said that, it may be worth thinking what is the price we may be paying as a community for this convenience, btw. MS Github is clearly already past the "embrace" phase, and well into the "extend" phase.
That was a bit shocking to me to learn.
It's unfortunate as I've seen stars used as a metric of trustworthiness in general user discussions.
I always expected there was a market for fake stars. I am trying to get a repo naturally to 1000 stars, but I would never buy them.
Heuristics can only be used to identify suspected spammers. Not everyone who behaves like a spammer is a spammer, it could be e.g. a random user with privacy settings on, or someone who didn’t update their bio in a while and it got affected by link rot, etc.
Even if a group of low activity accounts stars the same projects, it could be that the account owners just discuss these projects elsewhere.
[1]: https://github.com/andrewmcwattersandco/github-statistics
[2]: https://github.com/Homebrew/brew/blob/master/docs/Acceptable...
That first use is not unreasonable, in my opinion. The second one is questionable, at best.
There are obvious numeric usernames, but also fake orgs with repos for the users to fork and interact with, and a few account takeovers (i.e. someone had signed up for GitHub in 2015 to make a free wedding website, abandoned it, and the account fell into spammer hands). These used to be easier to report.
With collaterals too I presume [1]. I guess I've been the victim of some automated system. They have banned my account without warning or explanation and they've been ignoring my support tickets for about 2 months!
My own angle is that copilot has shifted the incentives around this practice, maybe substantially. Businesses want to get (free tiers of) their paid SaaS endpoints into copilot suggestions - it's a great funnel!
I'd guess that github is as likely as not to become an SEO spam battlefield (like the rest of the web).
That’s so brilliantly evil…
I can see the next generation of “how I got to $3m in passive income” articles being written (by ChatGPT) right now.
Show HN: there are maybe dozens of those posted every day but they rarely hit the main page.
A Reddit ad is great to kick off the star growth, but unless you have something interesting to many people, don’t expect more than 50 stars on the first day, followed by a plateau of a star every few days.
Most GH stars I’ve got was from somebody mentioning my project in comment in some heated discussion on HN. So I guess drama sells?
Kind of ironic that they’re using blog articles and social media to pander for more stars on their GitHub project.
https://twitter.com/Alexey__Kovalev/status/87184200877156761...
Leaving the arena is the only viable option. Software projects that aren't dependent on github drive their own vehicle, everyone else is on a crowded bus.
https://github.com/Hellisotherpeople/Bright
Edit: I love clustering, I really do, but I think that techniques like the one I am using are far superior to unsupervised learning for trying to detect fake accounts in this context.
I wonder if this is also in general OSINT or ISC^2 training - everything this article showed for breadtrails and reverse operation (e.g. pay a company to do the work, see how it is, evaluate the results, see if you can find other work similar/akin to it.)
Instead I found a system that seems to be thoughtfully designed and, crucially, easy to debug.