First of all, "the algorithm" is probably hundreds of thousands of lines of code, including all the tedious boilerplate like cache policies and multi-AZ logic.
And second of all, doesn't the algorithm include machine learning components, which are trained on terabytes of data? That data will likely be impossible to open source. And open sourcing the neural nets without the training data is mostly meaningless from a transparency perspective?
That's an interesting point. A practical description of the algorithm from the perspective of someone trying to game it may be more useful than anything Twitter or Google would release.
The gesture carries weight with users too.
Not sure any big company has tried this before. I could be wrong, but either way looking forward to it / FWIW hope it catches on.
The point of releasing it is to let people know exactly why they see the tweets they do in the order they do. I hope Elon just goes back to time-based ordering of tweets.
Even the people who build these systems barely know what the algorithm is going to do, much less why. It will be a herculean task to try and convey that to an average user.
And developers will be able to train a model on a subset of Twitter data using it. It's just that the quality of the outcome won't be the same as with the full set of Twitter data.
At the minimum, I would make a private GitHub repo first, add all the relevant commits, and then make it public once there's actually content.
Either this is a mistake, or this is a really, really misguided attempt at a joke from Twitter.
https://twitter.com/willnorris/status/1518694675909013504
Which seems like a promise they intend to actually open source something there.
So if it was a WIP, it'd be a private repo until it's ready to release publicly.
That's an algorithm.
I mean even if timelines were totally random, or based on some external facts, there is an algorithm that is being used to order them.
This isn't just an academic distinction. Claiming 'there is no algorithm' because the algorithm is intentionally or unintentionally obfuscated or complicated has implications if that claim is accepted. If my algorithm for approving mortgage applications is explicitly racist, can I just spread its functionality across myriad services owned by lots of teams, make it almost impossible to figure out how it works, and then avoid any responsibility by saying 'there is no algorithm to decide loan approvals'? That would be bullshit!
And does every user have their own algorithm?
And could it be made readable to a human?
- Search results
- Comment Order
- Timeline Order
- Trends
- Human vs code
Personalization in general. Big gigantic “why” when it happens to you
You know what? This does demonstrate the internal problems inside Twitter and shows the need for shakeup.
There is no "no algorithm."
It's an interchangeable function; it would only be published if it were clear to leadership that revenue wouldn't suffer once people started gaming toward the published algorithm.
But there definitely is a relationship algo that could be considered theirs, like all social media platforms inflating the bubbles their users feel.
Even if we were to open source all associated code and publish all related documents it would be very difficult to make sense of the entire system. That is precisely why companies such as Twitter A/B test the hell out of everything. What most people think of as "the algorithm" is a complex system that receives many inputs (maybe hundreds) and has dependencies on many other internal Twitter services. Tweets likely pass through multiple filtering steps as well as scoring before you ever see them. Each of these steps is highly contextual, depending on: location, past tweets, verification status, etc. You can attempt to predict the effect of a certain change, but you never know the actual outcome until you test it.
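A toy sketch of that multi-stage shape, filtering then contextual scoring then ordering. All field names, signals, and weights here are invented for illustration; the real pipeline has hundreds of inputs and many more stages:

```python
from dataclasses import dataclass, field

@dataclass
class Context:
    # Hypothetical request context; a real system carries hundreds of signals.
    location: str
    verified: bool
    past_topics: set = field(default_factory=set)

def candidate_filter(tweets, ctx):
    # One of potentially many filtering stages (language, blocks, abuse, ...).
    return [t for t in tweets if t["lang"] == "en" or ctx.location != "US"]

def score(tweet, ctx):
    # Toy scoring: topic affinity plus a small verification boost.
    s = 1.0 if tweet["topic"] in ctx.past_topics else 0.1
    if tweet["verified_author"] and ctx.verified:
        s += 0.2
    return s

def rank(tweets, ctx):
    # Filter, then score, then order -- the shape, not the substance.
    return sorted(candidate_filter(tweets, ctx),
                  key=lambda t: score(t, ctx), reverse=True)
```

Even in a sketch this small, the output depends on context; changing any one weight reshuffles results, which is why only an A/B test reveals the actual effect.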
I think what will ultimately happen is that _some_ details will be published. Elon will parade that around as a victory for free speech as Twitter is now more "open". In reality, nothing of value will be gained as "the algorithm" isn't a simple function.
There is typically a clear objective function for a recommendation system.
What Twitter is optimizing for is what’s of interest here. And some of the hidden business rules. It’s likely these are specified in the code in an obvious way.
How exactly they achieve that is the part that is complex and relatively indecipherable.
It’s possible that it’s designed in such a way the optimization objectives are also unclear, but that would indicate a bad design and be to the detriment of the company and users.
Many complicated research papers have had no issues describing their models at a high level. This should be no different.
My point is that the devil is in the details and implementation. These details are likely something that no one person understands and no one person is able to fix. The concept of being able to extract "the algorithm", factor it out from the codebase and share it with the public doesn't make sense to me. It won't be possible to fully understand how Twitter serves recommendations and ranks posts without understanding how all the different services at Twitter interact. Are they planning on open sourcing all of Twitter? Highly doubtful.
All these feed rankings are complex combinations of features and models, coupled with weights and filters. On top of this, abuse-detection layers are added.
Unless Musk is planning to open source user data to show what all the "scores" and "features" for all the entities are and how they were arrived at, this will make no sense. The whole argument from some people who were downranked has been: why me? Just writing a whitepaper describing the general methodology is not going to make that go away.
On top of that, exposing every vector through which you measure and stop abuse, will just allow for more sophisticated abuse.
Twitter's been pretty transparent about how it "deranks" certain accounts [1]. What more would come from opening code that would certainly not include the actual database of "no no terms" (if you believe that exists)?
[1] https://blog.twitter.com/en_us/topics/company/2022/our-ongoi...
[0]: https://mashable.com/article/eu-digital-services-act-big-tec...
No one said that. You created a straw man and are arguing with it.
This comment says more about you than you think.
Open sourcing the algorithm or code is not about everyone going and analyzing it; rather, when controversy or issues arise, it will be readily available for independent experts to review.
We can think of the main interaction as being a query which is an RPC payload. The contents contain the user request and a wide amount of other context (either referenced by a collection of keys like cookies, or materialized like fields that specify the user's age) and the response is a web page which contains sections (the web search response to the query, as well as the ads; either these could be rendered to two different frames, or interspersed, by the result presentation engine).
That query -> frontend step translates into a tree or a graph of requests that collect up various bits of contextual data required to satisfy the query. For example, the query terms might be rewritten slightly and then sent to a web search backend that searches/ranks documents and returns the top matching documents on the organic web, or sent to an ads backend that returns the top matching bidders for those query terms. Again, just RPC/response, although the actual context that the frontend and backend systems are dealing with, and use to modify the result, is truly enormous.
Each of those backend systems itself was produced with an enormous amount of data processing and contextual data that is available at serving time. All of this is implemented using various algorithms; everything from the TCP algorithms that manage bandwidth to the neural networks doing inference on the joint product of the user context and the query context and the ad context, and the logging system that writes the queries and their clicks to centralized storage for more ML training.
In theory, though, you could set up a system that compiled the full web stack and ran a user query end to end, dumping all the intermediate RPCs, etc., from a modestly sized instantiation of the production system. People could then sit down and inspect which terms affected query result order, which pages were omitted at which part of the filtering, or what data was logged.
It would be hell for a team to maintain and keep up to date with the production system, but many folks do this anyway to have a simple version of the system around, so they can make quick changes and see if they break part of the complex system without doing a full deployment.
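The fan-out shape described above can be sketched in a few lines. Everything here is hypothetical (the backend names, the payload fields); the point is only that a frontend turns one query into a tree of parallel backend requests and assembles the responses:

```python
import concurrent.futures

# Hypothetical backends; a real frontend issues RPCs to other services,
# not local function calls.
def web_search_backend(query):
    return [{"kind": "organic", "doc": f"result for {query!r}"}]

def ads_backend(query):
    return [{"kind": "ad", "doc": f"ad for {query!r}"}]

def frontend(query, user_context):
    # Fan the query out to the backends in parallel, then let a
    # presentation step combine the responses into one page's sections.
    with concurrent.futures.ThreadPoolExecutor() as pool:
        organic = pool.submit(web_search_backend, query)
        ads = pool.submit(ads_backend, query)
        sections = organic.result() + ads.result()
    return {"query": query, "context": user_context, "sections": sections}
```

A "replay" harness for debugging would run exactly this shape against a small instantiation of the stack and log every intermediate request and response.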
It can be pseudo-code or a diagram or whatever else can be used to understand what logic lies behind the decision making.
There are ways of translating trained ML models and associated systems into understandable hierarchical rules.
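One such technique is extracting if/then rules from a tree-structured model (real pipelines usually go through a surrogate decision tree or tooling like scikit-learn's `export_text`). A hand-rolled sketch, with an entirely made-up toy tree and feature names:

```python
def tree_to_rules(node, conditions=()):
    # Walk a decision tree (nested dicts) and emit one readable rule per leaf.
    if "leaf" in node:
        clause = " AND ".join(conditions) or "always"
        return [f"IF {clause} THEN {node['leaf']}"]
    feature, threshold = node["feature"], node["threshold"]
    rules = tree_to_rules(node["left"], conditions + (f"{feature} <= {threshold}",))
    rules += tree_to_rules(node["right"], conditions + (f"{feature} > {threshold}",))
    return rules

# Invented toy tree, purely for illustration.
toy_tree = {
    "feature": "follower_overlap", "threshold": 0.5,
    "left": {"leaf": "downrank"},
    "right": {"feature": "recency_hours", "threshold": 24,
              "left": {"leaf": "show"},
              "right": {"leaf": "downrank"}},
}
```

`tree_to_rules(toy_tree)` yields three human-readable rules, one per leaf, which is the kind of artifact a lay reader could actually inspect.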
Twitter's timeline is NOT AGI.
In fact, a lot of people here really think what people are talking about is the equivalent of what is handled in a subroutine.
No, what people are talking about when they talk about "the algorithm" is anything affecting the result set they're reading. Concepts like eventual consistency and edge computing are... well... a part of a model which laypeople, and even reasonably technical people call an "algorithm."
Being pedantic about whether or not this happens in an SQL query, or across multiple codebases, or by region, doesn't escape the question.
> Being pedantic about whether or not this happens in an SQL query, or across multiple codebases, or by region, doesn't escape the question.
Actually, epistemic ~"muddying of the waters" is a well-proven technique for controlling perceptions and public discourse. If it works on HN folks, I expect it would work much more easily on amateurs.
I say this as someone whose political views, if you force them onto the left-right spectrum, probably end up about 80% toward the left. E.g. I've spent millions over the past several elections supporting the Democrats.
It used to be that censorship was something the right did, and free speech was something the left were in favor of. But over the last few decades, banning "problematic" ideas has become a huge component of left culture (http://paulgraham.com/heresy.html).
Plus tech companies in general, and especially Twitter, lean to the left. Imagine walking around Twitter pre-Covid. You'd find plenty of openly far-left employees. How many openly far-right employees would you find? I don't think you'd find any.
The combination of (a) the left's recent focus on banning heretical ideas, (b) the leftward lean of tech companies generally, and (c) the leftward lean of Twitter even among tech companies, means that right-wing speech is much more likely to get banned on Twitter than left.
That's why people on the far right keep starting lame Twitter alternatives. You don't see people on the far left doing that. They don't need to. They have Twitter.
Even if that's precisely true, is it not good to be creating a more trusted space for everyone? The grievances, regardless of merit, are mostly coming from the right. If you want to create a service that caters to all you're going to have to address their concerns. If he can do that in a way that is fair to all, it sounds like a win to me.
https://amp.usatoday.com/amp/1248099002
https://www.vox.com/2017/6/27/15878980/europe-fine-google-an...
https://i.imgur.com/MVlshAT.png
You don't have to be conservative to see there's a pretty significant bias, just in the headlines. I'm a Pacific Green and I can still see it.
It may not have been algorithmic, but it definitely happened.
Whatever. He paid for it. Private company. Do what it wants.
> it would be very difficult to make sense of the entire system
No. Not buying that. Difficult isn't the same as impossible, and, if only to game the system (harder), people will figure it out. And even if it isn't 100% possible to reproduce the results based on what is released, significant insights will still emerge.
Further, there is some ceiling on the complexity. Twitter operates at scale and that means they can't actually burn 52kWh of power for every tweet or store TiBs of metadata for every user to do the analysis or take 30 minutes to publish. Likely it's a pretty efficient system and, therefore, limited in complexity.
What could a rogue employee do?
I was actually wondering whether some people may want to remove traces of what they have been doing.
I wish that someday we can see the internal communications that led to the Hunter Biden laptop story ban.
And having been in a company that was taken over, it's a mixture of emotions - is my job safe, will this be the same culture I joined for etc. etc.
This is an interesting question, since RSUs are a big part of total comp: how are unvested RSUs dealt with when the stock is retired? Are they put on a future cash comp schedule? And if so, at what conversion rate?
error forking repo: HTTP 403: The repository exists, but it contains no Git content. Empty repositories cannot be forked. (https://api.github.com/repos/twitter/the-algorithm/forks)
My thoughts:
- Explicit rules for temporary and permanent bans
- Edit button
- More fun and thoughtful conversations like HN
- Less thought bubble Brooklyn based reporters, less VC and side grind hustle snake oil, maybe more comedians and memes?
I think there is a place for a smarter algorithm than "ORDER BY date DESC", but one that is not designed to manipulate users into addiction.
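One shape such an algorithm could take, purely as a hypothetical: score almost entirely on recency, with a mild boost for authors the reader genuinely interacts with, and no engagement-bait terms at all. The field names and constants below are invented:

```python
import math
import time

def simple_score(tweet, now=None):
    # A hypothetical middle ground between "ORDER BY date DESC" and a full
    # engagement-maximizing ranker: mostly recency, gently boosted by how
    # often the reader interacts with the author.
    now = time.time() if now is None else now
    age_hours = (now - tweet["created_at"]) / 3600
    recency = math.exp(-age_hours / 12)              # ~12h decay constant
    affinity = 1.0 + 0.5 * tweet["author_affinity"]  # affinity in [0, 1]
    return recency * affinity
```

With a bounded affinity term, no amount of gaming can keep a stale tweet on top, which is the property that distinguishes this from addiction-optimized feeds.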
even when following too many to read everything, i preferred chrono because it would yield a coherent slice of what was happening. an unbiased sample.
twitter is basically a medium for conversation.
imagine there's a large party. would you rather listen to an out-of-order "most important" set parts of the conversation, or just a slice of conversation from a particular time?
well, actually, both can be interesting, but generally the slice is more coherent. :-)
1. Insertion of tweet to tweets table.
2. Insertion of that tweet-id to the home timelines of all that user's followers.
3. Insertion of that tweet-id to the user-timeline of that user.
On the read path, if I'm not mistaken, the only join that happens is between the requested timeline and the tweets table (which is replicated across a cluster of machines but not partitioned, or at least I remember reading that was the case not many years ago).
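The three write-path steps and the read-path join above can be sketched with in-memory structures (the real store is a replicated cluster, not Python dicts):

```python
from collections import defaultdict

tweets = {}                        # tweet_id -> tweet body
followers = defaultdict(set)       # user -> set of followers
home_timeline = defaultdict(list)  # user -> [tweet_id, ...]
user_timeline = defaultdict(list)  # user -> [tweet_id, ...]

def post_tweet(tweet_id, author, text):
    # Fan-out-on-write: do the expensive work once, at posting time.
    tweets[tweet_id] = {"author": author, "text": text}  # 1. tweets table
    for follower in followers[author]:
        home_timeline[follower].append(tweet_id)         # 2. followers' home timelines
    user_timeline[author].append(tweet_id)               # 3. author's own timeline

def read_home(user):
    # Read path: the only "join" maps stored timeline ids to tweet bodies.
    return [tweets[tid] for tid in home_timeline[user]]
```

The trade-off is classic: writes touch every follower's timeline, but reads become a cheap id lookup rather than a fan-in query across everyone the user follows.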
For about a week they made a change that prevented the chronological timeline from being the default, but they reasonably quickly rolled that back. https://www.theverge.com/2022/3/14/22977782/twitter-default-...
I then use tweet deck which shows a column of tweets per list.
As these are separated by subject and are chronological, it makes it far easier to follow.
Twitter's EU user base is probably [3] above the 45 million threshold that triggers the strictest transparency requirements under the Act. So perhaps they figure if they're going to be forced to disclose anyway, they might as well do it proactively.
[1] If it's even coherent to talk about their feed ranking system as a single algorithm — see the other comments in this thread.
[2] https://www.theverge.com/2022/4/23/23036976/eu-digital-servi...
[3] https://www.statista.com/statistics/242606/number-of-active-...
So not a troll, but yes it is odd to put up an empty repo, and announce the repo before there is anything in it.
That doesn’t mean it’s a joke, I see it as a show of goodwill — that there are a handful of people inside Twitter that are excited for transparency and for a revenue model that isn’t entirely based on ads, that are excited to get to work on this right away.
So I'm not sure what the ultimate point of this exercise is other than producing faux-transparency.
I think only if you offer Twitter users the level of First Amendment protection they'd expect from a government body. Otherwise reporting to Congress would be a bald-faced circumvention of the First Amendment. Twitter is a privately held company with no need to report to Congress.
There is great opportunity to abuse this by Twitter, yes. There is also a lot of money to be made. But in defense of some of that being secret, is the fact that any publicly known ruleset (with no hidden exceptions) _will_ be exploited by bad actors. Imagine if search engines told spam sites exactly why their site dropped in page rankings.
Elon polled Twitter users about this and the response was overwhelmingly in favor of open source and transparency. Everyone on Twitter got a vote.
If you oppose transparency, as many now are, you lose your credibility. So it’s another one of Elon’s people hacks, and look at all the morons falling for it.
Like, there's no public admission right now of whether "shadow banning" or "ghost banning" is even officially a thing!
Some transparency seems unquestionably more powerful than none, and we can work from there.
Maybe that is where it is going.
That seems... bizarre to me?
* Chronological - reverse sort by date
* Home - for all of the followed topics, recommended topics, retweets, and tweets from the past day, estimate the level of engagement, include the highest-scoring ones, and reverse sort by date. This is likely to be a fairly basic ML model.
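The two modes above could be sketched like this, treating the engagement model as an opaque callable (the threshold and field names are invented for illustration):

```python
def chronological(tweets):
    # Reverse sort by date.
    return sorted(tweets, key=lambda t: t["created_at"], reverse=True)

def home(tweets, predicted_engagement, threshold=0.5):
    # Hypothetical Home mode: keep tweets whose predicted engagement
    # (from some basic ML model) clears a bar, then reverse sort the
    # survivors by date.
    picked = [t for t in tweets if predicted_engagement(t) >= threshold]
    return chronological(picked)
```

All the actual interest lives inside `predicted_engagement`, which is exactly the part a high-level release would not meaningfully expose.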
It will be uncontroversial, technically unsophisticated and of no practical use to anyone - users, developers or researchers.
This is not going to be PageRank where some genuine new insight was discovered.
I've built hundreds of models and run an ML company, and I don't believe it's technically possible for this not to be the case.
I imagine they'd probably start with documentation and white-papers that communicate "here's how we intend for it to work".
It's seriously unlikely anyone at Twitter actually knows how any non-trivial algorithm in the company works. To figure THAT out, they could decide to do a company-wide documentation and instrumentation push like they probably would've had to do for GDPR anyway, which is painful and boring and going to take a very long time.
Failing that, they could just say 'the algorithm as it stands is no longer fit for purpose, given part of its core requirement has become that it needs to be transparent and publishable, and presumably legible. We need to make a new one. Publish the core algorithm. We probably won't deploy it in that exact state, it's going to span multi-services and so on, you obviously don't get the data we used to train the models, but we will work backwards from it and here's an open mechanism to measure how true-to-form it actually is'
if twitter is a game, sinking $43bn into it is kinda like winning or losing the grand final boss level. (unclear which)
wish elon would get back to facilitating the building of useful things. we still don't have a great clean energy generation story.