It’s not like we can throw away all the inductive biases and MSA machinery, someone upstream still had to build and run those models to create the training corpus.
My rough understanding of the field is that a "rough" generative model makes a bunch of decent guesses, and more formal "verifiers" ensure they abide by the laws of physics and geometry. The AI reduces the unfathomably large search space so the expensive simulation doesn't waste so much work on dead ends. If the guessing network improves, then the whole process speeds up.
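As a toy sketch of that "cheap guesser, expensive verifier" loop (all names here are illustrative stand-ins, not from any real folding codebase):

```python
# Illustrative only: a fast generator proposes candidates, and an expensive
# verifier is run only on that reduced set rather than the whole search space.
import random

def cheap_generator(n_guesses, rng):
    """Stand-in for a fast generative model: proposes candidate solutions."""
    return [rng.uniform(-10.0, 10.0) for _ in range(n_guesses)]

def expensive_verifier(x):
    """Stand-in for a physics/geometry check: accepts candidates near the
    true answer (here, arbitrarily, 3.0)."""
    return abs(x - 3.0) < 0.5

def search(n_guesses=1000, seed=0):
    rng = random.Random(seed)
    candidates = cheap_generator(n_guesses, rng)
    # Only the surviving candidates pay the verifier's cost.
    return [x for x in candidates if expensive_verifier(x)]

survivors = search()
```

A better generator concentrates its guesses near verifiable answers, so fewer verifier calls are wasted; that's the speedup the comment describes.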
- I'm recalling the increasingly complex transfer functions in recurrent networks,
- The deep pre-processing chains before skip connections.
- The complex normalization objectives before ReLU.
- The convoluted multi-objective GAN networks before diffusion.
- The complex multi-pass models before fully convolutional networks.
So basically, I'm very excited by this. Not because this itself is an optimal architecture, but precisely because it isn't!
Using MSAs might be a local optimum. ESM showed good performance on some protein problems without MSAs. MSAs offer a nice inductive bias and better average performance. However, the cost is doing poorly on proteins where MSAs are not accurate. These include B and T cell receptors, which are clinically very relevant.
Isomorphic Labs, Oxford, MRC, and others have started the OpenBind Consortium (https://openbind.uk) to generate large-scale structure and affinity data. I believe that once more data is available, MSAs will be less relevant as model inputs. They are "too linear".
Only if you are willing to call a billion years of evolutionary selection a "simple ruleset"
It seems like the Folding @Home project is still around!
In other words, it's a different approach that trades versatility for speed, but that trade-off is significant enough to make it viable to generate folds for practically any protein you're interested in. It moves folding from something that's almost computationally infeasible for most projects to something you can just do for any protein as part of a normal workflow.
2. The biggest difference between folding@home and alphafold is that folding@home tries to generate the full folding trajectory while alphafold is just protein structure prediction; only looking to match the folded crystal structure. Folding@home can do things like look into how a mutation may make a protein take longer to fold or be more or less stable in its folded state. Alphafold doesn’t try to do that.
https://www.distributed.net/RC5
https://en.wikipedia.org/wiki/RSA_Secret-Key_Challenge
I wonder what kind of performance I would get on an M1 computer today... haha
EDIT: people are still participating in rc5-72...?? https://stats.distributed.net/projects.php?project_id=8
[1] https://foldingathome.org/2024/05/02/alphafold-opens-new-opp...
Maybe these are just projects they use to test and polish their AI chips? Not sure.
Frankly, it's a great idea. If you are a small pharma company, being able to do quick local inference removes lots of barriers and gatekeeping. You can even afford to do some Bayesian optimization or RL with lab feedback on some generated sequences.
In comparison, running AlphaFold requires significant resources. And IMHO, their usage of multiple alignments is a bit hacky, makes performance worse on proteins without close homologs, and requires tons of preprocessing.
A few years back, ESM from Meta already demonstrated that alignment-free approaches are possible and perform well. AlphaFold has no secret sauce, it's just a seq2seq problem, and many different approaches work well, including attention-free SSMs.
and now I'm even more curious why they thought "light aqua" vs "deep teal" would be a good choice
The different colours are for the predicted and 'real' (ground truth) models. The fact that it is hard to distinguish is partly the - as you point out - weird colour choice, but also because they are so close together. An inaccurate prediction would have parts that stand out more as they would not align well in 3D space.
https://genomely.substack.com/p/simplefold-and-the-future-of...
But as with anything in research, it will take months and years to see what the actual implications are. Predictions of future directions can only go so far!
People often like to say that we just need one or two more algorithmic breakthroughs for AGI. But in reality it's the dataset and environment-based learning. Almost any model would do if you collected the data. It's not in the model; it's outside where we need to work.
Doing too many things at once makes methods hard to adopt and conclusions harder to draw. So we try to find simple methods that show measurable gains, so we can adapt them to future approaches.
It's a cycle between complexity and simplicity. When a new simple and scalable approach beats the previous state of the art, that just means we've discovered a new local-maximum hill to climb.
However, it seems like anyone can download the parameters for AlphaFold V2: https://github.com/google-deepmind/alphafold?tab=readme-ov-f...
Then why do we need customized LLM models, two of which seemed to require the resources of 2 of the wealthiest companies on earth (this and google's alphafold) to do it?
It's indeed a large model. But if you know the history of the field, it's a massive improvement. It has progressed from an almost "NP" problem only barely approachable with distributed cluster compute, to something that can run on a single server with some pricey hardware. The smallest model here is only 100M parameters and the largest is 3B parameters; that's very approachable to run locally with the right hardware, and easily within range for a small biotech lab (compared to the cost of other biotech equipment).
It's also (I'd argue) one of the only truly economically and socially valuable AI technologies we've found over the past few years. Every simulated protein fold saves a biotech company weeks of work by highly skilled biotech engineers and very expensive chemicals (in a way that truly supplements rather than replaces the work). Any progress in the field is a huge win for society.
This doesn't seem like particularly wasteful overinvestment.
Granted, I'm more excited about the research coming out of arc
Predicting the end result directly from the protein sequence is prone to missing any new phenomenon and would just regurgitate/interpolate the training datasets.
I would much prefer an approach based on first principles.
In theory folding is easy: it's just running a simulation of your protein surrounded by some water molecules for the same number of nanoseconds nature does.
The problem is that this usually takes a long time, because evolving the system requires computing its energy as a function of the positions of the atoms, which is a complex problem involving quantum mechanics. It's mostly due to the behavior of the electrons, but because they are much lighter, they operate on a faster timescale. You typically don't care about them, only about the effect they have on your atoms.
In the past, you would use various Lennard-Jones potentials for pairs of atoms when the atoms are unbonded, and other potentials when they are bonded, and it would get very complex very quickly. But now there are deep-learning-based approaches that compute the energy of the system with a neural network (see (Gromacs) Neural Network Potentials: https://rowansci.com/publications/introduction-to-nnps ). You train these networks to learn the local interactions between atoms from trajectories generated by ab-initio theories. This gives you a faster simulator which approximates the more complex physics. In a sense, it just tabulates, via a neural network, the effect the electrons would have in a specific atom arrangement according to the theory you have chosen.
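A minimal sketch of the classical energy computation described above, using the Lennard-Jones pair potential; a neural-network potential would replace the hand-written `pair_energy` with a learned function (everything here is illustrative, not any real MD package's API):

```python
import math
from itertools import combinations

def pair_energy(r, epsilon=1.0, sigma=1.0):
    """Classical Lennard-Jones potential for one unbonded atom pair.
    An NNP would swap this closed form for a trained network."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)

def total_energy(positions):
    """Sum pair energies over every atom pair (naive O(n^2) loop)."""
    e = 0.0
    for p, q in combinations(positions, 2):
        e += pair_energy(math.dist(p, q))
    return e

# Two atoms at the LJ minimum separation r = 2^(1/6) * sigma
# sit at the bottom of the potential well, energy -epsilon.
atoms = [(0.0, 0.0, 0.0), (2 ** (1 / 6), 0.0, 0.0)]
```

The simulator then uses (gradients of) this energy as forces to step the atoms forward in time.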
If at any point you have doubts, you can always run the slower simulator in a small local neighborhood to check that the effective-field neural-network approximation holds.
Only once you have a simulator that is able to fold can you generate a dataset of pairs "protein sequence" to "end of trajectory", to learn the shortcut like Alpha/Simple/Fold do. And when in doubt you can go back to the slower, more precise method.
If you had enough data and could perfectly train a model with sufficient representational power, you could theoretically infer the correct physics just from the correspondence between initial and final arrangements. But if you don't have enough data, it will just learn some shortcut, and you accept that it will sometimes be wrong.
No, the environment is important. Also, some proteins fold while being sequenced.
Folding can also take minutes in some cases, which is the real problem.
> which is a complex problem involving Quantum Mechanics
Most MD simulations use classical approximations, and I don't see why folding is any different.
Speeding up the folding is not the real problem; knowing what happens is. One way to speed up the process is just to minimize the free energy of the configuration (or some other quantity you derive from the neural-network potential). (That's what the game Foldit was about: minimizing the Rosetta energy function.) Another way would be to use a generative method like a diffusion model to generate a plausible full trajectory (but you need some training dataset to bootstrap the process). Or work with key configuration frames: the simulation can take a long time, but it passes through specific arrangements (the transitions between energy plateaus), and you learn these key points.
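The "just minimize the energy" shortcut can be sketched as plain gradient descent on a toy one-dimensional energy landscape; a real pipeline would descend the (neural-network) potential over all atomic coordinates instead, and the quadratic `energy` here is purely illustrative:

```python
# Toy "fold by minimizing energy": gradient descent on E(x) = (x - 2)^2,
# whose minimum stands in for the folded (lowest-energy) configuration.
def energy(x):
    return (x - 2.0) ** 2

def grad(x, h=1e-6):
    # Central-difference numerical gradient; an NNP framework would
    # supply analytic forces instead.
    return (energy(x + h) - energy(x - h)) / (2.0 * h)

def minimize(x0, lr=0.1, steps=200):
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

x_min = minimize(10.0)  # converges toward the energy minimum at x = 2
```

The catch, as the comment notes, is that reaching a low-energy state tells you little about the trajectory or kinetics of getting there.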
The simulator can also be much faster because it doesn't have to consider every pair of atoms: the naive O(n^2) behavior drops to O(n), with n the number of atoms (and the bigger constant of running the neural network hidden inside the O notation).
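The reason a distance cutoff gives roughly O(n) scaling is that each atom only interacts with the handful of neighbors inside the cutoff radius, so the interaction count grows linearly with n. A small illustrative count (real codes use cell or neighbor lists to *find* those pairs without the O(n^2) scan shown here):

```python
import math

def count_pairs_within_cutoff(positions, cutoff):
    """Count interacting pairs under a distance cutoff (brute-force scan,
    for illustration only)."""
    count = 0
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            if math.dist(positions[i], positions[j]) <= cutoff:
                count += 1
    return count

# 100 atoms on a line with spacing 1.0; with cutoff 1.5 each atom only
# "sees" its immediate neighbors, so pairs scale ~linearly with n
# instead of the n*(n-1)/2 = 4950 total pairs.
chain = [(float(i), 0.0, 0.0) for i in range(100)]
```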
The simulations are classical, but fundamentally they rely on the shape of the electron clouds. The electron density can deform (that's what bonding is), providing additional degrees of freedom and allowing the atom configuration to slide more easily against itself and avoid getting stuck in local optima. Fortunately, all this mess is nicely encapsulated inside the neural network potential, and we can work without worrying about the electrons, their shape being implicitly defined by the current positions of the atoms (the faster timescale makes abstracting their behaviour via the implicit function theorem sound).
I am not trying to defend Apple or Siri by any means. I think the product absolutely should (and will) improve. I am just curious to explore why there is such negativity being directed specifically at Apple's AI assistant.
1. It seems to be actively getting worse. On a daily basis, I see it responding to queries nonsensically, like when I say “play (song) by (artist)” (I have Apple Music) and it opens my Sirius app and puts on a random thing that isn’t even that artist. Other trivial commands are frequently just met with apologies or searching the web.
2. Over a year ago Apple conducted a flashy announcement full of promises about how Siri would not only do the things that it’s been marketed as being able to do for the last decade, but also things that no one has seen an assistant do. Many people believe that announcement was based on fantasy thinking and those people are looking more and more correct every day that Apple ships no actual improvements to Siri.
3. Apple also shipped a visual overhaul of how Siri looks, which gives the impression that work has been done, leading people to be even more disappointed when Siri continues to be a pile of trash.
4. The only competitor that makes sense to compare is Google, since no one else has access to do useful things on your device with your data. At least Google has a clear path to an LLM-based assistant, since they’ve built an LLM. It seems believable that Android users will have access to a Gemini-based assistant, whereas it appears to most of us that Apple’s internal dysfunction has rendered them unable to ship something of that caliber.
And now that we have ChatGPT with voice mode, Gemini Live, etc., which have incredible speech recognition and reasoning by comparison, it's harder to still argue that "every voice assistant is bad".
If I could buy a phone without an assistant I would see that as a desirable feature.
Meanwhile, people expect perfection from Siri. At this point a new version of Siri will never live up to people’s expectations. Had they released something on-par with ChatGPT, people would hate it and probably file a class action lawsuit against Apple over it.
The entire company isn’t going to work on Siri. In a large company there are a lot of priorities, and some things that happen on the side as well. For all we know this was one person’s weekend project to help learn something new that will later be applied to the priorities.
I’ve made plenty of hobby projects related to work that weren’t important or priorities, but what I learned along the way proved extremely valuable to key deliverables down the road.
> We largely adopt the data pipeline implemented in Boltz-1 (https://github.com/jwohlwend/boltz; Wohlwend et al., 2024), which is an open-source replication of AlphaFold3
I believe the story here is largely that they simplified the architecture and scaled it to 3B parameters while maintaining leading results.