trivexwe 3 days ago

Weird article.

It mentions multiple times that ~”the protein folding problem is solved” as well as multiple instances of ~”but there are limitations to this technique and it is often missing crucial details”.

It really is difficult to conceptualize these highly nonlinear problem spaces, like protein folding, until you attempt to work with them.

Many in software development have an intuitive understanding of this difficulty, as evidenced by the community’s ~“the last 10% took 100% of the time” meme.

Even in nonlinear problem spaces you have “trivial” solutions.

Terry Tao famously coauthored a paper on arithmetic progressions of primes.[1] The progressions found are “trivial” in terms of “solving the prime sequence problem” in that they are sparse, the sequences are finite, and the progressions come with no method for finding more.

These machine learning tools are by design approximation engines. I’m unaware of any results that prove, one way or the other, whether it is possible to pass a bound of approximation that yields exact solutions. (Think: an approximate solution that fails to be exact only for solutions that are trivial by some other method. I believe a lot of work in p-adics is motivated similarly.)

I feel these machine learning techniques are expanding the definition of “trivial solutions” to include those capable of being solved by their convoluted methods (backprop, etc.). Since this new subset of the space that can be labeled “solved” appears more complex than the known trivial solutions, people assume the whole space must be known, and this is where the difficulty of conceptualization rears its head.

Protein folding is still an unsolved problem, and I’m dubious of the notion machine learning will ever solve it, but hopefully we get some helpful science out of it.

[1] https://en.m.wikipedia.org/w/index.php?title=Green%E2%80%93T...

  • eru 3 days ago

    > Protein folding is still an unsolved problem, and I’m dubious of the notion machine learning will ever solve it, but hopefully we get some helpful science out of it.

    As a working hypothesis, protein folding assumes that a protein folds into the globally lowest energy configuration. And that's a good assumption for a start.

    However, nature isn't magic and can't magically solve global optimisation problems. If there's a region in configuration space with a local minimum and high enough energy 'walls', that can be enough for the protein to be stable.

    For reasons of computational complexity, I agree that machine learning will probably never solve the global minimisation problem. But the complicated and messy local optimisation problem that we see in reality might very well be solvable eventually by something like machine learning.

    Why are you dubious? Where do your objections come from?

    • gabia 2 days ago

      Great points about the energy minimisation issue. Funnily enough, this is actually a problem with de-novo protein design at the moment: the designed proteins are _too_ stable compared to natural proteins! Proteins are often not static shapes; they are machines that need to be dynamic - in other words, as you said, they do not live at some deep global optimum.

      • eru 2 days ago

        Interesting point!

        > [...] in other words what you said, they do not live at some deep global optimum.

        I think what you said only depends on the minimum being relatively flat (instead of deep); but it doesn't matter whether it's global or local.

        • exmadscientist 2 days ago

          > I think what you said only depends on the minimum being relatively flat (instead of deep); but it doesn't matter whether it's global or local

          No. There is no such thing as a "global minimum" energy conformation, because the conditions vary wildly. Many protein structure changes are brought about by changes in the local chemical potentials and even electric fields. This is not something you can get a good grip on by thinking in terms of "flat minima".

    • trivexwe 3 days ago

      > Why are you dubious? Where do your objections come from?

      That the results the machine learning techniques provide are still nondeterministic.

      Meaning that they are, in terms of identifying other local minima that satisfy the constraints, as good as a guess.

      If the provided solution also came with a method of systematic modification to derive all other solutions that satisfy the constraints, then I would be satisfied.

      Without that, you are unable to say with certainty that your local minimum is correct, even if nature fails to adhere to the lowest-energy assumption.

      > However, nature isn't magic and can't magically solve global optimisation problems.

      I wonder sometimes. Let’s remember, this is an open question after all.

      I have a long standing hypothesis that an algorithmic solution to the global optimization problem is what lends action potentials the appearance, or essence?, of what we mean when we speak of “consciousness”.

      But I am more inclined toward the abstract aspects of the mathematics behind the problem, and leave advocacy for the current techniques to researchers developing practical solutions with them.

      I applaud the people who toiled with X-ray crystallography to build the field to the point that a machine learning technique could be developed.

      • eru 2 days ago

        > That the results the machine learning techniques provide are still nondeterministic.

        I think I know what you are trying to say, but 'determinism' or not isn't the problem. You can run machine learning methods completely deterministically: just use a pseudo-random-number-generator (and be careful about how you seed it, and be wary of the problems with concurrency etc).
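        A minimal sketch of that point, using Python's stdlib `random` module as a stand-in for an ML framework's PRNG: fix the seed and the whole "stochastic" computation becomes reproducible.

```python
import random

def noisy_training_run(seed):
    """Stand-in for a stochastic training loop: the 'result' depends
    only on the PRNG stream, so fixing the seed fixes the output."""
    rng = random.Random(seed)  # private generator, immune to global state
    return [rng.gauss(0.0, 1.0) for _ in range(5)]

# Same seed -> bit-for-bit identical results on every run.
assert noisy_training_run(42) == noisy_training_run(42)
# Different seed -> (almost certainly) different results.
assert noisy_training_run(42) != noisy_training_run(43)
```

        Real frameworks expose the same idea through their own seed functions, with the concurrency caveats mentioned above.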

        > If the provided solution also came with a method of systemic modification to derive all other solutions that satisfy the constraints, then I would be satisfied.

        > Without that you are unable to say with certainty that your local minima is correct even if nature fails to adhere to the lowest energy assumption.

        Have a look at how integer linear programming solvers work. They use plenty of heuristics and non-determinism for finding the solution, but at the end they can give you a proof that what they found is optimal.

        You are right, that you don't get that kind of guarantee with current machine learning approaches. Though you could modify them in that direction. (Eg if you added machine learning to an integer linear programming solver, you would hook it in as a new heuristic, but you would still want the proof at the end.)
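        To make the solver point concrete, here is a toy branch-and-bound for 0/1 knapsack (an illustration of the principle, not how production ILP solvers are implemented): a heuristic steers the search order, but the relaxation bound is what certifies the final answer as optimal, because every pruned branch provably contained nothing better.

```python
def knapsack_bb(values, weights, capacity):
    """Tiny branch-and-bound: heuristics guide the search, but the
    fractional-relaxation bound certifies optimality at the end."""
    n = len(values)
    # Sort by value density -- a heuristic; any order still yields a proof.
    order = sorted(range(n), key=lambda i: values[i] / weights[i], reverse=True)

    def bound(i, cap):
        # LP relaxation: allow a fractional piece of the next item.
        total = 0.0
        for j in order[i:]:
            if weights[j] <= cap:
                cap -= weights[j]
                total += values[j]
            else:
                total += values[j] * cap / weights[j]
                break
        return total

    best = 0

    def search(i, cap, val):
        nonlocal best
        if val > best:
            best = val
        if i == len(order) or val + bound(i, cap) <= best:
            return  # pruned: the bound PROVES nothing better lies here
        j = order[i]
        if weights[j] <= cap:
            search(i + 1, cap - weights[j], val + values[j])  # take item j
        search(i + 1, cap, val)  # skip item j

    search(0, capacity, 0)
    return best

print(knapsack_bb([60, 100, 120], [10, 20, 30], 50))  # -> 220
```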

        > I have a long standing hypothesis that an algorithmic solution to the global optimization problem is what lends action potentials the appearance, or essence?, of what we mean when we speak of “consciousness”.

        Sounds like woo. Protein folding in bacteria and yeast work pretty similar to how it works in humans. In fact, we can transfer genes from us to yeasts to produce many of the same proteins human produce. But you'd be hard-pressed to argue that yeast are sentient.

        This reminds me of how some people claim that soap films are super special because those films can solve optimisation problems. See eg https://highscalability.com/why-my-soap-film-is-better-than-... If you put soap film between a bunch of supports, even if the supports have complicated shapes, the soap film will tend to minimise its overall surface area.

        Of course, if you look deeper into it, and do larger scale experiments, you figure out that the soap only assumes a local minimum.

        • trivexwe 2 days ago

          > Sounds like woo.

          O, definitely woo. I tried to make that explicitly clear by using “hypothesis” and “appearance”.

          My hypothesis is less “optimization solutions == consciousness” and more positing that our brains, “action potentials” was meant as cheeky shorthand for the human brain, use an “optimization solution” that we identify as “consciousness”, or as you put it “sentience”.

          But to quote South Park, “and I base that on absolutely nothing”. ;P

    • ngrilly 2 days ago

      > As a working hypothesis, protein folding assumes that a protein folds into the globally lowest energy configuration. [...] If there's a region in configuration space with a local minimum and high enough energy 'walls', this might be stable enough for the protein to be stable.

      Sounds like gradient descent :)

      • eru 2 days ago

        Well, or hill climbing in general.
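        A minimal illustration of why the starting point matters for either method, using plain gradient descent on a 1-D double-well function (the function is arbitrary, chosen only to have one global and one merely-local minimum):

```python
def grad_descent(x, lr=0.01, steps=2000):
    """Plain gradient descent on f(x) = x**4 - 3*x**2 + x."""
    for _ in range(steps):
        x -= lr * (4 * x**3 - 6 * x + 1)  # f'(x)
    return x

f = lambda x: x**4 - 3 * x**2 + x

left, right = grad_descent(-2.0), grad_descent(2.0)
# Both runs stop at a minimum, but only one finds the global one:
# the basin you start in decides where you end up.
print(f"from -2 -> x={left:.3f}, f={f(left):.3f}")
print(f"from +2 -> x={right:.3f}, f={f(right):.3f}")
```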

    • spywaregorilla 2 days ago

      I feel like there should be a much stronger effort to solve optimization problems with ML enabled guesses. It's arguably the most important problem to be solving to improve ML itself.

      Humans, for example, can provide extremely strong guesses just by eyeballing travelling salesman problems, without doing any calculations. If we could use ML to take a problem and guess how to reformulate it with 95% of the search space cut out, we would be in a much stronger place. My gut says this should be theoretically possible, and is probably the mechanism biological learning systems use under the hood, to such great effect that it's OK to use greedy, less efficient methods for the last mile of optimization without something like backprop.
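      As a sketch of the "strong guess" idea, with a classical greedy heuristic standing in for the ML component (the instance size and seed here are arbitrary): nearest-neighbour produces a fast tour, and brute force provides the exact baseline it can be measured against.

```python
import itertools
import math
import random

def tour_len(cities, order):
    """Total length of a closed tour visiting cities in the given order."""
    return sum(math.dist(cities[order[i]], cities[order[(i + 1) % len(order)]])
               for i in range(len(order)))

def nearest_neighbor(cities):
    """Cheap 'eyeball' heuristic: always hop to the closest unvisited city."""
    todo = set(range(1, len(cities)))
    tour = [0]
    while todo:
        nxt = min(todo, key=lambda j: math.dist(cities[tour[-1]], cities[j]))
        todo.remove(nxt)
        tour.append(nxt)
    return tour

random.seed(0)
cities = [(random.random(), random.random()) for _ in range(8)]

# Exact answer by brute force (fine at n=8; hopeless at n=80).
best = min(itertools.permutations(range(1, 8)),
           key=lambda p: tour_len(cities, (0,) + p))
guess = nearest_neighbor(cities)

print("optimal  :", tour_len(cities, (0,) + best))
print("heuristic:", tour_len(cities, guess))
```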

    • dekhn 2 days ago

      The working hypothesis you described was considered fairly obsolete some time ago. The current model is much more "most proteins fold to kinetically accessible states". The assumption of global lowest energy led to a lot of wasted effort and misled computer scientists. But along the way to this understanding we learned an awful lot about the forces that affect folding- see for example "hydrophobic collapse".

  • nybsjytm 3 days ago

    I'm a mathematician but tbh I have no clue what you mean by saying that arithmetic progressions of primes are "trivial" or analogous to anything here or in machine learning.

    • trivexwe 3 days ago

      Yeah, the messaging got a little muddled, but the relation was purely analogical.

      I was trying to point to a situation where you have a clear problem: a generating function for the prime number sequence; and a solution that identifies a small subset of the intended sequence without addressing, or even informing in any substantial way, the full breadth of the original problem.

      > At the time of writing the longest known arithmetic progression of primes is of length 23, and was found in 2004 by Markus Frind, Paul Underwood, and Paul Jobling: 56211383760397 + 44546738095860 · k; k = 0, 1, ..., 22.

      The triviality was overloaded: it implies both that calculating this subset is trivial (it is a simple arithmetic progression) and that this subset of the full prime number sequence is now trivial to produce.
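      The progression from the quote is easy to check directly; a small script using deterministic Miller-Rabin (the fixed base set below is known to be deterministic for all n below roughly 3.3e24, far above these ~1e15 terms):

```python
def is_prime(n):
    """Deterministic Miller-Rabin for n < 3.3e24 with this base set."""
    if n < 2:
        return False
    bases = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37)
    for p in bases:
        if n % p == 0:
            return n == p
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for a in bases:
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False  # a witnesses that n is composite
    return True

# Every term of the Frind/Underwood/Jobling progression is prime:
terms = [56211383760397 + 44546738095860 * k for k in range(23)]
print(all(is_prime(t) for t in terms))  # True
```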

      In the same way that the Green-Tao theorem has yet to lead to a complete solution to the prime number sequence, I feel, the machine learning techniques will fail to lead to a complete solution to protein folding.

      • nybsjytm 2 days ago

        It would be very hard to make a good analogy with this since the problem of "finding" arithmetic progressions is, as far as I know, of negligible interest compared to the structural knowledge of their existence. The situation is perfectly reversed in both computational biology and machine learning. But maybe I misunderstand what you mean by "a complete solution to the prime number sequence."

  • dekhn 2 days ago

    This is a bit pedantic, but: AF says little to nothing about protein "folding". It is focused on static structure prediction. The history of this is a bit muddy but if you follow the details carefully you'll see that "protein folding" is a term that references the physical process by which proteins adopt their "final" conformations (or more accurately, interconvert between a bunch of accessible conformational states), while static structure prediction only cares about the final conformational state (possibly states).

    Although many people say "protein folding problem" that's really referring to a different and far more complex problem than static structure prediction. What is the exact trajectory that a protein follows when moving from the fully unfolded state to the final states? What forces dominate that process? How do proteins overcome large barriers so quickly? To what extent does the cost of interacting with water dominate? What are the rates at which fully folded proteins interconvert between substates? Which proteins will never fold on their own, why, and how do they get folded by other proteins?

ak_111 2 days ago

"...Some cell biologists and biochemists who used to work with structural biologists have replaced them with AlphaFold2 — and take its predictions as truth. Sometimes scientists publish papers featuring protein structures that, to any structural biologist, are obviously incorrect, Perrakis said. “And they say: ‘Well, that’s the AlphaFold structure.’”"

It is amazing that this happens. I am not naive about academic standards, but if something is clearly wrong and used in a paper (especially one with consequences for medical health), then it should be quite easy to name-and-shame until the editors of the journal force the authors to issue a retraction or correction if the authors don't do it themselves. Otherwise people should start name-and-shaming the journal, and its reputation should sink.

Also, I am curious whether there are already lists of known incorrect predictions by AlphaFold. Shouldn't these be published, and shouldn't AlphaFold's database tag such predictions accordingly, to notify users that those particular predictions are proven wrong?

  • Mikhail_K 2 days ago

    The article states: "However, in about 10% of the instances in which AlphaFold2 was “very confident” about its prediction (a score of at least 90 out of 100 on the confidence scale), it shouldn’t have been, he reported: The predictions didn’t match what was seen experimentally."

    This is the number one issue with using so-called "deep learning": the results may be completely wrong, and there is no known way to predict when they will be or to detect when they are (relying on "deep learning" alone).

    The worse issue is that with "deep learning" we learn only the coefficients that give accurate predictions on a training set. Extrapolating the results is a hopeful leap of faith that is known to break down catastrophically on some inputs. The "neural nets" do not give us new knowledge, but rather an attractive nuisance of a tool.

    • bamboozled 2 days ago

      This seems like it should be bigger news than it is?

      • Mikhail_K a day ago

        Oh it will be big enough when the AI stocks bubble pops.

DrScientist 2 days ago

It is/was a brilliant piece of work (Nobel prize level) - however, I think the impact is over-hyped. As somebody who has experimentally solved a protein structure, I can tell you that knowing the structure doesn't necessarily help you understand the biology - not every structure is as functionally obvious as the structure of DNA, for example.

In terms of drug discovery - even assuming the models are as good as experimental structures, you only get the same benefit as experimental structures - which have helped small molecule drug discovery - but I would argue not transformed it. All the existing challenges with structure based drug design remain.

BTW while Alphafold 2 was a big step forward from Alphafold(1) - it wasn't a complete shock as Alphafold 1 had already topped the charts in a previous competition a couple of years earlier.

  • mensetmanusman 2 days ago

    “ knowing the structure doesn't necessarily help you understand the biology ”

    I think most everyone recognizes this, but also believes it is probably necessary to know the structure to understand the biology in the future. I.e., a necessary yet insufficient advancement, which is Nobel-worthy.

nybsjytm 3 days ago

Great article, covers well both the achievements and the shortcomings. It's crazy how many people write about these kinds of AI developments while completely skipping over anything like the following:

> The “good news is that when AlphaFold thinks that it’s right, it often is very right,” Adams said. “When it thinks it’s not right, it generally isn’t.” However, in about 10% of the instances in which AlphaFold2 was “very confident” about its prediction (a score of at least 90 out of 100 on the confidence scale), it shouldn’t have been, he reported: The predictions didn’t match what was seen experimentally.

> That the AI system seems to have some self-skepticism may inspire an overreliance on its conclusions. Most biologists see AlphaFold2 for what it is: a prediction tool. But others are taking it too far. Some cell biologists and biochemists who used to work with structural biologists have replaced them with AlphaFold2 — and take its predictions as truth. Sometimes scientists publish papers featuring protein structures that, to any structural biologist, are obviously incorrect, Perrakis said. “And they say: ‘Well, that’s the AlphaFold structure.’” ...

> Jones has heard of scientists struggling to get funding to determine structures computationally. “The general perception is that DeepMind did it, you know, and why are you still doing it?” Jones said. But that work is still necessary, he argues, because AlphaFold2 is fallible.

> “There are very large gaps,” Jones said. “There are things that it can’t do quite clearly.”

  • savingsPossible 2 days ago

    > However, in about 10% of the instances in which AlphaFold2 was “very confident” about its prediction (a score of at least 90 out of 100 on the confidence scale)

    I wonder what that confidence score means... If it is 90% probability, then we'd expect it to be wrong 10% of the time
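    If the score were a calibrated probability, being wrong ~10% of the time at 90% confidence is exactly what you'd expect by definition. A quick simulation of a hypothetical, perfectly calibrated 90%-confident predictor:

```python
import random

random.seed(1)

# A perfectly calibrated predictor that reports 90% confidence is,
# by definition, still wrong on ~10% of those predictions.
N = 100_000
wrong = sum(random.random() >= 0.90 for _ in range(N))
print(f"wrong: {wrong / N:.1%}")  # ~10%
```

    Whether AlphaFold2's 0-100 scale is meant as a probability in this sense is exactly the open question here.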

  • TheBozzCL 3 days ago

    Tale as old as ML: people don’t understand it’s an assistance tool, and instead assume it’s always right.

    Crazy how we tend to wave away even error rates of 1% or less. One in a thousand is a lot.

    • ben_w 2 days ago

      It really depends on the context; (Some) LLMs look impressive specifically because the error rate is comparable to a high score on an exam… the mistake is that even if it was a straight-A student (it's too alien to be that) then it would still only be a student, and we don't put fresh graduates in charge of everything important.

      I don't have the domain knowledge to even guess how good 90% is in molecular biology research.

      • olddustytrail 2 days ago

        > we don't put fresh graduates in charge of everything important.

        We put dribbling halfwit morons in charge of everything. I'm thinking of Liz Truss as UK Prime Minister, but I'm sure most countries have their own examples.

        • ben_w 2 days ago

          "Anything" != "Everything", and examples like the Iceberg Lady are usually followed with "and that's why they went bankrupt, so don't do that".

    • jagrsw 2 days ago

      This tool is designed to be helpful right now. Looking ahead, there's no reason why AI can't eventually match, or even surpass, human intelligence across the board.

      Whether it's advancements in LLMs, with features like long-term memory, or breakthroughs in other areas of ML, it's not guaranteed that humans will remain needed in the research process.

      • Mikhail_K 2 days ago

        > Looking ahead, there's no reason why AI can't eventually match, or even surpass, human intelligence across the board.

        There is a reason, actually: what is presently called an "AI" has no concern for the truth. It is a bullshit machine that aims to mimic the right answer.