Curated by THEOUTPOST
On Thu, 7 Nov, 4:01 PM UTC
2 Sources
[1]
Novel AI framework incorporates experimental data and text-based narratives to accelerate search for new proteins
Harnessing the power of artificial intelligence (AI) and the world's fastest supercomputers, a research team led by the U.S. Department of Energy's (DOE) Argonne National Laboratory has developed an innovative computing framework to speed up the design of new proteins.

On the heels of this year's Nobel Prize in Chemistry, which recognized advances in computational protein design, Argonne's AI-driven approach has been selected as a finalist for the prestigious Gordon Bell Prize. Presented by the Association for Computing Machinery, the annual prize recognizes breakthroughs in using high performance computing to solve complex science problems.

One of the key innovations of the team's MProt-DPO framework is its ability to integrate different types of data streams, or "multimodal data." It combines traditional protein sequence data with experimental results, molecular simulations and even text-based narratives that provide detailed insights into each protein's properties. This approach has the potential to accelerate protein discovery for a wide range of applications.

"Say you want to build a new vaccine or design an enzyme that can break down plastics for recycling in an environmentally friendly way," said Arvind Ramanathan, Argonne computational biologist. "Our AI framework can help researchers zero in on promising proteins from countless possibilities, including candidates that may not exist in nature."

Navigating the vast protein design space

Mapping a protein's amino acid sequence to its structure and function is a long-standing research challenge. Each unique arrangement of amino acids -- the building blocks of proteins -- can yield different properties and behaviors. The sheer volume of potential variations makes it impractical to test them all through experiments alone. To put this in perspective, varying just three positions in a sequence, each of which can hold any of the 20 standard amino acids, creates 8,000 (20 x 20 x 20) possible combinations.
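The 8,000 figure follows from a simple count: if each varied position can hold any of the 20 standard amino acids, then k freely varied positions give 20^k distinct sequences. The sketch below works through that arithmetic. It assumes the standard 20-letter alphabet and independent choices at each position; the function names are illustrative and are not part of MProt-DPO.

```python
import math

# Assumption: the 20 standard amino acids, each allowed at every varied position.
NUM_AMINO_ACIDS = 20


def design_space_size(varied_positions: int) -> int:
    """Number of distinct sequences when `varied_positions` sites vary freely."""
    if varied_positions < 0:
        raise ValueError("varied_positions must be non-negative")
    return NUM_AMINO_ACIDS ** varied_positions


def order_of_magnitude(varied_positions: int) -> float:
    """Base-10 logarithm of the design space size."""
    return varied_positions * math.log10(NUM_AMINO_ACIDS)


print(design_space_size(3))              # 8000
print(round(order_of_magnitude(77), 1))  # 100.2
```

Under these assumptions, 77 varied positions already push the count past 10^100, which is why exhaustive testing is out of reach.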
But most proteins are far more complex, with some research targets containing hundreds to thousands of amino acids. "For example, if we change the position of 77 amino acids within a 300-amino-acid protein, we're looking at a design space of a googol, or 10^100, unique possibilities," said Gautham Dharuman, Argonne computational scientist and lead author on a paper introducing the framework. "This is why we need large language models and supercomputers to help explore this vast space in a reasonable amount of time."

Large language models (LLMs), which form the basis of chatbots like ChatGPT, are AI models that are trained on large amounts of data to detect patterns and generate new information. In the realm of science, LLMs help researchers sift through massive datasets, providing insights and predictions for complex problems like protein design.

Leveraging AI and exascale computing power

Building and training the framework's LLMs required using powerful supercomputers, including the Aurora exascale system at the Argonne Leadership Computing Facility (ALCF). The ALCF is a DOE Office of Science user facility.

"The language models we trained are on the order of a few billion parameters," said Venkat Vishwanath, AI and machine learning team lead at the ALCF. "Supercomputers are crucial not only for training and fine-tuning the models, but also for running the end-to-end workflow. This includes performing large-scale simulations to verify the stability and catalytic activity of the generated protein sequences."

In addition to Aurora, the team deployed their framework on other top systems: Frontier at DOE's Oak Ridge National Laboratory, Alps at the Swiss National Supercomputing Centre, Leonardo at the CINECA center in Italy and the PDX machine at NVIDIA. They achieved over one exaflop of sustained performance (mixed precision) on each machine, with a peak performance of 5.57 exaflops on Aurora.
The Argonne system recently earned the top spot in a measure of AI performance, achieving 10.6 exaflops on the HPL-MxP benchmark. Surpassing an exaflop, which equals a quintillion calculations per second, highlights the immense computational power required for this effort.

"By adapting our workflow to run on multiple top supercomputers spanning diverse architectures, we've demonstrated the framework's portability and scalability," Vishwanath said. "This was important because it shows that our tool can be used by researchers regardless of the machine or location."

Learning from preferred outcomes

The DPO in MProt-DPO stands for Direct Preference Optimization. The DPO algorithm helps AI models improve by learning from preferred or unpreferred outcomes. By adapting DPO for protein design, the Argonne team enabled their framework to learn from experimental feedback and simulations as they happen.

"If you think about how ChatGPT works, humans provide feedback on whether a response is helpful or not. That input is looped back into the training algorithm to help the model learn your preferences," Ramanathan said. "MProt-DPO works in a similar way, but we replace human feedback with the experimental and simulation data to help the AI model learn which protein designs are most successful."

While generative AI techniques like LLMs have been developed for biological systems, existing tools have been limited by their inability to incorporate multimodal data. MProt-DPO, however, includes experimental data and text-based narratives that give added context to each protein's behavior. This approach builds on earlier work by Ramanathan and colleagues, who created a text-guided protein design framework.

"Our motivation was to create a framework that can use LLMs and an end-to-end workflow to generate protein sequences with specific properties of interest such as fitness or catalytic activity," Dharuman said. "DPO then uses these measures as feedback to align the LLMs, enabling them to generate more preferred outcomes in the subsequent iterations. We employed supercomputers to show that we can greatly reduce the time-to-solution by incorporating this feedback in the design process."

Ramanathan noted that using experimental data also helps improve the trustworthiness of their AI models. "Bringing validated results into the design loop helps prevent the models from hallucinating wild or unrealistic sequences," he said. "This results in more reliable protein designs."

The team tested MProt-DPO on two tasks to demonstrate its ability to handle complex protein design challenges. First, they focused on the yeast protein HIS7, using experimental data to improve the performance of various mutations. For the second task, they worked on malate dehydrogenase, an enzyme that plays a key role in how cells produce energy. Using simulation data, they optimized the design of the enzyme to improve its catalytic efficiency. The team is collaborating with Argonne biologists to validate the AI-generated designs in a laboratory, where initial tests have shown they are performing as expected.

Paving the way for AuroraGPT and autonomous discovery

The creation of MProt-DPO is also helping to advance Argonne's broader AI for science and autonomous discovery initiatives. The tool's use of multimodal data is central to the ongoing efforts to develop AuroraGPT, a foundation model designed to aid in autonomous scientific exploration across disciplines.

"Demonstrating that this approach delivers strong scientific results at extreme scales is an important step toward building more robust AI models," Ramanathan said. "It also moves us closer to autonomous discovery, where AI can help streamline not only experiments but the entire scientific process."
[2]
Argonne team breaks new ground in AI-driven protei | Newswise
Using the MProt-DPO framework, scientists created synthetic versions of malate dehydrogenase that preserve the protein's critical structure and key binding areas.

The team's research was supported by the DOE Office of Science's Advanced Scientific Computing Research program and the National Institutes of Health. Additional team members include Argonne's Kyle Hippe, Alexander Brace, Sam Foreman, Väinö Hatanpää, Varuni K. Sastry, Huihuo Zheng, Logan Ward, Servesh Muralidharan, Archit Vasan, Bharat Kale, Carla M. Mann, Heng Ma, Murali Emani, Michael E. Papka, Ian Foster and Rick Stevens; Yun-Hsuan Cheng, Yuliana Zamora and Tom Gibbs from NVIDIA; Shengchao Liu from the University of California, Berkeley; Chaowei Xiao from the University of Wisconsin-Madison; Mahidhar Tatineni from the San Diego Supercomputer Center; Deepak Canchi, Jerome Mitchell, Koichi Yamada and Maria Garzaran from Intel; and Anima Anandkumar from the California Institute of Technology.

The Argonne Leadership Computing Facility provides supercomputing capabilities to the scientific and engineering community to advance fundamental discovery and understanding in a broad range of disciplines. Supported by the U.S. Department of Energy's (DOE's) Office of Science, Advanced Scientific Computing Research (ASCR) program, the ALCF is one of two DOE Leadership Computing Facilities in the nation dedicated to open science.

Argonne National Laboratory seeks solutions to pressing national problems in science and technology by conducting leading-edge basic and applied research in virtually every scientific discipline. Argonne is managed by UChicago Argonne, LLC for the U.S. Department of Energy's Office of Science.

The U.S. Department of Energy's Office of Science is the single largest supporter of basic research in the physical sciences in the United States and is working to address some of the most pressing challenges of our time. For more information, visit https://energy.gov/science.
Researchers at Argonne National Laboratory have developed MProt-DPO, an innovative AI framework that accelerates protein design by integrating multimodal data and leveraging supercomputers, achieving exascale performance.
Researchers at the U.S. Department of Energy's Argonne National Laboratory have developed an AI framework called MProt-DPO that aims to speed up protein design. The approach combines artificial intelligence with exascale computing power to accelerate the discovery and creation of new proteins for various applications [1][2].
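One of the key innovations of MProt-DPO is its ability to integrate different types of data streams, or "multimodal data." The framework combines traditional protein sequence data, experimental results, molecular simulations, and text-based narratives that describe each protein's properties.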
This comprehensive approach allows researchers to explore a vast number of protein possibilities more efficiently than ever before [1].
MProt-DPO utilizes large language models (LLMs) similar to those powering chatbots like ChatGPT. These AI models are trained on massive datasets to detect patterns and generate new information. The framework's LLMs, containing billions of parameters, required the use of powerful supercomputers for training and deployment [1][2].
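The team used multiple top supercomputing systems: Aurora at the Argonne Leadership Computing Facility, Frontier at Oak Ridge National Laboratory, Alps at the Swiss National Supercomputing Centre, Leonardo at the CINECA center in Italy, and the PDX machine at NVIDIA.
The framework achieved over one exaflop of sustained mixed-precision performance on each machine, with a peak performance of 5.57 exaflops on Aurora [1].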
The DPO in MProt-DPO stands for Direct Preference Optimization, an algorithm that enables AI models to learn from preferred or unpreferred outcomes. In the context of protein design, this allows the framework to keep improving by learning from experimental feedback and simulations as the results come in [1][2].
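To make the idea of learning from preferred and unpreferred outcomes more concrete, here is a minimal sketch in two steps: measured scores are turned into preference pairs, and each pair is scored with the commonly published form of the DPO loss. This is an illustration under stated assumptions, not the Argonne team's implementation. The sequences, scores, log-probabilities, the `min_gap` threshold and the `beta` value are all made up, and whether MProt-DPO uses this exact loss unchanged is an assumption.

```python
import math
from itertools import combinations


def build_preference_pairs(scores, min_gap=0.0):
    """Turn measured scores into (preferred, unpreferred) sequence pairs.

    `scores` maps a sequence to a measurement (for example fitness);
    `min_gap` is a hypothetical threshold for ignoring near-ties.
    """
    pairs = []
    for (seq_a, score_a), (seq_b, score_b) in combinations(scores.items(), 2):
        if abs(score_a - score_b) <= min_gap:
            continue  # scores too close to rank reliably
        pairs.append((seq_a, seq_b) if score_a > score_b else (seq_b, seq_a))
    return pairs


def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one pair: -log(sigmoid(beta * margin)).

    `*_w` are log-probabilities of the preferred sequence and `*_l` of the
    unpreferred one, under the model being trained (policy) and a frozen
    reference model.
    """
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Made-up sequences and scores, for illustration only.
measured = {"MKTAYIAK": 0.91, "MKTAYIAR": 0.40, "MKTGYIAK": 0.88}
print(build_preference_pairs(measured, min_gap=0.1))
# [('MKTAYIAK', 'MKTAYIAR'), ('MKTGYIAK', 'MKTAYIAR')]

# A policy identical to the reference has zero margin, so the loss is log(2).
print(round(dpo_loss(-12.0, -12.0, -12.0, -12.0), 3))  # 0.693
```

The loss falls below log(2) when the policy raises the preferred sequence's likelihood relative to the reference more than it raises the unpreferred one's, which is the direction training pushes the model.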
The MProt-DPO framework has the potential to accelerate protein discovery for a wide range of applications, including new vaccines and enzymes that break down plastics for environmentally friendly recycling.
By enabling researchers to explore a vast design space of protein possibilities, including candidates that may not exist in nature, this AI-driven approach could lead to significant breakthroughs in various fields [1][2].
The Argonne team's work has been selected as a finalist for the prestigious Gordon Bell Prize, which recognizes breakthroughs in using high-performance computing to solve complex scientific problems. This achievement, coming on the heels of the 2024 Nobel Prize in Chemistry for advances in computational protein design, underscores the growing importance of AI and supercomputing in biological research [1][2].
As the framework continues to develop, it may pave the way for more autonomous scientific discovery, where AI can streamline not only experiments but the entire scientific process, potentially leading to faster and more efficient breakthroughs in protein engineering and related fields [2].