Understanding and learning to speak the language of life
The structure of proteins is similar to a language. Their code can be deciphered using language models.
The Cerebras CS-2 system at the Leibniz Supercomputing Centre helps bioinformaticians decipher the code of proteins. This can be used to develop new cures or tackle pressing environmental problems.
Proteins are the building blocks of life. They determine the shape, structure and function of cells, tissues and organs, as well as the metabolism and growth of humans, animals and plants. Decades of computer-based analyses have shown that the vast majority of proteins are built from a repertoire of about 20 amino acids. Although billions of amino acid combinations and variations of protein sequences have now been collected in databases, how these proteins drive growth or influence cell functions is still largely a mystery. And searching the bulging databases takes far too long. But artificial intelligence (AI) and pattern recognition, and more recently models for processing human language, are helping to crack the code of life. They are also speeding up the search: "Language models learn patterns and similarities in sequences directly from protein databases," explains Dr. Michael Heinzinger, bioinformatician at the Technische Universität München (TUM) and member of the Rost Lab at the Department of Bioinformatics and Computational Biology headed by Prof. Burkhard Rost. "Traditional statistical pattern recognition usually takes a long time and does not work equally well for all proteins. Language models shorten the search and analysis process and give us new tools to improve our understanding of proteins."
Language models crack the protein code
Searching for specific protein structures based on recurring patterns can take days, if not weeks. About four years ago, Prof. Burkhard Rost's team recognised the analogy between human language and proteins: the 20 most important amino acids act like letters, forming words and sentences, or rather proteins and sequences with their own functions. The Rost Lab believes these can be deciphered with smart natural language processing programmes. Instead of the usual Wikipedia articles, ELMo (Embeddings from Language Models) was fed and trained on protein sequences. The result was SeqVec (Sequence-to-Vector), the first intelligent model for processing the protein code, which no longer searches databases for patterns but picks up the protein code directly and automatically transfers it to new proteins. Following its lead, researchers around the world trained large language models (LLMs) and developed better, more efficient models for proteins with new functionalities. New technologies for AI processes fuelled this development.
The graphic illustrates how language models work, assigning identifiers to amino acids, which, like letters, combine into sequences or words.
In particular, transformers, which allow the computer to convert strings of letters and characters into mathematical vectors, have expanded the possibilities for training NLP models. For protein analysis, the TUM Chair of Bioinformatics published several transformers trained on protein sequences, called ProtTrans. These first protein code models were initially created on the department's own computer cluster as well as on the AI systems of the Leibniz-Rechenzentrum (LRZ), i.e. on parallel Graphics Processing Units (GPUs), processors specially optimised for machine learning. Since last year, the scientists at the LRZ have also had a Cerebras CS-2 system at their disposal. "This is a completely different AI system," says Heinzinger. "With its large chip and high memory capacity, it simplifies many steps in distributed training. For example, we no longer have to worry about communication between processors and nodes." Models can be optimised more quickly and, according to the initial experience of Heinzinger and his colleagues, training with large amounts of data is also accelerated. However, fundamental changes or innovations to a model remain comparatively difficult. "But that is only a matter of time: the CS-2 system requires further components for deployment in research, and with these comes a revised software stack," says Nicolay Hammer, a PhD astrophysicist and head of the Big Data & AI team at LRZ.
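To make this concrete, the following minimal sketch shows how per-residue embeddings can be extracted from a ProtTrans-style transformer. It is an illustration, not part of the article: the checkpoint name (the publicly released ProtT5 encoder on Hugging Face) and the preprocessing steps are assumptions based on the Rostlab's published examples.

```python
# Minimal sketch (not from the article): per-residue embeddings from the
# publicly released ProtT5 encoder. Checkpoint name and preprocessing are
# assumptions following the Rostlab examples on Hugging Face.
import re
import torch
from transformers import T5Tokenizer, T5EncoderModel

checkpoint = "Rostlab/prot_t5_xl_half_uniref50-enc"     # encoder-only ProtT5-XL
tokenizer = T5Tokenizer.from_pretrained(checkpoint, do_lower_case=False)
model = T5EncoderModel.from_pretrained(checkpoint).eval()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"           # arbitrary example sequence
sequence = re.sub(r"[UZOB]", "X", sequence)              # map rare amino acids to X
tokens = tokenizer(" ".join(sequence), return_tensors="pt")  # each residue is a "word"

with torch.no_grad():
    embeddings = model(**tokens).last_hidden_state       # shape: (1, length + 1, 1024)

per_residue = embeddings[0, : len(sequence)]             # drop the special end token
per_protein = per_residue.mean(dim=0)                    # one fixed-length vector per protein
print(per_residue.shape, per_protein.shape)
```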
The new, better and more specialised language models will also decipher the syntax and grammar of the protein code and show how and why proteins fold into three-dimensional structures and what they do. "We can use larger language models not only to better understand proteins, but also to specifically manipulate or rewrite them to meet the challenges of the 21st century," says Heinzinger. "Due to their wide range of functions, proteins are indispensable in many pharmacological and biotechnological processes. They are used, for example, to produce drugs and, more recently, biofuels or materials that break down plastics or bind carbon. And anyone who masters the code of proteins can use them to create an unlimited number of new molecules or substances."
Computing power shortens training
Language and protein structures are similar, but of course there are also differences. In order to teach computers and smart language processing the peculiarities of proteins, the specialists at the Rost Lab first confronted ELMo and, later, transformer-based language models with data sets that now contain 2.3 billion or more protein sequences. Normally, language models register combinations of letters and words; here they register which amino acids follow one another. Similar to a cloze text in school, the programmes then fill in artificially created blanks and thus demonstrate that they have grasped the structure of the proteins. Where the human brain searches for missing words with intuition and a sense of meaning, the machine ranks solutions according to their statistical distribution: it names different candidates for the gap according to the probabilities with which they occur in proteins. "The better computers can read a protein sequence, the better they can understand its structures and functions," says Heinzinger. Step by step, parameters, i.e. the criteria for selection and analysis, are then adjusted or added.
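The cloze principle can be sketched in a few lines of code. The toy example below (an illustration, not the Rost Lab's actual training code) masks residues in a protein sequence and ranks candidate amino acids for each gap using a simple bigram statistic; a real protein language model conditions on the entire sequence context with a neural network.

```python
# Toy sketch of the cloze (masked prediction) idea on protein sequences.
# All sequences are invented; this is not the Rost Lab's training code.
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard amino acids

def mask_sequence(seq, mask_rate=0.15, mask_token="_"):
    """Create a cloze task: hide a fraction of residues and remember the answers."""
    n_masked = max(1, int(len(seq) * mask_rate))
    positions = random.sample(range(len(seq)), n_masked)
    masked, targets = list(seq), {}
    for pos in positions:
        targets[pos] = masked[pos]
        masked[pos] = mask_token
    return "".join(masked), targets

def rank_candidates(corpus, masked_seq, pos, k=3):
    """Rank amino acids for one gap by how often they follow the same left
    neighbour in the corpus (a bigram statistic)."""
    left = masked_seq[pos - 1] if pos > 0 else None
    counts = Counter()
    for s in corpus:
        for i in range(1, len(s)):
            if left is None or s[i - 1] == left:
                counts[s[i]] += 1
    total = sum(counts.values()) or 1
    ranked = [(aa, counts[aa] / total) for aa, _ in counts.most_common()
              if aa in AMINO_ACIDS]
    return ranked[:k]

if __name__ == "__main__":
    corpus = ["MKTAYIAKQR", "MKSLVIAKQG", "MKTLYLAGQR"]   # invented toy sequences
    masked, answers = mask_sequence(corpus[0])
    print("cloze input:", masked)
    for pos, true_aa in answers.items():
        print(f"position {pos}: true {true_aa}, guesses {rank_candidates(corpus, masked, pos)}")
```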
Several dozen training runs are the norm. In addition, the models and neural networks keep growing in size: while SeqVec comprised 93 million parameters, the latest large model from the Rost Lab, ProtT5, already contains three billion. "Training a large language model is expensive; each run takes longer because the data sets grow, but so does the number of parameters. This requires computing power and energy," says bioinformatician Heinzinger. "But in the best case, users in biotechnology, pharmacology or medicine end up with a useful model for their own analyses." Depending on the available computing power and the size of the models, training runs can take hours or several weeks. Data is repeatedly fed through processors and memory units, recombined, evaluated, checked, stored and further processed. While the LRZ AI cluster has 68 GPUs connected in parallel, each with between 16 and 80 gigabytes of dynamic random access memory (DRAM), the Cerebras system offers a single chip on which a total of 850,000 compute cores share around 40 gigabytes of memory. This is connected to HPE Superdome Flex servers, which contribute a further 12 terabytes of random access memory (RAM) and about 100 terabytes of data storage. So the system adds up to computing power that would otherwise require dozens of GPUs. On this supercomputer, data can flow quickly between compute and storage units. "Training ProtT5 would no longer be possible on the LRZ AI systems; it would take years," says Heinzinger. "In contrast to pattern recognition, which would have to search a database again and again for each new protein, language models are trained directly on the raw data from a database. The protein codes captured in this way can then be applied directly to new, different proteins."
He is currently trying to implement different, established language models on the Cerebras CS-2 system and to test their suitability for questions from biology and other life sciences. However, this is less about analysing proteins than about getting to grips with a new technology and adapting the system to his own requirements.
"Language models open up new possibilities, such as the generation of texts or protein sequences," says Heinzinger. Specialists can not only build new substances, materials or medicines with the decoded protein sequences, once trained, models like ProtT5 also offer a basis for the development of software: for example, for the analysis of organisms and tissue, for the recognition and treatment of mutations or diseases, and also for the development and design of new organic substances and materials. For this, too, the high effort of the training is worthwhile. (vs)
Dr. Michael Heinzinger, bioinformatician in the Rost Lab at TUM