Our main research interest is to understand the rules of protein folding. This requires (1) classifying all known protein folds in terms of pairwise structure similarities and (2) finding out how the information contained in amino acid sequences is translated into three-dimensional structures, and how these structures endow proteins with their characteristic chemical functions and biological roles. At the Division of Bioinformatics we address these problems in two major research projects. The first is COPS, our Compendium of Protein Structures. In the second project we use the known structures to deduce the atomic and molecular forces that are responsible for the folding and stability of protein structures. We then use the resulting energy functions to find and correct errors in experimental protein structures and to predict and compute protein structures from their amino acid sequences. Many interesting problems remain to be solved, and many as yet undetected rules and features remain to be investigated. The research projects below are suitable for a master's thesis. Students will become acquainted with the tools and methods of bioinformatics, including (indispensable) programming skills in Python/C. Further information on COPS and our energy functions can be found in our recent publications.
The evolution of sequences and structures
The evolution of proteins is driven by changes in amino acid sequences. At present we know a subset of the rules of protein evolution (structure is more conserved than sequence, similar sequences generally have similar structures, etc.), but the picture is still incomplete. The goal of this project is to use sequence and structure alignment techniques to quantify the differences between the conservation of sequences on the one hand and the conservation of structures on the other. Based on our COPS database, this will allow us to characterize and quantify how protein structures adapt to changes in amino acid sequences. This is likely to reveal new insights into the relationship between amino acid sequences and protein structures, and as such it will be a significant contribution to structural biology and bioinformatics.
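As a rough illustration of what "quantifying" the two kinds of conservation can mean, the following minimal sketch computes a sequence identity from a pairwise alignment and a root-mean-square deviation (RMSD) from two sets of equivalent atom coordinates. The function names and the toy data are assumptions made for illustration only; they are not part of COPS, and the RMSD function assumes that the two structures have already been superposed by a structure alignment tool.

```python
import math


def sequence_identity(aligned_a, aligned_b, gap="-"):
    """Fraction of identical residues among alignment columns without gaps."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns in which both sequences contribute a residue.
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != gap and b != gap]
    if not pairs:
        return 0.0
    return sum(a == b for a, b in pairs) / len(pairs)


def rmsd(coords_a, coords_b):
    """RMSD between equivalent atoms of two ALREADY SUPERPOSED structures."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


# Toy data (hypothetical): 4 of 5 gap-free columns are identical.
identity = sequence_identity("ACDE-FG", "ACNEQF-")
# Toy data (hypothetical): every atom is displaced by one unit along z.
deviation = rmsd([(0, 0, 0), (1, 0, 0)], [(0, 0, 1), (1, 0, 1)])
print(identity, deviation)
```

Collecting both numbers for many pairs of related proteins would give a simple picture of how structural deviation grows as sequence identity falls; other similarity measures can be substituted without changing the overall approach.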
The evolution of protein domains
Proteins can be decomposed into structurally compact subunits called domains. These domains are generally thought to be independent folding units (i.e., the structure depends only on the sequence of the respective domain and is independent of other parts of the protein chain). Moreover, domains frequently duplicate and are exchanged between proteins by recombination processes. Although the structures of such domains are very similar, their sequences often seem to be unrelated, even if they are contained in the same protein. On the other hand, since the diverged sequences encode very similar structures, there must be some common information on the sequence level that gives rise to this structural similarity. Furthermore, the topic of protein domains is strongly connected to protein function. Frequently, domains are defined on the level of chains (or tertiary structure). However, numerous examples show that distinct compact subunits of protein structures can only be assigned when the physiologically functional protein (quaternary structure, biological assembly) is considered. This project will address many intriguing questions using COPS and the bioinformatics tools at our disposal:
- Why do amino acid sequences change so quickly after domain duplication?
- What is the common sequence information that determines the related structures of domains whose sequences have strongly diverged?
- What can we learn about the evolution of protein structures from domains derived from biological assemblies?
- When building structural domains from biological assemblies, it is crucial that the underlying data is not erroneous. How can we detect faulty biological assemblies? How can correct biological assemblies be built? Is it possible to detect faulty domain decompositions with existing tools?
The experimental determination of protein structures by X-ray crystallography or Nuclear Magnetic Resonance (NMR) spectroscopy is still a tedious and time-consuming task. Hence, much effort in bioinformatics has gone into the development of algorithms that predict protein structure from sequence. One commonly adopted principle is to identify a protein of known structure that is expected to resemble the structure of the query sequence. The structure found (the template) is then used to build a model of the unknown structure (the target). Many algorithms for structure prediction rely on a high sequence similarity between target and template, but with the growth of protein structure databases it has become clear that many protein structures are similar even though their sequences have only marginal similarity. Fold recognition techniques try to find template structures that would be missed by common sequence search methods. The goal of this master's thesis is to integrate an existing fold recognition algorithm into the framework of COPS and to apply it to the prediction of various interesting protein structures.
Protein structure and function
A basic assumption in protein function prediction is that similar structure implies similar function. In COPS we have many examples of similar structures at various degrees of sequence similarity. The topic of this work is an analysis of such cases. At what degree of structural similarity, and how often, is it possible to derive function? Does sequence similarity play a major role here? One application is the annotation of targets from Structural Genomics projects. There are many Structural Genomics targets of unknown function that share structural similarity with already annotated structures. Is it possible to annotate these proteins of unknown function using structural similarity?
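One possible way to approach the question "at what degree of structural similarity and how often" is to group pairs of annotated structures by their similarity and to count how often both members of a pair share the same function. The sketch below is only an illustration under stated assumptions: the similarity score on a 0 to 1 scale, the bin edges, the function labels and the data are all hypothetical and do not reflect the similarity measure or the annotations actually used in COPS.

```python
from bisect import bisect_right


def function_agreement(pairs, edges=(0.5, 0.7, 0.9)):
    """Fraction of structure pairs with identical function per similarity bin.

    pairs: iterable of (similarity, function_a, function_b).
    edges: ascending bin boundaries (hypothetical values); a score equal to
           an edge falls into the higher bin.
    Returns one fraction per bin, or None for bins without any pair.
    """
    same = [0] * (len(edges) + 1)
    total = [0] * (len(edges) + 1)
    for similarity, func_a, func_b in pairs:
        index = bisect_right(edges, similarity)
        total[index] += 1
        if func_a == func_b:
            same[index] += 1
    return [s / t if t else None for s, t in zip(same, total)]


# Hypothetical toy data: (structural similarity, function A, function B).
toy_pairs = [
    (0.95, "kinase", "kinase"),
    (0.92, "kinase", "kinase"),
    (0.75, "kinase", "hydrolase"),
    (0.72, "lyase", "lyase"),
    (0.40, "kinase", "lyase"),
]
print(function_agreement(toy_pairs))
```

A bin in which the fraction stays close to 1 would suggest that function can be transferred with some confidence at that level of structural similarity; repeating the analysis separately for pairs of high and low sequence similarity would address the role of sequence.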
COPS and sequence databases
Only a portion of all currently available protein sequences have a known three-dimensional structure. The topic of this work is to use the COPS classification to provide a list of sequences that have no significant sequence similarity to any sequence of known structure, thus covering all sequences with no detectable structural relative. The list changes every week because COPS is kept in sync with the Protein Data Bank, which is updated weekly. Such a list provides difficult sequences for computational methods like fold recognition and protein structure prediction, as well as for experiments like Structural Genomics and CASP. A particularly interesting application is the analysis of the human proteome with respect to how much structural information can currently be obtained for its sequences.
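The core of such a list can be sketched in a few lines, assuming that a sequence search against all sequences of known structure has already been run and that its results are available as (query identifier, E-value) pairs. The function name, the identifiers and the E-value cutoff of 1e-3 are hypothetical choices for illustration; what counts as "significant" similarity would have to be defined as part of the thesis.

```python
def sequences_without_structural_relative(all_ids, hits, max_evalue=1e-3):
    """Return the identifiers that have no significant hit to a known structure.

    all_ids:    identifiers of all sequences under consideration.
    hits:       iterable of (query_id, e_value) from a search against
                sequences of known structure (format is an assumption).
    max_evalue: hits with an E-value at or below this cutoff count as
                significant (hypothetical default).
    """
    covered = {query_id for query_id, e_value in hits if e_value <= max_evalue}
    return sorted(set(all_ids) - covered)


# Hypothetical toy data: P1 has a significant hit, P2 only a weak one,
# and P3 has no hit at all.
toy_ids = ["P1", "P2", "P3"]
toy_hits = [("P1", 1e-30), ("P2", 0.5)]
uncovered = sequences_without_structural_relative(toy_ids, toy_hits)
print(uncovered)
print(1 - len(uncovered) / len(toy_ids))  # fraction with a structural relative
```

Because the set of known structures grows every week, the search results and therefore the list have to be regenerated after each update. Applied to the sequences of a complete proteome, the same calculation gives the fraction for which structural information is currently within reach.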
Peaks and pitfalls in local protein structure
The goal of this master's thesis is to identify unrealistic molecular environments in protein structures (by applying an already implemented method) and to *thoroughly* check whether these unrealistic environments are due to errors in the process of protein structure determination, to format errors in the corresponding Protein Data Bank files, or to rarely occurring chemical restraints (non-standard bonds, ligands, experimental conditions like pH value and temperature, ...). The master's student working on this thesis should have a strong interest in organic chemistry and in literature research on the analyzed proteins. The first steps of this work will include a guided introduction to the method to be applied and the selection of a number of unrealistic physico-chemical environments in protein structures that will be analyzed in detail.