Deep learning neural network approaches for one-dimensional structure prediction from protein sequences
Shamima Banu Syed Mohamed Rashid
Date of Issue: 2018
School of Computer Science and Engineering
Bioinformatics Research Centre
Proteins are macromolecules that carry out important processes in the cells of living organisms, such as signalling, transport, cell replication and catalysis performed by enzymes. Loss of protein activity through problems caused by mutations or misfolding results in malfunction and is the cause of many diseases. In particular, membrane proteins, of which the α-helical transmembrane (TM) proteins form the majority of folds, are important drug targets involved in critical roles such as cell signalling and metabolism, amongst other functions. A protein comprises a chain of amino acid residues that folds into a specific 3-D conformation to achieve its specified function. However, because it is difficult and resource intensive to obtain the structures of proteins experimentally, it is desirable to predict protein structures directly from sequence. Over the past few decades, machine learning methods have been increasingly used as a complement to experimental approaches to predict structures from protein sequences. The accurate prediction of protein secondary structures and, in the case of membrane proteins, the correct detection of the residues' environment preference as well as the topological orientation with respect to the membrane are intermediary steps that aid full 3-D structure determination.
Current protein Secondary Structure (SS) prediction methods, such as neural network based approaches, report cross-validated accuracies of about 81% for single predictors and up to 85% for consensus-based models that combine individual predictors. Nevertheless, the theoretical upper limit for SS prediction is estimated to be 88-90%, and reaching it remains a challenge. As even small errors at the secondary structure level can compound into large deviations in the final tertiary structures, there is a strong motivation to develop better techniques for SS prediction.
Similarly, for the prediction of membrane protein (MP) topology, earlier works regularly reported topology scores of 80% or higher. With the increasing size and complexity of topological formations seen in current membrane protein datasets, however, the scores recently reported by individual predictors are typically around 60% to 70%, although consensus-based methods have reported up to 80%. A common feature of both the SS and MP topology prediction problems is the representation of protein sequences as Position Specific Scoring Matrices (PSSM). While PSSMs carry the advantage of evolutionary information, it is difficult to represent a sequence accurately with a PSSM because of the challenges of the sliding window scheme and the overlapping signals found in the sequence space. Additionally, in the case of SS prediction, cross-validation based models are trained on large numbers of proteins, which may result in a loss of generalization. Here, it is of interest to obtain a more efficient training set that may generalize well to unseen datasets. Ways of extracting higher-level features from the PSSM by using Deep Belief Networks are explored. Additionally, an energy-based feature representation procedure is employed, in which a protein sequence is encoded with probabilities derived from energy potentials using the previously developed Cα-Cβ-Side Group protein model (CABS) algorithm. Finally, the use of the complex plane in SS prediction is also explored. Complex-valued neurons have been shown to be computationally powerful in classification tasks, and are used here to predict protein secondary structures as well as to devise a heuristics-based procedure for building a training model distinct from conventionally employed cross-validation approaches.
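The sliding window scheme mentioned above can be illustrated with a minimal sketch. It assumes a PSSM stored as one row of scores per residue, a symmetric window, and zero padding at the sequence termini; the function name `window_features`, the `half_width` parameter and the toy matrix are hypothetical and are not taken from the thesis.

```python
from typing import List


def window_features(pssm: List[List[float]], half_width: int) -> List[List[float]]:
    """Flatten the PSSM rows inside a window centred on each residue.

    Positions that fall outside the sequence are zero-padded (an assumption;
    other padding schemes are possible).
    """
    n_cols = len(pssm[0]) if pssm else 0
    pad = [0.0] * n_cols
    features = []
    for i in range(len(pssm)):
        row: List[float] = []
        for j in range(i - half_width, i + half_width + 1):
            # Use the real PSSM row when j is inside the sequence, else padding.
            row.extend(pssm[j] if 0 <= j < len(pssm) else pad)
        features.append(row)
    return features


# Toy matrix: 4 residues x 2 score columns (a real PSSM has 20 columns).
toy = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
encoded = window_features(toy, half_width=1)
```

Each residue is then described by `(2 * half_width + 1) * n_cols` values. Neighbouring windows share most of their rows, which is one way to picture the overlapping signals described above.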
The proposed works and main findings in this thesis are:
(i) the prediction of membrane protein topology and the detection of signal peptide and globular sequences, using a Hierarchical Deep Ensemble Network (HiDEN) with PSSM-based encoding;
(ii) Deep Belief Networks (DBN) applied to predict protein secondary structures from PSSM-based encoding and CABS algorithm based feature encoding;
(iii) the prediction of protein secondary structures using a Fully Complex-valued Relaxation Network (FCRN) with cross-validated training models together with CABS algorithm based feature encoding;
(iv) the development of a heuristics-based procedure to determine a small training set of proteins, known as the Compact Model, whose accuracy is similar to that of cross-validated models, yet which remains capable of generalizing to new datasets;
(v) structural analysis of predicted SS to detect the role of hydrogen bonds in residue misclassification rates. In particular, the effect of water-mediated vs. peptide-backbone hydrogen bonding is investigated, with the finding that residues with distinct hydrogen bonding patterns prove a challenge for many existing predictors, even if such residues are included in the training set.
The findings of the thesis can be extended in several ways. Firstly, for the use of PSSM-based encoding, the proposed models may be improved by incorporating other sequence and structure-based properties of interest, such as the amino acid composition and the accessible surface areas of residues. Secondly, in the case of MP structure prediction, the current three- or four-state residue classification system could be extended to cover more features of interest, such as kinks and re-entrant helices, that remain unexplored. Lastly, the heuristics-based procedure to obtain the Compact Model can be extended in a systematic way by applying automated sample selection strategies, such that the best learning model for any given training set is selected automatically.
DRNTU::Engineering::Computer science and engineering::Computer applications::Life and medical sciences