Revisiting the Protein Fold Recognition Problem Using Machine Learning Approaches

 

Background

Elucidating protein structure is key to understanding how biological systems work. Among the challenges in structural biology, predicting protein folds has become a focal point, with implications for drug design, disease understanding, and functional annotation. Recent advances in machine learning (ML) techniques, coupled with the extraction of diverse features from protein sequences, have substantially advanced the field of protein fold prediction.

Protein Fold Recognition Problem

Protein fold recognition can be formulated as a multi-class classification problem. Each fold represents a class, and the goal is to assign a given protein to one of these classes based on features extracted from its amino acid sequence. In machine learning terms:

  • Classes: Each protein fold is treated as a class (about 1000 folds).
  • Features: Various features derived from the protein's primary structure (amino acid sequence) are used as input for the classification model.
  • Objective: The objective is to train a model that, given a new or unseen protein sequence, accurately predicts the class (fold) to which it belongs.
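
The classes, features, and objective listed above can be illustrated with a deliberately small sketch. It is not one of the methods studied in this project: amino acid composition is used as a stand-in feature, and a nearest-centroid rule is used as a stand-in classifier. The function names, the toy sequences, and the labels fold_A and fold_B are hypothetical and invented for illustration only.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(sequence):
    """Return the 20-dimensional amino acid composition of a sequence."""
    counts = Counter(sequence.upper())
    # Non-standard symbols are ignored; "or 1" avoids division by zero.
    total = sum(counts[a] for a in AMINO_ACIDS) or 1
    return [counts[a] / total for a in AMINO_ACIDS]


def train_centroids(sequences, folds):
    """Average the feature vectors of each fold: one centroid per class."""
    sums, sizes = {}, Counter(folds)
    for seq, fold in zip(sequences, folds):
        acc = sums.setdefault(fold, [0.0] * len(AMINO_ACIDS))
        for i, value in enumerate(composition(seq)):
            acc[i] += value
    return {fold: [v / sizes[fold] for v in acc] for fold, acc in sums.items()}


def predict_fold(centroids, sequence):
    """Assign the fold whose centroid is closest (squared Euclidean distance)."""
    vec = composition(sequence)

    def dist(fold):
        return sum((a - b) ** 2 for a, b in zip(vec, centroids[fold]))

    return min(centroids, key=dist)


# Toy data, invented for illustration (not real proteins or real folds).
train_seqs = ["AAAALLLLEEEE", "ALELALELAAKK", "VVVTTTYYYVVV", "TVTVYVTVYYII"]
train_folds = ["fold_A", "fold_A", "fold_B", "fold_B"]

centroids = train_centroids(train_seqs, train_folds)
prediction = predict_fold(centroids, "LLAAEEKLLAAE")
print(prediction)
```

In a realistic setting the two training folds would be replaced by roughly a thousand classes, and the composition vector by richer sequence-derived features, but the train-then-predict structure stays the same.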

Thesis

This project surveys the landscape of protein fold recognition by examining the ML techniques that have been applied to it and the sequence-based protein features used to improve accuracy. By synthesizing insights from a range of ML approaches and from the features that can be derived from protein sequences, it aims to provide a comprehensive overview of the current state of the art in protein fold prediction.

The main focus is the geometric kernel data fusion (GKF) method for protein fold recognition. The project also examines a weighted version of GKF and compares its performance with that of the best deep learning-based approaches for predicting protein folds. It further highlights the need for a comprehensive benchmark for evaluating these methods. Through a critical analysis of methodologies and their applications, the project aims to contribute to the collective understanding of protein folding and to offer insights into future directions of this rapidly evolving field.
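
To give a feel for what fusing kernels geometrically can look like, the sketch below combines several kernel matrices with a weighted log-Euclidean mean, which is one common way to define a geometric mean of symmetric positive definite matrices. Whether this matches the exact GKF formulation and weighting scheme should be checked against [1]; the function names, the eigenvalue floor `eps`, and the toy 2x2 matrices are assumptions made for illustration.

```python
import numpy as np


def spd_log(K, eps=1e-10):
    """Matrix logarithm of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(K)
    # Assumed safeguard: floor tiny or negative eigenvalues before the log.
    w = np.clip(w, eps, None)
    return (V * np.log(w)) @ V.T


def spd_exp(S):
    """Matrix exponential of a symmetric matrix."""
    w, V = np.linalg.eigh(S)
    return (V * np.exp(w)) @ V.T


def log_euclidean_mean(kernels, weights=None):
    """Weighted log-Euclidean mean: exp(sum_i w_i * log(K_i))."""
    n = len(kernels)
    if weights is None:
        weights = np.full(n, 1.0 / n)  # unweighted case
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # normalize so the weights sum to 1
    S = sum(w * spd_log(K) for w, K in zip(weights, kernels))
    return spd_exp(S)


# Toy kernel matrices, invented for illustration only.
K1 = np.array([[2.0, 0.0], [0.0, 8.0]])
K2 = np.array([[8.0, 0.0], [0.0, 2.0]])

fused = log_euclidean_mean([K1, K2])
weighted = log_euclidean_mean([K1, K2], weights=[0.75, 0.25])
print(fused)
print(weighted)
```

In practice each kernel matrix would be computed from a different sequence-derived feature representation of the same set of proteins, and the fused matrix would be passed to a kernel classifier such as a support vector machine with a precomputed kernel.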

Candidate

No prior knowledge of biology is needed. The ideal candidate has an interest in biology, computational biology, and bioinformatics, as well as experience in computer science or informatics. Proficiency in Python, R, MATLAB, or another programming language is needed to carry out the work in this thesis.

Supervisor

If you are interested, feel free to contact Pooya Zakeri for further information about this project. Tasks may be developed in a variety of ways, depending on the applicant.

 

Literature

[1] https://doi.org/10.1093/bioinformatics/btu118

[2] https://doi.org/10.1093/bioinformatics/btz040

[3] https://doi.org/10.1186/s12859-020-3504-z

Published Oct. 9, 2023 22:03 - Last modified Oct. 9, 2023 22:03

Supervisor(s)

Scope (credits)

60