Identifying Protein Folds through Matrix Factorization Approaches

Background

Protein fold recognition is the task of assigning a protein to a structural class, or fold, typically when its actual 3D structure is unknown. Machine learning (ML) methods approach this task by extracting features or representations from protein sequences and classifying the sequences into known fold classes. Recent protein structure prediction methods, such as DeepMind's AlphaFold, yield impressive results, but they focus solely on predicting structure and offer little insight into the folding process. Moreover, even state-of-the-art fold recognition algorithms struggle to make accurate predictions for challenging protein targets, particularly those with few or no homologous references. The precise identification of protein folds therefore remains an important open problem, and solving it well could enable a rapid understanding of protein function.

Protein Fold Recognition Problem

Protein fold recognition can be formulated as a multi-class classification problem. In this context, each fold represents a class, and the goal is to assign a given protein to one of these classes based on features extracted from its sequence.

In machine learning terms:

  • Classes: Each protein fold is treated as a class (about 1000 folds).
  • Features: Various features derived from the protein's primary structure (amino acid sequence) are used as input for the classification model.
  • Objective: The objective is to train a model that, given a new or unseen protein sequence, accurately predicts the class (fold) to which it belongs.
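
The formulation above can be illustrated with a minimal sketch. It assumes amino acid composition as the only feature and a nearest-centroid rule as the classifier; both are stand-ins chosen for brevity, not the features or models this project prescribes. The sequences and the fold labels `fold_a` and `fold_b` are made-up toy data.

```python
from collections import Counter
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(sequence):
    """Return the 20-dimensional amino acid composition of a sequence."""
    counts = Counter(sequence)
    total = sum(counts[a] for a in AMINO_ACIDS) or 1  # ignore non-standard symbols
    return [counts[a] / total for a in AMINO_ACIDS]


def train_centroids(sequences, folds):
    """Average the feature vectors of all training sequences in each fold."""
    sums, sizes = {}, Counter()
    for seq, fold in zip(sequences, folds):
        acc = sums.setdefault(fold, [0.0] * len(AMINO_ACIDS))
        for i, value in enumerate(composition(seq)):
            acc[i] += value
        sizes[fold] += 1
    return {fold: [v / sizes[fold] for v in acc] for fold, acc in sums.items()}


def predict_fold(sequence, centroids):
    """Assign the fold whose centroid is closest in Euclidean distance."""
    vec = composition(sequence)
    return min(centroids, key=lambda fold: math.dist(vec, centroids[fold]))


# Toy training set (hypothetical sequences and labels).
train_seqs = ["AAAAALLLLL", "AALLAALLAA", "GGGGSSSSGG", "GSGSGSGSGS"]
train_folds = ["fold_a", "fold_a", "fold_b", "fold_b"]

centroids = train_centroids(train_seqs, train_folds)
prediction = predict_fold("LLLAAALLAA", centroids)
print(prediction)
```

In practice the feature vector would be replaced by richer sequence-derived descriptors, and the number of classes would be on the order of the roughly 1000 folds mentioned above rather than two.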

Thesis

Most protein fold prediction models address fewer than 100 folds, far fewer than the total number of identified protein folds. Furthermore, common approaches to protein fold recognition often overlook the relationships between folds. To address these gaps, this project formulates protein fold recognition as the factorization of an incompletely filled binary protein-fold matrix, with the aim of predicting its unknown entries. In particular, the SCOP (Structural Classification of Proteins) database can be viewed as an incomplete M×N matrix in which each row represents a protein and each column a protein fold.
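
As a rough illustration of this matrix view, the sketch below builds a small protein-fold matrix from toy annotations. The protein identifiers, the fold labels, and the rule that an annotated protein gets 0 for every other fold are assumptions made for the example; they are not taken from SCOP itself.

```python
# Hypothetical toy annotations: protein id -> known fold label.
known_folds = {"P1": "a.1", "P2": "b.1", "P3": "a.1"}
proteins = ["P1", "P2", "P3", "P4"]  # P4 has no fold annotation yet
folds = ["a.1", "b.1", "c.1"]


def build_matrix(proteins, folds, known_folds):
    """Return an M x N list of lists holding 1, 0, or None (unobserved)."""
    matrix = []
    for protein in proteins:
        label = known_folds.get(protein)
        if label is None:
            # Nothing is known about this protein: the whole row is missing.
            matrix.append([None] * len(folds))
        else:
            # Assumption: one fold per protein, so all other entries are 0.
            matrix.append([1 if fold == label else 0 for fold in folds])
    return matrix


matrix = build_matrix(proteins, folds, known_folds)

# Observed entries as (row, column, value) triples, the usual input
# format for matrix completion methods.
observed = [
    (i, j, value)
    for i, row in enumerate(matrix)
    for j, value in enumerate(row)
    if value is not None
]
print(matrix)
```

The row for `P4` consists only of missing entries; predicting such entries is the matrix completion task described above.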

Matrix factorization approaches such as Bayesian Probabilistic Matrix Factorization (BPMF) and Non-Negative Matrix Factorization (NMF) allow us to complete the protein-fold matrix. This project focuses on matrix factorization techniques that enable the integration of side information into the process of factorization. In particular, many sequence-based protein features, such as evolutionary and predicted structural information, can be employed as side information to guide the factorization process.
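
To make the role of side information concrete, here is a minimal sketch of one simple way to use it. It is neither BPMF nor NMF: it is a plain squared-error factorization trained by stochastic gradient descent, in which each protein's latent vector is a linear function of its side features. The function names, hyperparameters, and toy data are all hypothetical, and the code uses plain Python only to stay dependency-free.

```python
import random


def factorize(observed, features, n_folds, rank=2, lr=0.1, reg=0.01,
              epochs=1000, seed=0):
    """Fit score(i, j) = (x_i W) . v_j on the observed (i, j, y) triples."""
    rng = random.Random(seed)
    n_feat = len(features[0])
    # W maps side features to the latent space; V holds fold latent vectors.
    W = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_feat)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_folds)]
    for _ in range(epochs):
        for i, j, y in observed:
            x = features[i]
            # Protein latent vector induced by its side features.
            u = [sum(x[d] * W[d][k] for d in range(n_feat)) for k in range(rank)]
            err = sum(u[k] * V[j][k] for k in range(rank)) - y
            for k in range(rank):
                v_jk = V[j][k]  # keep the old value for the W update
                V[j][k] -= lr * (err * u[k] + reg * v_jk)
                for d in range(n_feat):
                    W[d][k] -= lr * (err * x[d] * v_jk + reg * W[d][k])
    return W, V


def score(x, W, V):
    """Predicted fold scores for a protein with side features x."""
    rank = len(V[0])
    u = [sum(x[d] * W[d][k] for d in range(len(x))) for k in range(rank)]
    return [sum(u[k] * v[k] for k in range(rank)) for v in V]


# Toy data: four proteins with 2-dimensional side features, two folds.
# The fourth protein has no observed entries but resembles proteins 0 and 2.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.9, 0.1]]
observed = [(0, 0, 1), (0, 1, 0), (1, 0, 0), (1, 1, 1), (2, 0, 1), (2, 1, 0)]

W, V = factorize(observed, features, n_folds=2)
scores = [score(x, W, V) for x in features]
print(scores[3])
```

Because the latent vector of a protein is computed from its features, the model can score a protein whose row in the matrix is entirely unobserved, which a factorization without side information cannot do.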

Candidate

No prior knowledge of biology is needed. The ideal candidate is a student with an interest in biology, computational biology, or bioinformatics and with experience in computer science or informatics. Proficiency in Python, R, Matlab, or another programming language is needed to carry out the tasks of this thesis. Knowledge of Julia is beneficial but not required.

Supervisor

If you are interested, feel free to contact Pooya Zakeri for further information about this project. The tasks can be tailored in various ways, depending on the applicant.

Literature

[1] https://doi.org/10.1093/bioinformatics/bty289

[2] https://doi.org/10.1093/bioinformatics/btu118

Published Oct. 9, 2023 22:07 - Last modified Oct. 9, 2023 22:07

Scope (credits)

60