Protein function prediction using machine learning methods

We are developing machine learning-based methods to predict protein functions.

Development of a biosynthetic gene cluster database with functional annotations

The Reference Sequence Database (RefSeq) of the National Center for Biotechnology Information (NCBI) contains more than 300 million protein sequences. In contrast, SwissProt, the protein function database within the UniProt Knowledgebase (UniProtKB) maintained by the European Bioinformatics Institute (EBI), contains only about 570,000 protein entries with expert-curated functional annotations. In this study, we aim to predict the functions of the vast number of proteins whose functions are unknown and to find novel enzymes that catalyze useful reactions.
We have developed a database that stores the results of function prediction and is publicly accessible through a Web server at http://sr.iu.a.u-tokyo.ac.jp/ (Fig. 1). In addition to function predictions based on sequence similarity, the Web server provides structural models predicted by AlphaFold. We plan to extend the database with function predictions from a machine learning-based method that we are currently developing.
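As a rough illustration of what function prediction based on sequence similarity can look like, the sketch below transfers the annotation of the best-scoring database hit to a query protein. It is not the pipeline behind the database: the `Hit` record, the `transfer_annotation` function, and the identity and E-value thresholds are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hit:
    """One similarity-search hit (hypothetical record layout)."""
    query_id: str      # protein whose function is unknown
    subject_id: str    # annotated protein it matched
    identity: float    # percent identity, 0 to 100
    evalue: float      # expectation value of the alignment
    annotation: str    # functional annotation of the matched protein


def transfer_annotation(
    hits: List[Hit],
    min_identity: float = 40.0,   # assumed threshold, not from the source
    max_evalue: float = 1e-5,     # assumed threshold, not from the source
) -> Optional[str]:
    """Return the annotation of the best hit that passes both thresholds.

    "Best" means the lowest E-value, with higher identity breaking ties.
    Returns None when no hit is similar enough to justify a transfer.
    """
    passing = [
        h for h in hits
        if h.identity >= min_identity and h.evalue <= max_evalue
    ]
    if not passing:
        return None
    best = min(passing, key=lambda h: (h.evalue, -h.identity))
    return best.annotation
```

A protein with no hit passing the thresholds is left unannotated here, which is the kind of case a machine learning-based predictor would be expected to address.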

Fig. 1: Overview of the database. The database stores the results of sequence similarity searches of the protein sequences in MIBiG, a biosynthetic gene cluster database, against the sequences in PDB and SwissProt. Each protein entry in the database is linked to the corresponding enzymatic reaction data in Rhea, if available. On the Web server, a user can search the database not only by MIBiG accession or Protein ID, but also by the name or a substructure of a compound that participates in the enzymatic reaction.
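The search options described in the caption can be pictured with a small in-memory model. This is a minimal sketch under stated assumptions, not the schema of the actual database: the `ProteinEntry` fields, the `search` function, and every identifier in `EXAMPLE_ENTRIES` are hypothetical placeholders. Substructure search is omitted because it would require a cheminformatics toolkit.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ProteinEntry:
    """One protein record (hypothetical layout, not the real schema)."""
    mibig_accession: str                 # gene cluster the protein belongs to
    protein_id: str                      # identifier of the protein itself
    reaction_ids: List[str] = field(default_factory=list)    # linked reactions
    compound_names: List[str] = field(default_factory=list)  # reaction participants


def search(
    entries: List[ProteinEntry],
    mibig_accession: Optional[str] = None,
    protein_id: Optional[str] = None,
    compound_name: Optional[str] = None,
) -> List[ProteinEntry]:
    """Return the entries that match every criterion that was supplied.

    Accession and protein ID must match exactly; the compound name is a
    case-insensitive substring match against the participant names.
    """
    results = []
    for entry in entries:
        if mibig_accession is not None and entry.mibig_accession != mibig_accession:
            continue
        if protein_id is not None and entry.protein_id != protein_id:
            continue
        if compound_name is not None:
            needle = compound_name.lower()
            if not any(needle in name.lower() for name in entry.compound_names):
                continue
        results.append(entry)
    return results


# Placeholder data: none of these identifiers refer to real records.
EXAMPLE_ENTRIES = [
    ProteinEntry("CLUSTER_A", "PROT_1", ["REACTION_X"], ["Examplin A", "water"]),
    ProteinEntry("CLUSTER_A", "PROT_2"),  # no linked reaction available
    ProteinEntry("CLUSTER_B", "PROT_3", ["REACTION_Y"], ["examplin B"]),
]
```

For instance, `search(EXAMPLE_ENTRIES, compound_name="examplin")` returns the two entries that have a linked reaction involving a compound whose name contains "examplin", while `search(EXAMPLE_ENTRIES, mibig_accession="CLUSTER_A")` returns both proteins in that cluster, including the one with no linked reaction.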