PROVAL: A framework for comparison of protein sequence embeddings

Philipp Väth*, Maximilian Münch, Christoph Raab, F. M. Schleif

*Corresponding author for this work

Research output: Contribution to journal › Article › Academic › peer-review

8 Citations (Scopus)

Abstract

High-throughput sequencing technology has led to a significant increase in the number of generated protein sequences, and the anchor database UniProt doubles in size approximately every two years. This large set of annotated data is used by many bioinformatics algorithms. Searching within these databases, typically without using any annotations, is challenging because the entries vary in length and the comparison measures used are non-standard. A promising strategy to address these issues is to find fixed-length, information-preserving representations of the variable-length protein sequences. Surprisingly, however, a systematic algorithmic evaluation of the proposed approaches is missing. In this work, we analyze how different algorithms perform in generating general protein sequence representations and provide a thorough evaluation framework, PROVAL. The strategies range from a proximity representation using the classical Smith–Waterman algorithm to state-of-the-art embedding techniques based on transformer networks. The methods are evaluated on, among other criteria, molecular function classification, embedding space visualization, computational complexity, and carbon footprint.
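To make the idea of a Smith–Waterman proximity representation concrete, the following is a minimal sketch and not the PROVAL implementation. It assumes a simple match/mismatch score and a linear gap penalty (practical protein alignment typically uses a substitution matrix such as BLOSUM62 and affine gaps), and the function names `smith_waterman_score` and `proximity_representation` are hypothetical. Each variable-length sequence is mapped to a fixed-length vector of its local alignment scores against a fixed set of reference sequences.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of strings a and b.

    Assumption: simple match/mismatch scoring with a linear gap penalty,
    chosen for brevity; these parameter values are illustrative only.
    """
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def proximity_representation(sequences, references):
    """Map each sequence to a vector of scores against the reference set.

    The vector length equals len(references), regardless of sequence length.
    """
    return [[smith_waterman_score(s, r) for r in references] for s in sequences]


if __name__ == "__main__":
    refs = ["ACD", "CD"]  # hypothetical reference sequences
    print(proximity_representation(["ACD", "AC"], refs))
```

With these toy inputs, every sequence becomes a two-dimensional vector because there are two references. The choice and number of reference sequences determines the dimensionality and is left open here.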

Original language: English
Article number: 100044
Journal: Journal of Computational Mathematics and Data Science
Volume: 3
DOIs
Publication status: Published - Jun-2022

Keywords

  • Deep learning
  • Dissimilarity representation
  • Protein sequence embeddings
  • Representation learning
  • Smith–Waterman
