TY - JOUR
T1 - PROVAL
T2 - A framework for comparison of protein sequence embeddings
AU - Väth, Philipp
AU - Münch, Maximilian
AU - Raab, Christoph
AU - Schleif, F. M.
N1 - Publisher Copyright: © 2022 The Author(s)
PY - 2022/6
Y1 - 2022/6
N2 - High-throughput sequencing technology leads to a significant increase in the number of generated protein sequences, and the anchor database UniProt doubles in size approximately every two years. This large set of annotated data is used by many bioinformatics algorithms. Searching within these databases, typically without using any annotations, is challenging due to the variable lengths of the entries and the non-standard comparison measures used. A promising strategy to address these issues is to find fixed-length, information-preserving representations of the variable-length protein sequences. A systematic algorithmic evaluation of the proposed approaches is, however, surprisingly missing. In this work, we analyze how different algorithms perform in generating general protein sequence representations and provide a thorough evaluation framework, PROVAL. The strategies range from a proximity representation using the classical Smith–Waterman algorithm to state-of-the-art embedding techniques based on transformer networks. The methods are evaluated by, e.g., molecular function classification, embedding space visualization, computational complexity, and carbon footprint.
KW - Deep learning
KW - Dissimilarity representation
KW - Protein sequence embeddings
KW - Representation learning
KW - Smith–Waterman
UR - http://www.scopus.com/inward/record.url?scp=85147154632&partnerID=8YFLogxK
U2 - 10.1016/j.jcmds.2022.100044
DO - 10.1016/j.jcmds.2022.100044
M3 - Article
AN - SCOPUS:85147154632
SN - 2772-4158
VL - 3
JO - Journal of Computational Mathematics and Data Science
JF - Journal of Computational Mathematics and Data Science
M1 - 100044
ER -