TY - JOUR
T1 - SELFIES and the future of molecular string representations
AU - Krenn, Mario
AU - Ai, Qianxiang
AU - Barthel, Senja
AU - Carson, Nessa
AU - Frei, Angelo
AU - Frey, Nathan C.
AU - Friederich, Pascal
AU - Gaudin, Théophile
AU - Gayle, Alberto Alexander
AU - Jablonka, Kevin Maik
AU - Lameiro, Rafael F.
AU - Lemm, Dominik
AU - Lo, Alston
AU - Moosavi, Seyed Mohamad
AU - Nápoles-Duarte, José Manuel
AU - Nigam, Akshat Kumar
AU - Pollice, Robert
AU - Rajan, Kohulan
AU - Schatzschneider, Ulrich
AU - Schwaller, Philippe
AU - Skreta, Marta
AU - Smit, Berend
AU - Strieth-Kalthoff, Felix
AU - Sun, Chong
AU - Tom, Gary
AU - Falk von Rudorff, Guido
AU - Wang, Andrew
AU - White, Andrew D.
AU - Young, Adamo
AU - Yu, Rose
AU - Aspuru-Guzik, Alán
N1 - Publisher Copyright: © 2022 The Author(s)
PY - 2022/10/14
Y1 - 2022/10/14
AB - Artificial intelligence (AI) and machine learning (ML) are expanding in popularity for broad applications to challenging tasks in chemistry and materials science. Examples include the prediction of properties, the discovery of new reaction pathways, or the design of new molecules. The machine needs to read and write fluently in a chemical language for each of these tasks. Strings are a common tool to represent molecular graphs, and the most popular molecular string representation, SMILES, has powered cheminformatics since the late 1980s. However, in the context of AI and ML in chemistry, SMILES has several shortcomings—most pertinently, most combinations of symbols lead to invalid results with no valid chemical interpretation. To overcome this issue, a new language for molecules was introduced in 2020 that guarantees 100% robustness: SELF-referencing embedded string (SELFIES). SELFIES has since simplified and enabled numerous new applications in chemistry. In this perspective, we look to the future and discuss molecular string representations, along with their respective opportunities and challenges. We propose 16 concrete future projects for robust molecular representations. These involve the extension toward new chemical domains, exciting questions at the interface of AI and robust languages, and interpretability for both humans and machines. We hope that these proposals will inspire several follow-up works exploiting the full potential of molecular string representations for the future of AI in chemistry and materials science.
UR - https://www.scopus.com/pages/publications/85139834285
U2 - 10.1016/j.patter.2022.100588
DO - 10.1016/j.patter.2022.100588
M3 - Review article
AN - SCOPUS:85139834285
SN - 2666-3899
VL - 3
JO - Patterns
JF - Patterns
IS - 10
M1 - 100588
ER -