TY - JOUR
T1 - Regression-Based Active Learning for Accessible Acceleration of Ultra-Large Library Docking
AU - Marin, Egor
AU - Kovaleva, Margarita
AU - Kadukova, Maria
AU - Mustafin, Khalid
AU - Khorn, Polina
AU - Rogachev, Andrey
AU - Mishin, Alexey
AU - Guskov, Albert
AU - Borshchevskiy, Valentin
PY - 2024/4/8
Y1 - 2024/4/8
N2 - Structure-based drug discovery is a process for both hit finding and optimization that relies on a validated three-dimensional model of a target biomolecule, used to rationalize the structure–function relationship for this particular target. An ultralarge virtual screening approach has emerged recently for rapid discovery of high-affinity hit compounds, but it requires substantial computational resources. This study shows that active learning with simple linear regression models can accelerate virtual screening, retrieving up to 90% of the top-1% of the docking hit list after docking just 10% of the ligands. The results demonstrate that it is unnecessary to use complex models, such as deep learning approaches, to predict the imprecise results of ligand docking with a low sampling depth. Furthermore, we explore active learning meta-parameters and find that constant batch size models with a simple ensembling method provide the best ligand retrieval rate. Finally, our approach is validated on the ultralarge size virtual screening data set, retrieving 70% of the top-0.05% of ligands after screening only 2% of the library. Altogether, this work provides a computationally accessible approach for accelerated virtual screening that can serve as a blueprint for the future design of low-compute agents for exploration of the chemical space via large-scale accelerated docking. With recent breakthroughs in protein structure prediction, this method can significantly increase accessibility for the academic community and aid in the rapid discovery of high-affinity hit compounds for various targets.
AB - Structure-based drug discovery is a process for both hit finding and optimization that relies on a validated three-dimensional model of a target biomolecule, used to rationalize the structure–function relationship for this particular target. An ultralarge virtual screening approach has emerged recently for rapid discovery of high-affinity hit compounds, but it requires substantial computational resources. This study shows that active learning with simple linear regression models can accelerate virtual screening, retrieving up to 90% of the top-1% of the docking hit list after docking just 10% of the ligands. The results demonstrate that it is unnecessary to use complex models, such as deep learning approaches, to predict the imprecise results of ligand docking with a low sampling depth. Furthermore, we explore active learning meta-parameters and find that constant batch size models with a simple ensembling method provide the best ligand retrieval rate. Finally, our approach is validated on the ultralarge size virtual screening data set, retrieving 70% of the top-0.05% of ligands after screening only 2% of the library. Altogether, this work provides a computationally accessible approach for accelerated virtual screening that can serve as a blueprint for the future design of low-compute agents for exploration of the chemical space via large-scale accelerated docking. With recent breakthroughs in protein structure prediction, this method can significantly increase accessibility for the academic community and aid in the rapid discovery of high-affinity hit compounds for various targets.
U2 - 10.1021/acs.jcim.3c01661
DO - 10.1021/acs.jcim.3c01661
M3 - Article
SN - 1549-9596
VL - 64
SP - 2612
EP - 2623
JO - Journal of chemical information and modeling
JF - Journal of chemical information and modeling
IS - 7
M1 - 3c01661
ER -