The Use of Random Forest to Improve Molecular Docking Tools for Drug Design

Dr. Pedro J. Ballester
France, Cancer Research Centre of Marseille
Docking is a structure-based computational tool that can be used to predict the strength with which a small ligand molecule binds to a macromolecular target. Such binding affinity prediction is crucial for designing molecules that bind more tightly to a target and are thus more likely to provide the most efficacious modulation of its biochemical function. Despite intense research over the years, improving this type of predictive accuracy has proven to be a very challenging task for any class of method.
New scoring functions based on non-parametric machine-learning regression models, which can effectively exploit much larger volumes of experimental data and circumvent the need for a predetermined functional form, have become the most accurate at predicting the binding affinity of diverse protein-ligand complexes. This talk will review work on the inception and further development of RF-Score [1], which was the first machine-learning scoring function to achieve a substantial improvement over classical scoring functions at binding affinity prediction. RF-Score employs Random Forest (RF) regression to relate a structural description of the complex to its binding affinity. The talk will cover adequate benchmarking practices [2], expert-based versus data-driven feature selection [3], further improvements [4], how predictive accuracy is dictated not only by training set size but also by data representation and regression technique [6], and RF-Score software availability (including a user-friendly docking webserver [5] and a standalone executable for rescoring docked poses [6]). Some work has also been done on applying RF-Score to the related problem of virtual screening, e.g. a prospective virtual screening study [7]. This will be briefly discussed, and promising avenues for future work will be outlined.
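To make the phrase "structural description of the complex" concrete, the following is a minimal sketch of one possible description: a fixed-length vector counting protein-ligand atom pairs that lie within a distance cutoff, grouped by element pair. The element lists, the cutoff value, the function name `contact_counts` and the toy coordinates are all assumptions made for illustration; the abstract does not specify them, and this is not the RF-Score implementation.

```python
import math
from itertools import product

# Hypothetical element lists and cutoff, chosen for illustration only.
PROTEIN_ELEMENTS = ("C", "N", "O", "S")
LIGAND_ELEMENTS = ("C", "N", "O", "S")
CUTOFF = 12.0  # angstroms; an assumed value


def contact_counts(protein_atoms, ligand_atoms, cutoff=CUTOFF):
    """Count protein-ligand atom pairs closer than `cutoff`, per element pair.

    Each atom is a tuple (element, (x, y, z)). The result is a list ordered
    by (protein element, ligand element), so every complex maps to a vector
    of the same length regardless of its size.
    """
    pairs = list(product(PROTEIN_ELEMENTS, LIGAND_ELEMENTS))
    counts = {pair: 0 for pair in pairs}
    for p_elem, p_xyz in protein_atoms:
        for l_elem, l_xyz in ligand_atoms:
            key = (p_elem, l_elem)
            # Atoms of elements outside the assumed lists are ignored.
            if key in counts and math.dist(p_xyz, l_xyz) < cutoff:
                counts[key] += 1
    return [counts[pair] for pair in pairs]


# Toy coordinates, not taken from any real structure.
protein = [("C", (0.0, 0.0, 0.0)), ("N", (3.0, 0.0, 0.0)), ("O", (30.0, 0.0, 0.0))]
ligand = [("C", (1.0, 0.0, 0.0)), ("O", (2.0, 0.0, 0.0))]
features = contact_counts(protein, ligand)
```

One such vector per complex, paired with its measured binding affinity, could then serve as a training example for any Random Forest regression implementation (for instance scikit-learn's `RandomForestRegressor`), with no functional form imposed on how the counts relate to affinity.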
  1. Ballester, P. J. & Mitchell, J. B. O. A machine learning approach to predicting protein-ligand binding affinity with applications to molecular docking. Bioinformatics 26, 1169–1175 (2010).
  2. Ballester, P. J. & Mitchell, J. B. O. Comments on “leave-cluster-out cross-validation is appropriate for scoring functions derived from diverse protein data sets”: significance for the validation of scoring functions. J. Chem. Inf. Model. 51, 1739–1741 (2011).
  3. Ballester, P. J., Schreyer, A. & Blundell, T. L. Does a More Precise Chemical Description of Protein-Ligand Complexes Lead to More Accurate Prediction of Binding Affinity? J. Chem. Inf. Model. 54, 944–955 (2014).
  4. Li, H., Leung, K.-S., Wong, M.-H. & Ballester, P. J. Substituting random forest for multiple linear regression improves binding affinity prediction of scoring functions: Cyscore as a case study. BMC Bioinformatics 15, 291 (2014).
  5. Li, H., Leung, K.-S., Ballester, P. J. & Wong, M.-H. istar: A Web Platform for Large-Scale Protein-Ligand Docking. PLoS One 9, e85678 (2014).
  6. Li, H., Leung, K.-S., Wong, M.-H. & Ballester, P. J. Improving AutoDock Vina Using Random Forest: The Growing Accuracy of Binding Affinity Prediction by the Effective Exploitation of Larger Data Sets. Mol. Inform. 34, 115–126 (2015).
  7. Ballester, P. J. et al. Hierarchical virtual screening for the discovery of new molecular scaffolds in antibacterial hit identification. J. R. Soc. Interface 9, 3196–3207 (2012).