Using machine learning to score potential drug candidates may offer an advantage over traditional, imprecise scoring functions because the parameters and model structure can be learned from the data. However, such models may lack interpretability, are often overfit to the training data, and may not generalize to drug targets and chemotypes outside the training set. Benchmark datasets are prone to artificial enrichment and analogue bias due to the overrepresentation of certain scaffolds among the experimentally determined actives. Spatial statistics can be used to quantify a dataset's topology and better understand these potential biases. Dataset clumping, a combination of self-similarity among actives and their separation from decoys in chemical space, is associated with overoptimistic virtual screening results. This code explores methods for quantifying these potential biases and examines some common benchmark datasets.
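As a rough illustration of the kind of spatial statistic involved, the sketch below compares nearest-neighbor Tanimoto similarities among actives to those between actives and decoys; a large gap suggests a clumped, potentially biased benchmark. This is a minimal sketch, not the repository's exact implementation: it assumes RDKit is available, and the SMILES lists are hypothetical placeholders.

```python
# Minimal sketch of a "clumping" measure: self-similarity of actives vs.
# their separation from decoys. Assumes RDKit; SMILES below are hypothetical.
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def fingerprints(smiles_list, radius=2, n_bits=2048):
    """Morgan (ECFP4-like) fingerprints for a list of SMILES strings."""
    mols = [Chem.MolFromSmiles(s) for s in smiles_list]
    return [AllChem.GetMorganFingerprintAsBitVect(m, radius, nBits=n_bits)
            for m in mols if m is not None]

def nearest_neighbor_sims(query_fps, ref_fps, exclude_self=False):
    """Tanimoto similarity of each query to its nearest neighbor in ref_fps."""
    nn_sims = []
    for i, q in enumerate(query_fps):
        sims = DataStructs.BulkTanimotoSimilarity(q, ref_fps)
        if exclude_self:
            sims[i] = -1.0  # ignore self-comparison when query set == reference set
        nn_sims.append(max(sims))
    return np.array(nn_sims)

# Hypothetical example inputs
actives = ["CCOc1ccc2nc(S(N)(=O)=O)sc2c1", "CC(=O)Oc1ccccc1C(=O)O"]
decoys = ["c1ccccc1", "CCN(CC)CC", "O=C(O)c1ccccc1"]

active_fps = fingerprints(actives)
decoy_fps = fingerprints(decoys)

# A large positive gap between these two means indicates clumping.
aa = nearest_neighbor_sims(active_fps, active_fps, exclude_self=True)
ad = nearest_neighbor_sims(active_fps, decoy_fps)
print(f"mean NN(active-active): {aa.mean():.2f}")
print(f"mean NN(active-decoy):  {ad.mean():.2f}")
print(f"clumping gap:           {aa.mean() - ad.mean():.2f}")
```

This simplified gap is only one way to summarize dataset topology; published metrics combine several such nearest-neighbor terms, and the repository's own analyses should be taken as the reference.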