Deep Confident Steps to New Pockets: Strategies for Docking Generalization
Authors: Gabriele Corso, Arthur Deng, Nicholas Polizzi, Regina Barzilay, Tommi S. Jaakkola
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Therefore, we develop DOCKGEN, a new benchmark based on the ligand-binding domains of proteins, and we show that existing machine learning-based docking models have very weak generalization abilities. We carefully analyze the scaling laws of ML-based docking and show that, by scaling data and model size, as well as integrating synthetic data strategies, we are able to significantly increase the generalization capacity and set new state-of-the-art performance across benchmarks. |
| Researcher Affiliation | Academia | Gabriele Corso 1, Arthur Deng 2, Nicholas Polizzi3, Regina Barzilay1, Tommi Jaakkola1 1CSAIL, Massachusetts Institute of Technology, 2University of California, Berkeley, 3Dana-Farber Cancer Institute and Harvard Medical School |
| Pseudocode | No | The paper describes the training procedure and algorithms in paragraph text, but no formally labeled "Pseudocode" or "Algorithm" blocks are present. |
| Open Source Code | Yes | "DIFFDOCK-L, which we release publicly." (Footnote 2: "We release data, instructions, code, and weights at https://github.com/gcorso/DiffDock.") |
| Open Datasets | Yes | The majority of previous ML-based methods used the PDBBind dataset [Liu et al., 2017], a curated set of protein-ligand crystallographic structures from PDB [Berman et al., 2003], to train and test models. ... To obtain a more sizable test set without retraining the models on a reduced set, we turn to the Binding MOAD dataset [Hu et al., 2005]. ... We start from a large collection of protein structures that comprise the Protein MPNN [Dauparas et al., 2022] training set. |
| Dataset Splits | Yes | To generate the validation and test datasets of the new benchmark, we randomly divide these remaining clusters in two and then apply a number of further filtering steps (more details in Appendix A). ... This leaves us with 141 complexes in the validation and 189 complexes in the test set. ... These parameters were selected by testing the method on the 5 DOCKGEN validation clusters. |
| Hardware Specification | Yes | With these parameters and training on one NVIDIA A6000 GPU, the average run time is 8 hours. |
| Software Dependencies | No | The paper mentions software like sPyRMSD [Meli & Biggin, 2020] but does not specify version numbers for any libraries, frameworks, or programming languages used in the experiments. |
| Experiment Setup | Yes | Hyperparameters. In our experiments, we chose confidence cutoff k = 4, number of complexes sampled from PDBBind p = 100, number of complexes sampled from the buffer m = 100, number of inference samples q = 32, and maximum samples per protein/ligand pair n = 20. At the rollout step, we ran 4 inference steps each with 8 samples and computed the RMSD less than 2 Å metric with the top-ranked pose from each inference step to reduce variance in the reported metric. Additionally, we set the number of inference samples to 80 for the first bootstrapping inference step to fill the initially empty buffer. These parameters were selected by testing the method on the 5 DOCKGEN validation clusters. |
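The buffer-based bootstrapping setup quoted above can be sketched as follows. This is an illustrative reconstruction, not the released DiffDock code: `sample_pose` stands in for running reverse diffusion plus the confidence model, and all function names are hypothetical; only the hyperparameter values (k = 4, p = 100, m = 100, q = 32, n = 20) come from the paper.

```python
import random


def bootstrap_buffer(complexes, sample_pose, k=4.0, q=32, n=20):
    """Fill the self-training buffer: for each complex, draw q candidate
    poses, keep only those whose confidence score exceeds the cutoff k,
    and cap the buffer at the n most confident poses per complex."""
    buffer = {}
    for c in complexes:
        kept = []
        for _ in range(q):
            pose, confidence = sample_pose(c)
            if confidence > k:
                kept.append((pose, confidence))
        kept.sort(key=lambda t: t[1], reverse=True)  # most confident first
        buffer[c] = kept[:n]
    return buffer


def make_training_batch(pdbbind, buffer, p=100, m=100, rng=random):
    """Each training iteration mixes p complexes sampled from PDBBind
    with m complexes sampled from the confidence-filtered buffer."""
    from_pdbbind = rng.sample(pdbbind, min(p, len(pdbbind)))
    from_buffer = rng.sample(list(buffer), min(m, len(buffer)))
    return from_pdbbind + from_buffer
```

Per the quoted setup, the very first bootstrapping round would call `bootstrap_buffer` with `q=80` instead of 32 to fill the initially empty buffer.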