Tanimoto Random Features for Scalable Molecular Machine Learning
Authors: Austin Tripp, Sergio Bacallado, Sukriti Singh, José Miguel Hernández-Lobato
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimentally, we show that these random features are effective at approximating the Tanimoto coefficient of real-world datasets and are useful for molecular property prediction and optimization tasks. In this section we apply the techniques in this paper to realistic datasets of molecular fingerprints. |
| Researcher Affiliation | Academia | Austin Tripp University of Cambridge ajt212@cam.ac.uk Sergio Bacallado University of Cambridge sb2116@cam.ac.uk Sukriti Singh University of Cambridge ss2971@cam.ac.uk José Miguel Hernández-Lobato University of Cambridge jmh233@cam.ac.uk |
| Pseudocode | No | The paper does not contain explicitly labeled pseudocode or algorithm blocks. It describes mathematical derivations and methods in prose and equations rather than structured code. |
| Open Source Code | Yes | Code to reproduce all experiments is available at: https://github.com/AustinT/tanimoto-random-features-neurips23. |
| Open Datasets | Yes | We choose to study a sample of 1000 small organic molecules from the GuacaMol dataset (Brown et al., 2019; Mendez et al., 2019) which exemplify the types of molecules typically considered in drug discovery projects. Specifically, we study 5 tasks from the DOCKSTRING benchmark which entail predicting protein binding affinity from a molecular graph structure (García-Ortegón et al., 2022). |
| Dataset Splits | No | The paper mentions “training and test sets” but does not provide specific details on the percentage or number of samples used for each split, nor does it explicitly mention a distinct validation set or how it was separated. |
| Hardware Specification | No | The paper states: “The compute costs of the experiments in this paper were quite modest and were run on a single machine with no GPU usage.” However, it does not specify any detailed hardware components like CPU model, RAM, or storage, which are necessary for full reproducibility. |
| Software Dependencies | No | The paper lists software packages used: “All experiments were performed in python using the numpy (Harris et al., 2020), pytorch (Paszke et al., 2019), gpytorch (Gardner et al., 2018), and rdkit (Landrum et al., 2023) packages.” However, it does not provide specific version numbers for Python or any of these listed libraries, which is essential for reproducible software dependencies. |
| Experiment Setup | Yes | We use M = 5000 random features for all methods. Molecules were represented with both binary (B) and count (C) Morgan fingerprints... of dimension 1024. Our GP models use a constant mean and Gaussian noise. The variational parameters are optimized via natural gradient descent with a learning rate of 10^-1 and a batch size of 2M = 10,000 for one pass through the dataset. |
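For context on what the paper's random features approximate: the Tanimoto (Jaccard) coefficient for binary fingerprints, and its MinMax generalization for count fingerprints, are standard similarity measures and can be computed exactly as below. This is a minimal illustrative sketch using NumPy on toy vectors (real Morgan fingerprints would be 1024-dimensional, per the setup above); the function names are ours, not from the paper's code.

```python
import numpy as np

def tanimoto_binary(a: np.ndarray, b: np.ndarray) -> float:
    """Tanimoto similarity for binary fingerprints: |A AND B| / |A OR B|."""
    a, b = a.astype(bool), b.astype(bool)
    return np.logical_and(a, b).sum() / np.logical_or(a, b).sum()

def tanimoto_minmax(a: np.ndarray, b: np.ndarray) -> float:
    """MinMax generalization for count fingerprints: sum(min) / sum(max)."""
    return np.minimum(a, b).sum() / np.maximum(a, b).sum()

# Toy binary fingerprints: 1 shared on-bit out of 3 total on-bits -> 1/3
a = np.array([1, 1, 0, 0])
b = np.array([1, 0, 1, 0])
print(tanimoto_binary(a, b))  # 0.333...

# Toy count fingerprints: min-sum 2 over max-sum 4 -> 0.5
c = np.array([2, 1, 0])
d = np.array([1, 1, 1])
print(tanimoto_minmax(c, d))  # 0.5
```

The paper's contribution is a random-feature map whose inner products approximate these kernels at scale, avoiding the O(n^2) cost of computing them pairwise over large molecular datasets.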