Tanimoto Random Features for Scalable Molecular Machine Learning
Authors: Austin Tripp, Sergio Bacallado, Sukriti Singh, José Miguel Hernández-Lobato
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimentally, we show that these random features are effective at approximating the Tanimoto coefficient of real-world datasets and are useful for molecular property prediction and optimization tasks. In this section we apply the techniques in this paper to realistic datasets of molecular fingerprints. |
| Researcher Affiliation | Academia | Austin Tripp University of Cambridge ajt212@cam.ac.uk Sergio Bacallado University of Cambridge sb2116@cam.ac.uk Sukriti Singh University of Cambridge ss2971@cam.ac.uk José Miguel Hernández-Lobato University of Cambridge jmh233@cam.ac.uk |
| Pseudocode | No | The paper does not contain explicitly labeled pseudocode or algorithm blocks. It describes mathematical derivations and methods in prose and equations rather than structured code. |
| Open Source Code | Yes | Code to reproduce all experiments is available at: https://github.com/AustinT/tanimoto-random-features-neurips23. |
| Open Datasets | Yes | We choose to study a sample of 1000 small organic molecules from the GuacaMol dataset (Brown et al., 2019; Mendez et al., 2019) which exemplify the types of molecules typically considered in drug discovery projects. Specifically, we study 5 tasks from the DOCKSTRING benchmark which entail predicting protein binding affinity from a molecular graph structure (García-Ortegón et al., 2022). |
| Dataset Splits | No | The paper mentions “training and test sets” but does not provide specific details on the percentage or number of samples used for each split, nor does it explicitly mention a distinct validation set or how it was separated. |
| Hardware Specification | No | The paper states: “The compute costs of the experiments in this paper were quite modest and were run on a single machine with no GPU usage.” However, it does not specify any detailed hardware components like CPU model, RAM, or storage, which are necessary for full reproducibility. |
| Software Dependencies | No | The paper lists software packages used: “All experiments were performed in python using the numpy (Harris et al., 2020), pytorch (Paszke et al., 2019), gpytorch (Gardner et al., 2018), and rdkit (Landrum et al., 2023) packages.” However, it does not provide specific version numbers for Python or any of these listed libraries, which is essential for reproducible software dependencies. |
| Experiment Setup | Yes | We use M = 5000 random features for all methods. Molecules were represented with both binary (B) and count (C) Morgan fingerprints... of dimension 1024. Our GP models use a constant mean and Gaussian noise. The variational parameters are optimized via natural gradient descent with a learning rate of 10^-1 and a batch size of 2M = 10,000 for one pass through the dataset. |
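For context on what the paper's random features approximate: the Tanimoto (Jaccard) coefficient for binary fingerprints, and its MinMax generalization for count fingerprints, are standard similarity measures and can be computed exactly as below. This is a minimal illustrative sketch using NumPy on toy vectors (real Morgan fingerprints would be 1024-dimensional, per the setup above); the function names are ours, not from the paper's code.

```python
import numpy as np

def tanimoto_binary(a: np.ndarray, b: np.ndarray) -> float:
    """Tanimoto similarity for binary fingerprints: |A AND B| / |A OR B|."""
    a, b = a.astype(bool), b.astype(bool)
    return np.logical_and(a, b).sum() / np.logical_or(a, b).sum()

def tanimoto_minmax(a: np.ndarray, b: np.ndarray) -> float:
    """MinMax generalization for count fingerprints: sum(min) / sum(max)."""
    return np.minimum(a, b).sum() / np.maximum(a, b).sum()

# Toy binary fingerprints: 1 shared on-bit out of 3 total on-bits -> 1/3
a = np.array([1, 1, 0, 0])
b = np.array([1, 0, 1, 0])
print(tanimoto_binary(a, b))  # 0.333...

# Toy count fingerprints: min-sum 2 over max-sum 4 -> 0.5
c = np.array([2, 1, 0])
d = np.array([1, 1, 1])
print(tanimoto_minmax(c, d))  # 0.5
```

The paper's contribution is a random-feature map whose inner products approximate these kernels at scale, avoiding the O(n^2) cost of computing them pairwise over large molecular datasets.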