Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

BioCG: Constrained Generative Modeling for Biochemical Interaction Prediction

Authors: Amitay Sicherman, Kira Radinsky

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	BioCG achieves state-of-the-art (SOTA) performance across diverse tasks, Drug-Target Interaction (DTI), Drug-Drug Interaction (DDI), and Enzyme-Reaction Prediction, especially in data-scarce and cold-start conditions. On the BioSNAP DTI benchmark, for example, BioCG attains an AUC of 89.31% on unseen proteins, representing a 14.3 percentage point gain over prior SOTA. The paper includes dedicated sections '4 Experiments' and '6 Ablations' discussing empirical evaluations and comparisons.
Researcher Affiliation	Academia	Amitay Sicherman Department of Computer Science Technion Israel Institute of Technology Israel EMAIL, Kira Radinsky Department of Computer Science Technion Israel Institute of Technology Israel EMAIL. Both authors are affiliated with the Technion Israel Institute of Technology, which is an academic institution.
Pseudocode	No	The paper describes the methods and processes, such as I-RVQ and constrained decoding, conceptually and through diagrams (Figure 1, Figure 2). However, it does not contain a formally labeled pseudocode block or algorithm listing with structured steps.
Open Source Code	Yes	To facilitate full reproducibility of all experiments, we release our open-source code (GitHub). The complete implementation of BioCG is publicly available at GitHub.
Open Datasets	Yes	For DTI, we use the BioSNAP dataset [44] (derived from DrugBank [14]). DDI prediction is evaluated using data from DrugBank [14]. We use the CARE benchmark [41] for evaluation.
Dataset Splits	Yes	For the BioSNAP dataset, 'We follow established protocols [33, 13] for three splits. In the random split... For the unseen protein split... Symmetrically, in the unseen drug split...'. For DDI prediction, 'our cold-start drug query split involves test pairs where one drug is 'unseen' (the query, absent from training interactions) and the other is 'seen.''. For Enzyme-Reaction Prediction, 'We follow CARE's unseen reaction query split, where novel test reactions are input queries.'
Hardware Specification	Yes	Training is conducted on an L40 GPU (64GB), with the main model requiring approximately 6 hours and the meta-model around 5 minutes.
Software Dependencies	No	The paper mentions 'the standard k-means algorithm from scikit-learn [15]' but does not provide a specific version number for scikit-learn or any other software dependencies like Python or PyTorch with their versions.
Experiment Setup	Yes	Model hyperparameters were selected using a greedy search approach... The main Transformer model utilized a hidden dimension of 512... The I-RVQ module employed the standard k-means algorithm from scikit-learn [15] with 15 clusters. Both the main model and the meta-model were trained using the AdamW optimizer. The main model training ran for up to 25,000 steps, with early stopping based on performance on the validation AUC. The paper further details hyperparameters in Tables 6, 7, and 8 for the Main Model, Main Model Training, and Meta-Model, respectively.
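
Illustration: The "unseen protein" and "unseen drug" splits quoted in the Dataset Splits row rest on one general idea: every entity held out for testing must be absent from all training pairs. The Python sketch below shows that idea only. It is not the authors' code and not the protocol of the cited references; the function name, the (drug, protein, label) tuple layout, the test fraction, and the toy data are all assumptions made for illustration.

```python
import random


def unseen_entity_split(pairs, entity_index, test_fraction=0.2, seed=0):
    """Split interaction pairs so held-out entities never appear in training.

    pairs: list of (drug, protein, label) tuples (hypothetical layout).
    entity_index: 0 to hold out drugs, 1 to hold out proteins.
    """
    # Collect the distinct entities on the chosen side and shuffle them
    # deterministically so the split is repeatable for a given seed.
    entities = sorted({pair[entity_index] for pair in pairs})
    rng = random.Random(seed)
    rng.shuffle(entities)

    # Hold out at least one entity for testing.
    n_test = max(1, int(len(entities) * test_fraction))
    held_out = set(entities[:n_test])

    # A pair goes to the test set exactly when its entity is held out.
    train = [pair for pair in pairs if pair[entity_index] not in held_out]
    test = [pair for pair in pairs if pair[entity_index] in held_out]
    return train, test


# Toy data, invented for this example only.
pairs = [
    ("d1", "p1", 1), ("d1", "p2", 0), ("d2", "p1", 0),
    ("d2", "p3", 1), ("d3", "p2", 1), ("d3", "p4", 0),
]

# Unseen-protein split: hold out proteins (tuple position 1).
train, test = unseen_entity_split(pairs, entity_index=1, test_fraction=0.25)
train_proteins = {pair[1] for pair in train}
test_proteins = {pair[1] for pair in test}
```

Passing entity_index=0 instead gives the symmetric unseen-drug variant. The DDI cold-start split quoted above adds a further condition (the partner drug must be "seen"), which this sketch does not implement.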