Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
MUBen: Benchmarking the Uncertainty of Molecular Representation Models
Authors: Yinghao Li, Lingkai Kong, Yuanqi Du, Yue Yu, Yuchen Zhuang, Wenhao Mu, Chao Zhang
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present MUBen, a benchmark designed to assess the performance of UQ methods applied to molecular representation models for property prediction on various metrics, as illustrated in Figure 1. It encompasses UQ methods from different categories, including deterministic prediction, Bayesian neural networks, post-hoc calibration, and ensembles (§4.2), on top of a set of molecular representation (backbone) models, each relying on a molecular descriptor from a distinct perspective (§4.1). MUBen delivers intriguing results and insights that may guide the selection of backbone models and/or UQ methods in practice (§5). |
| Researcher Affiliation | Academia | Yinghao Li¹, Lingkai Kong¹, Yuanqi Du², Yue Yu¹, Yuchen Zhuang¹, Wenhao Mu¹, Chao Zhang¹. ¹Georgia Institute of Technology; ²Cornell University |
| Pseudocode | No | The paper describes various Uncertainty Quantification methods (Section 4.2) and their implementation details (Appendix C.1) but does not present these methods or any other procedures in structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We structure our code, available at https://github.com/Yinghao-Li/MUBen, to be user-friendly and easily transferrable and extendable, with the hope that this work will promote the future development of UQ methods, pre-trained models, or applications within the domains of materials science and drug discovery. |
| Open Datasets | Yes | We carry out our experiments on MoleculeNet (Wu et al., 2018), a collection of widely utilized datasets covering molecular properties such as quantum mechanics, solubility, and toxicity. [...] The raw datasets can be accessed on the MoleculeNet website.¹ Additionally, we employ the pre-processed versions provided by Zhou et al. (2023), using the identical dataset splits outlined in their study.² |
| Dataset Splits | Yes | In line with previous studies (Fang et al., 2022; Zhou et al., 2023), MUBen divides all datasets with scaffold splitting to minimize the impact of dataset randomness and, consequently, enhances the reliability of the evaluation. Moreover, scaffold splitting alienates the molecular features in each dataset split, inherently creating a challenging OOD setup that better reflects the real-world scenario. For comparison, we also briefly report the results of random splitting. Please check appendix A for detailed descriptions. [...] We adhere to the standard 8:1:1 ratio for training, validation, and test splits across all datasets. The raw datasets can be accessed on the MoleculeNet website.¹ Additionally, we employ the pre-processed versions provided by Zhou et al. (2023), using the identical dataset splits outlined in their study.² |
| Hardware Specification | Yes | The experiments are conducted on a single NVIDIA A100 Tensor Core GPU with a memory capacity of 80GB. |
| Software Dependencies | No | The paper mentions software like "PyTorch framework (Paszke et al., 2019)", "Hugging Face Transformers library (Wolf et al., 2019)", and "PyTorch Geometric (Fey & Lenssen, 2019)", but it does not specify explicit version numbers for these software dependencies. |
| Experiment Setup | Yes | The backbone models are fine-tuned using the AdamW optimizer (Loshchilov & Hutter, 2019) with a weight decay rate of 0.01 using full-precision floating-point numbers (FP32) for maximum compatibility. We apply different learning rates, numbers of training epochs, and batch sizes for the backbone models, as specified in the following paragraphs. We adopt early stopping to select the best-performing checkpoint on the validation set, and all models have achieved their peak validation performance before the training ends. ROC-AUC is selected to assess classification validation performance. For regression, we follow Wu et al. (2018) and use RMSE for Physical Chemistry properties and MAE for Quantum Mechanics. [...] ChemBERTa is fine-tuned with a learning rate of 10⁻⁵, a batch size of 128, and for 200 epochs. A tolerance of 40 epochs for early stopping is adopted. [...] we configure the fine-tuning batch size at 256, and the number of epochs at 100 with a tolerance of 40 epochs for early stopping. The learning rate is set at 10⁻⁴, and the entire model has a dropout ratio of 0.1. |
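The scaffold-based 8:1:1 split quoted in the table can be sketched as follows. This is a minimal illustration: the group-filling order follows common MoleculeNet/DeepChem practice rather than MUBen's exact code, `scaffold_split` is a hypothetical name, and computing the scaffold strings themselves (typically Bemis-Murcko scaffolds via RDKit) is assumed to have been done beforehand.

```python
from collections import defaultdict

def scaffold_split(n_mols, scaffolds, frac_train=0.8, frac_valid=0.1):
    """Assign whole scaffold groups to train/valid/test (8:1:1 by default)
    so that molecules sharing a scaffold never cross split boundaries,
    which yields the OOD-style evaluation described above."""
    groups = defaultdict(list)
    for idx, scaf in enumerate(scaffolds):
        groups[scaf].append(idx)
    # Place the larger scaffold groups first, so the big groups land
    # in the training split and the tails fill validation/test.
    ordered = sorted(groups.values(), key=len, reverse=True)
    train_cutoff = frac_train * n_mols
    valid_cutoff = (frac_train + frac_valid) * n_mols
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) > train_cutoff:
            if len(train) + len(valid) + len(group) > valid_cutoff:
                test.extend(group)
            else:
                valid.extend(group)
        else:
            train.extend(group)
    return train, valid, test
```

Because whole groups are assigned at once, the realized split sizes only approximate 8:1:1 when scaffold groups are large, which is part of what makes the setup out-of-distribution.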
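The fine-tuning recipe in the Experiment Setup row (fixed epoch budget, early stopping with a 40-epoch tolerance, best checkpoint chosen on validation) implies a simple checkpoint-selection loop. Below is a framework-agnostic sketch of that rule; the class name `EarlyStopping` and its interface are illustrative assumptions, not MUBen's actual code.

```python
class EarlyStopping:
    """Track a validation metric; signal a stop after `tolerance`
    consecutive epochs without improvement.

    Use higher_is_better=True for ROC-AUC (classification) and
    higher_is_better=False for RMSE/MAE (regression).
    """

    def __init__(self, tolerance=40, higher_is_better=True):
        self.tolerance = tolerance
        self.higher_is_better = higher_is_better
        self.best_score = None
        self.best_epoch = -1
        self.counter = 0

    def step(self, score, epoch):
        """Record this epoch's validation score; return True to stop."""
        improved = self.best_score is None or (
            score > self.best_score
            if self.higher_is_better
            else score < self.best_score
        )
        if improved:
            # In a real loop, the model checkpoint would be saved here.
            self.best_score, self.best_epoch = score, epoch
            self.counter = 0
        else:
            self.counter += 1
        return self.counter >= self.tolerance
```

For the ChemBERTa configuration quoted above, this would pair with PyTorch's `torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.01)` inside the training loop.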