Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Post Hoc Regression Refinement via Pairwise Rankings

Authors: Kevin Tirta Wijaya, Michael Sun, Minghao Guo, Hans-Peter Seidel, Wojciech Matusik, Vahid Babaei

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments on synthetic and real-world benchmarks, including multiple MPP tasks, demonstrate that RankRefine consistently boosts predictive performance.
Researcher Affiliation	Academia	Kevin Tirta Wijaya MPI-INF EMAIL Michael Sun MIT EMAIL Minghao Guo MIT EMAIL Hans-Peter Seidel MPI-INF EMAIL Wojciech Matusik MIT EMAIL Vahid Babaei MPI-INF EMAIL
Pseudocode	No	The paper describes the RankRefine framework in Section 3 and illustrates it with Figure 1, but it does not contain a dedicated pseudocode block or algorithm steps formatted as code. The methodology is explained using mathematical equations and descriptive text.
Open Source Code	Yes	The source code is available at https://github.com/ktirta/regref.
Open Datasets	Yes	We use nine molecular datasets from the TDC ADME benchmark [Huang et al., 2021]: Caco-2 [Wang et al., 2016], Clearance Microsome and Clearance Hepatocyte [Di et al., 2012], log Half Life [Obach et al., 2008], FreeSolv [Mobley and Guthrie, 2014], Lipophilicity [Wu et al., 2018], PPBR, Solubility [Sorkun et al., 2019], and VDss [Lombardo and Jing, 2016]. We additionally test three tabular regressions: crop-yield prediction from sensor data [Soundankar, 2025], student-performance prediction [Cortez, 2014], and international-education cost estimation [Shamim, 2025]. In the human-as-ranker experiment, we use UTKFace [Zhang et al., 2017] for age estimation.
Dataset Splits	Yes	To emulate low-data regimes, we sample 50 training points from above datasets uniformly at random and merge the remainder with the original test split, repeating this re-split over five random seeds.
Hardware Specification	Yes	Unless stated otherwise, the base model is a random-forest regressor from scikit-learn [Pedregosa et al., 2011] with default hyper-parameters, executed on a single CPU.
Software Dependencies	No	The paper mentions 'scikit-learn' as the library for the random-forest regressor but does not specify a version number. It also mentions 'ChatGPT-4o [OpenAI, 2025]' but this is used as a ranker, not a software dependency for the core methodology's implementation.
Experiment Setup	Yes	Unless stated otherwise, the base model is a random-forest regressor from scikit-learn [Pedregosa et al., 2011] with default hyper-parameters, executed on a single CPU.
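
The low-data re-split quoted under Dataset Splits (50 training points sampled uniformly at random, the remainder merged into the original test split, repeated over five seeds) could be sketched roughly as follows. This is an illustrative sketch only, not the authors' code: the function name `low_data_resplit`, the toy data, and the use of seeds 0 to 4 are assumptions made for the example.

```python
import random


def low_data_resplit(train, test, n_train=50, seed=0):
    """Keep n_train uniformly sampled training points; move the rest to the test set.

    `train` and `test` are lists of examples. The name and signature are
    hypothetical and do not come from the paper or its repository.
    """
    rng = random.Random(seed)
    indices = list(range(len(train)))
    rng.shuffle(indices)  # uniform random permutation of the training indices
    keep = sorted(indices[:n_train])
    spill = sorted(indices[n_train:])
    new_train = [train[i] for i in keep]
    # The unused training points are merged with the original test split.
    new_test = [train[i] for i in spill] + list(test)
    return new_train, new_test


# Toy stand-in data (assumption): 200 training and 60 test (x, y) pairs.
original_train = [(float(i), 2.0 * i) for i in range(200)]
original_test = [(float(i), 2.0 * i) for i in range(200, 260)]

# Repeat the re-split over five seeds; the seed values are an assumption.
splits = [low_data_resplit(original_train, original_test, seed=s) for s in range(5)]
for seed, (tr, te) in enumerate(splits):
    print(seed, len(tr), len(te))
```

Each re-split keeps the total number of examples fixed, so results for a given seed are evaluated on everything that was not sampled for training.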