Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Post Hoc Regression Refinement via Pairwise Rankings
Authors: Kevin Tirta Wijaya, Michael Sun, Minghao Guo, Hans-Peter Seidel, Wojciech Matusik, Vahid Babaei
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on synthetic and real-world benchmarks, including multiple MPP tasks, demonstrate that RankRefine consistently boosts predictive performance. |
| Researcher Affiliation | Academia | Kevin Tirta Wijaya MPI-INF EMAIL Michael Sun MIT EMAIL Minghao Guo MIT EMAIL Hans-Peter Seidel MPI-INF EMAIL Wojciech Matusik MIT EMAIL Vahid Babaei MPI-INF EMAIL |
| Pseudocode | No | The paper describes the RankRefine framework in Section 3 and illustrates it with Figure 1, but it does not contain a dedicated pseudocode block or algorithm steps formatted as code. The methodology is explained using mathematical equations and descriptive text. |
| Open Source Code | Yes | The source code is available at https://github.com/ktirta/regref. |
| Open Datasets | Yes | We use nine molecular datasets from the TDC ADME benchmark [Huang et al., 2021]: Caco-2 [Wang et al., 2016], Clearance Microsome and Clearance Hepatocyte [Di et al., 2012], log Half Life [Obach et al., 2008], FreeSolv [Mobley and Guthrie, 2014], Lipophilicity [Wu et al., 2018], PPBR, Solubility [Sorkun et al., 2019], and VDss [Lombardo and Jing, 2016]. We additionally test three tabular regressions: crop-yield prediction from sensor data [Soundankar, 2025], student-performance prediction [Cortez, 2014], international-education cost estimation [Shamim, 2025]. In the human-as-ranker experiment, we use UTKFace [Zhang et al., 2017] for age estimation. |
| Dataset Splits | Yes | To emulate low-data regimes, we sample 50 training points from above datasets uniformly at random and merge the remainder with the original test split, repeating this re-split over five random seeds. |
| Hardware Specification | Yes | Unless stated otherwise, the base model is a random-forest regressor from scikit-learn [Pedregosa et al., 2011] with default hyper-parameters, executed on a single CPU. |
| Software Dependencies | No | The paper mentions 'scikit-learn' as the library for the random-forest regressor but does not specify a version number. It also mentions 'ChatGPT-4o [OpenAI, 2025]' but this is used as a ranker, not a software dependency for the core methodology's implementation. |
| Experiment Setup | Yes | Unless stated otherwise, the base model is a random-forest regressor from scikit-learn [Pedregosa et al., 2011] with default hyper-parameters, executed on a single CPU. |
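
The low-data re-split quoted in the Dataset Splits row can be illustrated with a short sketch. This is not taken from the paper's repository: the function name `low_data_resplit`, the toy data, and the seed values 0 to 4 are hypothetical. It also assumes that the 50 points are drawn from each dataset's original training split, which the quoted sentence suggests but does not state outright.

```python
import random


def low_data_resplit(train, test, n_train=50, seed=0):
    """Hypothetical helper: keep n_train uniformly sampled training rows,
    and move every other training row into the test split."""
    if n_train > len(train):
        raise ValueError("n_train exceeds the size of the training split")
    rng = random.Random(seed)
    # Sample row indices without replacement, uniformly at random.
    picked = set(rng.sample(range(len(train)), n_train))
    new_train = [row for i, row in enumerate(train) if i in picked]
    leftover = [row for i, row in enumerate(train) if i not in picked]
    # The remainder of the training split is merged with the original test split.
    return new_train, leftover + list(test)


# Toy stand-in for one dataset: (feature, target) pairs, not real benchmark data.
train = [(i, float(i)) for i in range(200)]
test = [(i, float(i)) for i in range(200, 260)]

# Repeat the re-split over five seeds (the seed values are an assumption).
splits = [low_data_resplit(train, test, n_train=50, seed=s) for s in range(5)]
```

Each entry of `splits` holds a 50-row training set and an enlarged test set; a base regressor would then be fit and evaluated once per seed.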