Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

DSRF: A Dynamic and Scalable Reasoning Framework for Solving RPMs

Authors: Chengtai Li, Yuting He, Jianfeng Ren, Ruibin Bai, Yitian Zhao, Xudong Jiang

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments on six AVR tasks demonstrate DSRF's superior performance, achieving state-of-the-art results under various settings. Code is available here: https://github.com/UNNCRoxLi/DSRF.
Researcher Affiliation	Academia	1 The Digital Port Technologies Lab, School of Computer Science, University of Nottingham Ningbo China; 2 Cixi Institute of Biomedical Engineering, Ningbo Institute of Materials Technology and Engineering, Chinese Academy of Sciences; 3 Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences; 4 Beacons of Excellence Research and Innovation Institute, University of Nottingham Ningbo China; 5 School of Electrical and Electronic Engineering, Nanyang Technological University
Pseudocode	No	The paper describes the architecture and operations in detail using text and diagrams (Figure 1, Figure 2, Figure 3), and provides detailed network architectures in tables (Table 19, 20, 21) in the appendix, but it does not present a clearly labeled pseudocode or algorithm block.
Open Source Code	Yes	Code is available here: https://github.com/UNNCRoxLi/DSRF.
Open Datasets	Yes	DSRF is compared with 16 state-of-the-art models, WReN [7], CoPINet [36], SCL [37], SRAN [8], DCNet [38], MRNet [9], HCV-ARR [14], AlgeMR [13], ARII [39], PredRNet [15], STSN [40], SCAR [17], DRNet [16], TRIVR [41], HP2AI [19] and Slot Abstractors [42] on 6 RPM datasets, namely RAVEN [6], I-RAVEN [8], RAVEN-FAIR [9], PGM [7], Unicode Analogies (UA) [10] and RPM-like Video Prediction (RVP) [41].
Dataset Splits	Yes	We follow the standard evaluation protocol in [6, 8-10]. ... The standard 10-fold evaluation protocol [6] is applied, where six folds are used for training, and two folds each for validation and testing. ... The PGM dataset [7] ... with 1.2M questions for training, 20K for validation, and 200K for testing. ... Following the standard evaluation protocol in [10], a 10-fold evaluation is used, with seven folds for training, and one fold for validation and two folds for testing.
Hardware Specification	Yes	The models are trained with a batch size of 128 on an Intel Xeon Silver 4216 CPU with two NVIDIA RTX A5000 GPUs.
Software Dependencies	No	The Adam optimizer is applied with a learning rate of 1e-3 and weight decay of 1e-5. The batch size is set to 128. More details are provided in Appendix. ... We apply Muon to hidden-layer parameters with dimension at least 2 using a learning rate of 3e-3, and Adam to all remaining parameters using a learning rate of 1e-3, with weight decay set to 1e-5.
Experiment Setup	Yes	The Adam optimizer is applied with a learning rate of 1e-3 and weight decay of 1e-5. The batch size is set to 128. More details are provided in Appendix. ... We apply Muon to hidden-layer parameters with dimension at least 2 using a learning rate of 3e-3, and Adam to all remaining parameters using a learning rate of 1e-3, with weight decay set to 1e-5.
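
The fold-based splits quoted in the Dataset Splits row can be illustrated with a short sketch. It only reproduces the fold counts stated above (6/2/2 for the protocol of [6], 7/1/2 for the protocol of [10]). The function name `split_folds` and the contiguous, index-ordered assignment of folds are assumptions made for illustration; the paper's actual fold assignment may differ.

```python
def split_folds(n_folds, n_train, n_val, n_test):
    """Partition fold indices 0..n_folds-1 into train/val/test groups.

    Assumption: folds are assigned contiguously in index order. The
    source only states how many folds go to each subset.
    """
    if n_train + n_val + n_test != n_folds:
        raise ValueError("fold counts must sum to n_folds")
    folds = list(range(n_folds))
    train = folds[:n_train]
    val = folds[n_train:n_train + n_val]
    test = folds[n_train + n_val:]
    return {"train": train, "val": val, "test": test}


# 6/2/2 split quoted for the 10-fold protocol of [6]
raven_split = split_folds(10, n_train=6, n_val=2, n_test=2)
# 7/1/2 split quoted for the 10-fold protocol of [10]
ua_split = split_folds(10, n_train=7, n_val=1, n_test=2)
```

Validating that the counts sum to the number of folds catches a mistyped protocol before any data is loaded.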
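
The optimizer setup quoted in the Experiment Setup row (Muon for hidden-layer parameters with dimension at least 2, Adam for everything else) can be sketched as a parameter partition. This is a minimal illustration in plain Python, not the authors' code. It assumes that "dimension at least 2" means a tensor with two or more axes, that hidden layers can be identified by a name prefix, and that the weight decay of 1e-5 applies to both groups. The function `partition_parameters`, the prefix `encoder.`, and the parameter names and shapes are all hypothetical.

```python
LR_MUON = 3e-3        # learning rate quoted for Muon
LR_ADAM = 1e-3        # learning rate quoted for Adam
WEIGHT_DECAY = 1e-5   # assumption: applied to both groups


def partition_parameters(named_shapes, hidden_prefixes):
    """Split (name, shape) pairs into a Muon group and an Adam group.

    A parameter goes to the Muon group only if it belongs to a hidden
    layer (assumed: name starts with one of hidden_prefixes) and its
    shape has at least two axes. Everything else goes to Adam.
    """
    prefixes = tuple(hidden_prefixes)
    muon, adam = [], []
    for name, shape in named_shapes:
        if name.startswith(prefixes) and len(shape) >= 2:
            muon.append(name)
        else:
            adam.append(name)
    return [
        {"optimizer": "muon", "lr": LR_MUON,
         "weight_decay": WEIGHT_DECAY, "params": muon},
        {"optimizer": "adam", "lr": LR_ADAM,
         "weight_decay": WEIGHT_DECAY, "params": adam},
    ]


# Hypothetical parameter names and shapes, for illustration only.
example = [
    ("encoder.conv1.weight", (32, 1, 3, 3)),
    ("encoder.conv1.bias", (32,)),
    ("head.weight", (8, 64)),
]
groups = partition_parameters(example, hidden_prefixes=["encoder."])
```

In a real training script the two name lists would be mapped back to the corresponding tensors and passed to the respective optimizers.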