Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Walking the Tightrope: Autonomous Disentangling Beneficial and Detrimental Drifts in Non-Stationary Custom-Tuning

Authors: Xiaoyu Yang, Jie Lu, En Yu

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In this section, we verify the robustness, generalization and coordination of our proposed counterfactual preference optimization in reinforced custom-tuning in non-stationary environments. ... As exhibited in Table 2, our counterfactual preference optimization approach achieves the superior overall performance of 81.8%... Beyond the classification, we verify our main contribution of accurate reasoning... As exhibited in Table 3, the experiments of diagnostic report generation on MIMIC-CXR are conducted to assess the performance... Furthermore, we validated the generalization of our model on downstream tasks with zero-shot multi-label classification across six different benchmarks, as presented in Table 4. ... Moreover, we conduct ablation experiments on MIMIC-CXR to validate the feasibility and coordination of the chain-of-thought (CoT) and counterfactual preference optimization (CPO) within reinforced fine-tuning (RFT) in non-stationary environments, as presented in Table 5.
Researcher Affiliation	Academia	Xiaoyu Yang, Jie Lu, En Yu. Australian Artificial Intelligence Institute (AAII), Faculty of Engineering and Information Technology, University of Technology Sydney, Australia.
Pseudocode	No	The paper describes its methodology using descriptive text, mathematical formulas, and diagrams, but it does not contain any explicit pseudocode blocks or algorithm listings.
Open Source Code	Yes	Our code and data are public at: https://github.com/XiaoyuYoung/CPO. ... We made our code and data publicly available on GitHub as anonymous. The anonymous link is https://anonymous.4open.science/r/CPO-FD61/.
Open Datasets	Yes	Finally, we contribute CXR-CounterFact (CCF), the chest diagnosis preference dataset comprising 320,416 fine-curated counterfactual reasoning trajectories derived from MIMIC-CXR [11] radiologic findings... Our code and data are public at: https://github.com/XiaoyuYoung/CPO. ... We develop a dataset called CXR-CounterFact Dataset (CCF), extending MIMIC-CXR [11] with counterfactual chain-of-thought. This novel dataset introduces 320,416 meticulously curated counterfactual pairs spanning 14 thoracic pathologies... MIMIC-CXR [11] is utilized to train the MLLMs via reinforced custom-tuning for domain adaptation, which presents 371,920 chest X-rays associated with 227,943 imaging studies from 65,079 patients. ... We selected the MIMIC-CXR [11] dataset not only for its well-established benchmark enabling rigorous performance evaluation on real-world downstream medical tasks... We also contribute a large-scale dataset CXR-CounterFact (CCF), comprising 320,416 meticulously curated counterfactual reasoning trajectories derived from MIMIC-CXR. Our code and data are public at: https://github.com/XiaoyuYoung/CPO.
Dataset Splits	Yes	MIMIC-CXR [11] is utilized to train the MLLMs via reinforced custom-tuning for domain adaptation... First, to explicitly demonstrate the superior performance of our proposed method in non-stationary environments, especially in robustness, we compare it with other models on MS-CXR-T [17], where instances are chosen from the public MIMIC-CXR. ... The results are based on the test split of MS-CXR-T, with Top-1 accuracy (Acc) as the metric.
Hardware Specification	Yes	Unless otherwise specified, the supervised fine-tuning of our multi-modal large language model consists of 660 steps, executed on 2 × 2 NVIDIA A100 GPUs. ... The reinforced custom-tuning consists of 7,750 steps, executed on 2 × 2 NVIDIA A100 GPUs. ... The construction of the CCF dataset took approximately five days on four A100 GPUs, including both inference and validation.
Software Dependencies	No	The paper mentions using Qwen2.5-VL (7B) [16] as the pre-trained model and AdamW optimizer with specific beta values, but it does not specify software environment details such as programming language versions (e.g., Python), or library versions (e.g., PyTorch, TensorFlow).
Experiment Setup	Yes	In terms of the supervised fine-tuning progress, the hyperparameters are presented in Table 6. Qwen2.5-VL (7B) [16] is applied as our pre-trained model. During the SFT, we utilize the AdamW optimizer, which is configured with a cosine annealing schedule as the learning policy. The initial learning rate is set to 1 × 10^-4, and the AdamW optimizer is employed with hyperparameters β = (0.9, 0.98). Additionally, we set the weight decay to 0.05 and the dropout rate to 0.1. During the first 20 warm-up steps, the learning rate increases to 1 × 10^-4, and subsequently decays to 10^-7. Unless otherwise specified, the supervised fine-tuning of our multi-modal large language model consists of 660 steps, executed on 2 × 2 NVIDIA A100 GPUs. ... While in the counterfactual preference optimization (CPO), the initial learning rate is reduced to 2 × 10^-5 without the warmup. The visual encoder and text decoder are frozen out of the training. Thus, the batch size can be decreased to 4. The reinforced custom-tuning consists of 7,750 steps, executed on 2 × 2 NVIDIA A100 GPUs. Other training parameters are the same as the fine-tuning.
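
The supervised fine-tuning schedule quoted in the Experiment Setup row (20 warm-up steps up to 1 × 10^-4, then cosine annealing down to 10^-7 over 660 steps) can be illustrated with a minimal sketch. This is not the authors' code. The function name, the linear warm-up shape, the zero-based step indexing, and the choice to spread the cosine decay over all remaining steps are assumptions made here for illustration; the quoted text only gives the peak rate, the final rate, the warm-up length, and the total step count.

```python
import math

# Values quoted in the Experiment Setup row for supervised fine-tuning.
PEAK_LR = 1e-4       # learning rate reached at the end of warm-up
FINAL_LR = 1e-7      # learning rate the schedule decays to
WARMUP_STEPS = 20    # number of warm-up steps
TOTAL_STEPS = 660    # total SFT steps


def sft_learning_rate(step, peak_lr=PEAK_LR, final_lr=FINAL_LR,
                      warmup_steps=WARMUP_STEPS, total_steps=TOTAL_STEPS):
    """Return the learning rate at a zero-based training step.

    Hypothetical helper: linear warm-up is an assumption, since the
    source does not state the warm-up shape.
    """
    if step < warmup_steps:
        # Assumed linear ramp that reaches peak_lr on the last warm-up step.
        return peak_lr * ((step + 1) / warmup_steps)
    # Fraction of the post-warm-up phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    # Cosine annealing from peak_lr down to final_lr.
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return final_lr + (peak_lr - final_lr) * cosine


if __name__ == "__main__":
    for s in (0, 19, 20, 340, 660):
        print(s, sft_learning_rate(s))
```

For the CPO stage, the same row states an initial rate of 2 × 10^-5 with no warm-up over 7,750 steps. Under the same assumptions, that would correspond to calling the helper with `peak_lr=2e-5`, `warmup_steps=0`, and `total_steps=7750`.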