Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Offline RL by Reward-Weighted Fine-Tuning for Conversation Optimization

Authors: Subhojyoti Mukherjee, Viet Lai, Raghavendra Addanki, Ryan Rossi, Seunghyun Yoon, Trung Bui, Anup B. Rao, Jayakumar Subramanian, Branislav Kveton

NeurIPS 2025

Reproducibility Variable Result LLM Response
Research Type Experimental To showcase the value of our approach, we apply it to learning short-horizon question-answering policies of a fixed length, where the agent reasons about potential answers or asks clarifying questions. Our work stands in a stark contrast to state-of-the-art methods in this domain, based on SFT and direct preference optimization, which have additional hyper-parameters and do not directly optimize for rewards. We compare to them empirically, and report major gains in both optimized rewards and language quality. (...) 4 Experiments: We evaluate our methods on 6 datasets. OpenBookQA [43], ARC [14], SciQA [71], and MMLU [22] are standard QA benchmarks.
Researcher Affiliation Industry Subhojyoti Mukherjee, Viet Dac Lai, Raghavendra Addanki, Ryan Rossi (Adobe Research); Seunghyun Yoon, Trung Bui, Anup Rao, Jayakumar Subramanian, Branislav Kveton (Adobe Research)
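Pseudocode Yes Algorithm 1 Refit / Swift 1: Input: Learning rate schedule $(\alpha_i)_{i \in \mathbb{N}}$ 2: Generate a logged dataset $D = \{(x, \tau_n, r)\}$, where $r \in \mathbb{R}$ is a reward of $\tau_n$ (Refit) or a standardized reward of $\tau_n$ (Swift) 3: Initialize $\theta$ and $i \leftarrow 1$ 4: for all $(x, \tau_n, r) \in D$ do 5: $g_i \leftarrow r \sum_{t=1}^{n} \nabla \log \pi(a_t \mid x, \tau_{t-1}; \theta)$ 6: $\theta \leftarrow \theta + \alpha_i g_i$ and $i \leftarrow i + 1$ 7: Output: Learned policy $\theta$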
Open Source Code No Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: We did not get an approval to release the code. Despite this, Refit and Swift are trivial to implement. We provide extensive details in Appendix to reproduce our results.
Open Datasets Yes 4 Experiments: We evaluate our methods on 6 datasets. OpenBookQA [43], ARC [14], SciQA [71], and MMLU [22] are standard QA benchmarks. We convert a text-to-SQL conversation dataset CoSQL [73] and math tutoring dataset MathDial [42] into QA-style conversational datasets. Our datasets cover a variety of domains and are described in more detail in Appendix D.
Dataset Splits Yes Experimental Setup: For our experiments, we randomly selected 500 samples from each dataset, allocating 400 for training and 100 for testing. We created conversations with 3 turns and generated 3 random runs (trajectories) with different temperatures using our Base model.
Hardware Specification No Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [Yes] Justification: We report compute resources in Appendix. (...) Hardware Configuration: num_processes = 2, num_machines = 1
Software Dependencies No G Model and Training Parameters: In this section, we present the model configuration and training parameters for our framework in Tables 23 to 27. Table 23: Llama 3.1 8B Instruct Configuration; Table 24: Accelerate DeepSpeed Configuration; Table 25: Accelerate DeepSpeed Configuration for Knowledge Distillation; Table 26: TRL Supervised Fine-Tuning Configuration with Customized model RL Reweighting for Refit and Swift
Experiment Setup Yes G Model and Training Parameters: In this section, we present the model configuration and training parameters for our framework in Tables 23 to 27. Table 26: TRL Supervised Fine-Tuning Configuration with Customized model RL Reweighting for Refit and Swift. Training Parameters: learning_rate = 3e-5, num_train_epochs = 4, per_device_train_batch_size = 8, gradient_accumulation_steps = 4, gradient_checkpointing = True, mixed_precision = bf16, do_train = True, do_eval = False, logging_steps = 5, logging_first_step = True, save_strategy = epoch, save_total_limit = 4
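
Because the code was not released, the update rule quoted in the Pseudocode row may be easier to follow as a small runnable sketch. The Python below is an illustration only, not the authors' implementation: it swaps the LLM policy for a toy context-free softmax policy over a handful of actions, and the names `reward_weighted_fit`, `grad_log_prob`, and `standardize` are hypothetical. Standardizing rewards over the whole logged dataset is also an assumption; the excerpt says only that Swift uses "a standardized reward".

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_prob(theta, action):
    """Gradient of log softmax(theta)[action] with respect to theta."""
    probs = softmax(theta)
    return [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]


def standardize(rewards):
    """Zero-mean, unit-variance rewards (assumed: computed over the whole dataset)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) or 1.0  # fall back to 1.0 when all rewards are equal
    return [(r - mean) / std for r in rewards]


def reward_weighted_fit(dataset, num_actions, learning_rate=lambda i: 0.1, swift=False):
    """One pass over logged (trajectory, reward) pairs, following Algorithm 1.

    dataset: list of (trajectory, reward); a trajectory is a list of action indices.
    The toy policy ignores the context x and the history, which is a simplification.
    """
    rewards = [r for _, r in dataset]
    weights = standardize(rewards) if swift else rewards
    theta = [0.0] * num_actions  # step 3: initialize theta
    i = 1                        # step 3: i <- 1
    for (trajectory, _), w in zip(dataset, weights):  # step 4
        g = [0.0] * num_actions
        for action in trajectory:  # step 5: reward-weighted sum of log-prob gradients
            for k, gk in enumerate(grad_log_prob(theta, action)):
                g[k] += w * gk
        alpha = learning_rate(i)
        theta = [t + alpha * gk for t, gk in zip(theta, g)]  # step 6
        i += 1
    return theta  # step 7: learned policy parameters
```

With raw rewards (Refit), trajectories with zero reward contribute no gradient. With standardized rewards (Swift), below-average trajectories receive negative weights and so push probability mass away from their actions.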