Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

On a Connection Between Imitation Learning and RLHF

Authors: Teng Xiao, Yige Yuan, Mingxiao Li, Zhengyu Chen, Vasant Honavar

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that DIL outperforms existing methods on various challenging benchmarks. The code for DIL is available at https://github.com/tengxiao1/DIL. [...] Empirically, we validate the effectiveness of DIL on widely used benchmarks, demonstrating that it outperforms previous alignment methods.
Researcher Affiliation | Collaboration | Teng Xiao, Yige Yuan, Mingxiao Li, Zhengyu Chen, Vasant G. Honavar (Pennsylvania State University; University of Chinese Academy of Sciences; Tencent AI Lab; Meituan Inc.)
Pseudocode | No | The paper describes methods and derivations but does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code for DIL is available at https://github.com/tengxiao1/DIL.
Open Datasets | Yes | We evaluate DIL on widely used datasets: the UltraFeedback Binarized dataset (Cui et al., 2023; Tunstall et al., 2023), the Reddit TL;DR summarization dataset (Völske et al., 2017), and the Anthropic-HH dataset (Bai et al., 2022). The details of these datasets are provided in Appendix B.1. [...] UltraFeedback Binarized (Cui et al., 2023; Tunstall et al., 2023): This dataset [...] Anthropic-HH (Bai et al., 2022): The Anthropic Helpful and Harmless dialogue dataset [...] Reddit TL;DR Summarization (Völske et al., 2017): This dataset
Dataset Splits | No | The paper mentions using specific datasets and a "5-shot setting for GSM8K, and 25-shot for ARC" for evaluation, but does not provide explicit training/validation/test splits (e.g., percentages, sample counts, or citations to predefined splits) in the main text or appendix for its own experimental setup.
Hardware Specification | Yes | All training experiments described in this paper were conducted using four NVIDIA A100 80GB GPUs with a batch size of 128, utilizing the codebase from the alignment-handbook repository.
Software Dependencies | No | The paper mentions using the Adam optimizer (Kingma, 2014), GPT-4 for zero-shot pairwise evaluation, and the codebase from the alignment-handbook repository, but does not provide version numbers for core software components such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | Specifically, during the SFT stage, we applied a learning rate of 2e-5. For both the SFT and preference optimization stages, we used a batch size of 128, a maximum sequence length of 2048, and a cosine learning rate schedule with 10% warmup steps for a single epoch, using the Adam optimizer (Kingma, 2014). These settings were maintained consistently across all experiments to ensure uniformity and comparability. For method-specific hyperparameters, we also adhered to the search strategy outlined in SimPO; the search space for each baseline method is detailed in Table 5. Learning rates for each method were individually searched within the range [3e-7, 5e-7, 6e-7, 1e-6].
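For reference, the shared hyperparameters quoted in the Experiment Setup row can be collected into a small configuration sketch. This is plain Python for illustration only: the key names are assumptions, not the actual schema of the alignment-handbook configs the authors used.

```python
# Shared training settings as reported in the paper's experiment setup.
# Key names are illustrative; the alignment-handbook config schema may differ.
shared_config = {
    "sft_learning_rate": 2e-5,    # SFT stage only
    "batch_size": 128,            # both SFT and preference optimization
    "max_seq_length": 2048,
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.10,         # 10% warmup steps
    "num_train_epochs": 1,
    "optimizer": "adam",
}

# Method-specific learning rates were searched over this grid (per the paper):
preference_lr_grid = [3e-7, 5e-7, 6e-7, 1e-6]
```

A grid like `preference_lr_grid` would be swept once per baseline method, with the remaining settings in `shared_config` held fixed across all runs.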