Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

On a Connection Between Imitation Learning and RLHF

Authors: Teng Xiao, Yige Yuan, Mingxiao Li, Zhengyu Chen, Vasant Honavar

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that DIL outperforms existing methods on various challenging benchmarks. The code for DIL is available at https://github.com/tengxiao1/DIL. [...] Empirically, we validate the effectiveness of DIL on widely used benchmarks, demonstrating that it outperforms previous alignment methods.
Researcher Affiliation | Collaboration | Teng Xiao, Yige Yuan, Mingxiao Li, Zhengyu Chen, Vasant G. Honavar (Pennsylvania State University; University of Chinese Academy of Sciences; Tencent AI Lab; Meituan Inc.)
Pseudocode | No | The paper describes methods and derivations but does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code for DIL is available at https://github.com/tengxiao1/DIL.
Open Datasets | Yes | We evaluate DIL on widely used datasets: the UltraFeedback Binarized dataset (Cui et al., 2023; Tunstall et al., 2023), the Reddit TL;DR summarization dataset (Völske et al., 2017), and the Anthropic-HH dataset (Bai et al., 2022). The details of these datasets are provided in Appendix B.1. [...] UltraFeedback Binarized (Cui et al., 2023; Tunstall et al., 2023): This dataset [...] Anthropic-HH (Bai et al., 2022): The Anthropic Helpful and Harmless dialogue dataset [...] Reddit TL;DR Summarization (Völske et al., 2017): This dataset
Dataset Splits | No | The paper mentions using specific datasets and a "5-shot setting for GSM8K, and 25-shot for ARC" for evaluation, but does not provide explicit training/validation/test splits (e.g., percentages, sample counts, or citations to predefined splits) in the main text or appendix for its own experimental setup.
Hardware Specification | Yes | All training experiments described in this paper were conducted using four NVIDIA A100 80GB GPUs with a batch size of 128, utilizing the codebase from the alignment-handbook repository.
Software Dependencies | No | The paper mentions using the Adam optimizer (Kingma, 2014), GPT-4 for zero-shot pairwise evaluation, and the codebase from the alignment-handbook repository, but does not provide version numbers for core software components such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | Specifically, during the SFT stage, we applied a learning rate of 2e-5. For both the SFT and preference optimization stages, we used a batch size of 128, a maximum sequence length of 2048, and a cosine learning rate schedule with 10% warmup steps for a single epoch, using the Adam optimizer (Kingma, 2014). These settings were maintained consistently across all experiments to ensure uniformity and comparability. For method-specific hyperparameters, we also adhered to the search strategy outlined in SimPO; the search space for each baseline method is detailed in Table 5. Learning rates for each method were individually searched within the range [3e-7, 5e-7, 6e-7, 1e-6].
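For reference, the shared hyperparameters quoted in the Experiment Setup row can be collected into a small configuration sketch. This is plain Python for illustration only: the key names are assumptions, not the actual schema of the alignment-handbook configs the authors used.

```python
# Shared training settings as reported in the paper's experiment setup.
# Key names are illustrative; the alignment-handbook config schema may differ.
shared_config = {
    "sft_learning_rate": 2e-5,    # SFT stage only
    "batch_size": 128,            # both SFT and preference optimization
    "max_seq_length": 2048,
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.10,         # 10% warmup steps
    "num_train_epochs": 1,
    "optimizer": "adam",
}

# Method-specific learning rates were searched over this grid (per the paper):
preference_lr_grid = [3e-7, 5e-7, 6e-7, 1e-6]
```

A grid like `preference_lr_grid` would be swept once per baseline method, with the remaining settings in `shared_config` held fixed across all runs.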