Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

MallowsPO: Fine-Tune Your LLM with Preference Dispersions

Authors: Haoxian Chen, Hanyang Zhao, Henry Lam, David Yao, Wenpin Tang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments, from synthetic bandits, controllable generation, and fine-tuning Pythia 2.8B on the off-policy Anthropic HH dataset, to fine-tuning Llama3-8B-Instruct on an on-policy UltraFeedback-prompt-based dataset. Notably, we perform an exclusive hyperparameter search for a fair comparison, and repeat with different random seeds to justify the significance of the improvement.
Researcher Affiliation | Academia | Haoxian Chen¹, Hanyang Zhao¹, Henry Lam¹, David D. Yao¹, Wenpin Tang¹ (¹Columbia University, Department of IEOR)
Pseudocode | No | The paper contains mathematical derivations and theoretical models but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is available at https://github.com/haoxian-chen/MallowsPO.
Open Datasets | Yes | First, we use the IMDB (Maas et al., 2011) dataset and the Anthropic Helpful and Harmless dialogue (Bai et al., 2022a) dataset to provide evidence that human preferences may be dispersed. [...] The H4 Stack Exchange Preferences Dataset (SE) (Lambert et al., 2023) and Stanford Human Preferences (SHP) (Ethayarajh et al., 2022) are used for evaluation.
Dataset Splits | Yes | For the in-distribution test, we first fine-tune a pretrained Pythia-2.8B model on the training set of the Anthropic HH dataset using MallowsPO and DPO, and then evaluate their responses on a subset of its test split. GPT-4 serves as the evaluator and compares pairs of responses. [...] In the conditional-generation task on IMDB, x is a movie-review prefix, and the LM is to generate an output y with positive sentiment. Following the setting in Rafailov et al. (2023), we first fine-tune GPT-2-large on the training split of the IMDB dataset until convergence to obtain the SFT model, and use the pairwise preference data from Wang et al. (2023) to further fine-tune it with DPO and MallowsPO.
Hardware Specification | No | The paper mentions the models used for experiments (e.g., Pythia 2.8B, Llama3-8B-Instruct) but does not provide specific details about the hardware used to run them (e.g., GPU models, CPU types, or memory).
Software Dependencies | No | The paper mentions using the fastchat package for GPT-4 evaluation but does not provide version numbers for it or for any other key software dependencies, such as programming languages, libraries, or frameworks.
Experiment Setup | Yes | In terms of the training details, we use all 16 data points in a single batch and adopt SGD as the optimizer, with a learning rate of 5e-3. To ensure convergence, we run the optimization for a large number of epochs, set to 500,000. For MallowsPO-ϕ, we set ϕ to 0.05. [...] By default, we use the RMSprop optimizer with a learning rate of 1e-6 and a linear learning-rate warmup from 0 to 1e-6 over the first 150 steps. The training batch size is 64. [...] We compare the performance of DPO and MallowsPO-θ in 6 configurations, combining commonly used β ∈ {0.01, 0.05, 0.1} and lr ∈ {1e-6, 5e-7}.
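To make the quoted experiment setup easier to scan, the reported hyperparameters can be collected into a small Python sketch. This is only a reading aid under stated assumptions: the dictionary and variable names below are illustrative choices of ours, not identifiers from the authors' released code, and the values are simply those quoted above.

```python
from itertools import product

# Synthetic-bandit training details as reported in the paper
# (names are illustrative, not from the authors' code).
synthetic_bandit_setup = {
    "batch_size": 16,        # all 16 data points in a single batch
    "optimizer": "SGD",
    "learning_rate": 5e-3,
    "epochs": 500_000,       # run long enough to ensure convergence
    "mallows_phi": 0.05,     # dispersion parameter for MallowsPO-phi
}

# Default fine-tuning setup as reported
default_training_setup = {
    "optimizer": "RMSprop",
    "learning_rate": 1e-6,
    "warmup_steps": 150,     # linear warmup from 0 to 1e-6
    "batch_size": 64,
}

# The 6 configurations compared for DPO vs. MallowsPO-theta:
# all combinations of beta in {0.01, 0.05, 0.1} and lr in {1e-6, 5e-7}.
betas = [0.01, 0.05, 0.1]
learning_rates = [1e-6, 5e-7]
configs = [{"beta": b, "lr": lr} for b, lr in product(betas, learning_rates)]
print(len(configs))  # 6
```

Enumerating the grid with `itertools.product` makes it easy to confirm that the β × lr combinations indeed yield the 6 configurations mentioned in the quote.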