Generative Judge for Evaluating Alignment

Authors: Junlong Li, Shichao Sun, Weizhe Yuan, Run-Ze Fan, Hai Zhao, Pengfei Liu

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimentally, AUTO-J outperforms a series of strong competitors, including both open-source and closed-source models, by a large margin. We also provide detailed analysis and case studies to further reveal the potential of our method and make a variety of resources public at https://github.com/GAIR-NLP/auto-j.
Researcher Affiliation | Collaboration | Junlong Li (1,6), Shichao Sun (3,6), Weizhe Yuan (4), Run-Ze Fan (5,6), Hai Zhao (1), Pengfei Liu (1,2,6); 1 Shanghai Jiao Tong University, 2 Shanghai Artificial Intelligence Laboratory, 3 Hong Kong Polytechnic University, 4 New York University, 5 Chinese Academy of Sciences, 6 Generative AI Research Lab (GAIR)
Pseudocode | No | The paper describes methods and processes but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | To demonstrate the efficacy of our approach, we construct a new testbed covering 58 different scenarios. Experimentally, AUTO-J outperforms a series of strong competitors, including both open-source and closed-source models, by a large margin. We also provide detailed analysis and case studies to further reveal the potential of our method and make a variety of resources public at https://github.com/GAIR-NLP/auto-j.
Open Datasets | Yes | To start with, we first collect a large collection of data from the following sources: Chatbot Arena Conversations and MT-Bench (Zheng et al., 2023), OpenAI Summary (Stiennon et al., 2020), OpenAI WebGPT (Nakano et al., 2021), Stanford SHP (Ethayarajh et al., 2022), Synthetic GPT-J (Havrilla, 2023), and PKU-SafeRLHF (Ji et al., 2023). All these datasets are publicly available preference datasets with human preference comparisons containing two model-generated responses (win, lose, or tie) sharing the same query (and previous dialogue).
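The shared structure of these preference datasets (two responses to one query, plus a human verdict) can be sketched as a small record type. This is purely illustrative; the field names are assumptions, not the actual schema used by any of the datasets above.

```python
from dataclasses import dataclass

# Hypothetical record for one preference comparison: two model-generated
# responses to the same query (and prior dialogue), with a human label
# indicating which response wins, or a tie.
@dataclass
class PreferencePair:
    query: str        # shared user query, including any previous dialogue
    response_a: str   # first model-generated response
    response_b: str   # second model-generated response
    label: str        # "a" (a wins), "b" (b wins), or "tie"

pair = PreferencePair(
    query="Summarize this article in two sentences.",
    response_a="The article argues ...",
    response_b="In short, ...",
    label="a",
)
assert pair.label in {"a", "b", "tie"}
```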
Dataset Splits | No | The paper specifies an "8:2 train/test split" for the scenario classifier but does not explicitly mention a separate validation split for the main model training or evaluation process.
Hardware Specification | Yes | We train AUTO-J from LLaMA-2-13B-chat (Touvron et al., 2023b) with the DeepSpeed (Rasley et al., 2020) library, Zero Redundancy Optimizer (ZeRO) (Rajbhandari et al., 2020; Ren et al., 2021) Stage 3, gradient-checkpointing (Chen et al., 2016) and FlashAttention (Dao et al., 2022; Dao, 2023) on 8 NVIDIA A100 GPUs.
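A DeepSpeed setup like the one described (ZeRO Stage 3) is driven by a JSON config. The paper does not publish its config, so the fragment below is only a minimal sketch: the batch size of 64 matches the reported experiment setup, while the remaining fields and values are assumptions.

```python
import json

# Illustrative DeepSpeed config enabling ZeRO Stage 3.
# Only train_batch_size comes from the paper; other entries are assumed.
ds_config = {
    "train_batch_size": 64,            # reported batch size
    "zero_optimization": {"stage": 3}, # ZeRO Stage 3, as in the paper
    "bf16": {"enabled": True},         # assumed mixed-precision choice
    "gradient_clipping": 1.0,          # assumed default
}
print(json.dumps(ds_config, indent=2))
```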
Software Dependencies | No | The paper mentions software components like "DeepSpeed", "Zero Redundancy Optimizer", "FlashAttention", and the base model "LLaMA-2-13B-chat", but it does not provide specific version numbers for these software dependencies as required for reproducibility.
Experiment Setup | Yes | The model is trained for 5 epochs (675 parameter update steps in total) and we save checkpoints for every 50 steps. We use AdamW (Loshchilov & Hutter, 2017) as our optimizer with β1 = 0.9, β2 = 0.95 and weight decay of 0.1. We use a peak learning rate 1e-5 with 3% warmup steps and cosine learning rate decay to 0, and set the batch size to 64 and maximum sequence length to 4,096.
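The learning-rate schedule described above (peak 1e-5, 3% linear warmup over 675 total steps, cosine decay to 0) can be written out as a small function. This is a sketch built from the reported numbers, not the paper's actual training code.

```python
import math

TOTAL_STEPS = 675                          # reported parameter-update steps
WARMUP_STEPS = round(0.03 * TOTAL_STEPS)   # 3% warmup -> 20 steps
PEAK_LR = 1e-5                             # reported peak learning rate

def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to 0 at TOTAL_STEPS."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

# Sanity checks: peak is reached at the end of warmup, and the rate
# decays to (essentially) zero at the final step.
assert abs(lr_at(WARMUP_STEPS - 1) - PEAK_LR) < 1e-12
assert lr_at(TOTAL_STEPS) < 1e-12
```

Pairing this schedule with AdamW(betas=(0.9, 0.95), weight_decay=0.1) would mirror the optimizer settings quoted above.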