Aligning to Thousands of Preferences via System Message Generalization

Authors: Seongyun Lee, Sue Hyun Park, Seungone Kim, Minjoon Seo

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Using this dataset, we train a 7B LLM called JANUS and test it on 921 prompts from 5 benchmarks (AlpacaEval 2.0, FLASK, Koala, MT-Bench, and Self-Instruct) by adding system messages that reflect unseen user values. (A minimal sketch of this system-message evaluation protocol follows the table.)
Researcher Affiliation | Academia | KAIST AI, Carnegie Mellon University
Pseudocode | No | The paper describes its methods in narrative text and does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Our code, dataset, benchmark, and models are available at https://lklab.kaist.ac.kr/Janus/.
Open Datasets | Yes | We first select 66k instructions from a pool of five existing high-quality preference datasets: Chatbot Arena Conversations [92], Domain-Specific Preference dataset [8], UltraFeedback-binarized-clean [9], Nectar [94], and OpenHermesPreferences [23].
Dataset Splits | No | The paper describes training on the MULTIFACETED COLLECTION and evaluating on benchmarks (MULTIFACETED BENCH, AlpacaEval 2.0, MT-Bench, Arena Hard Auto v0.1) that serve as test sets. However, it does not state whether a separate validation split was used for model training or hyperparameter tuning, nor give its size or proportion.
Hardware Specification | Yes | To train JANUS 7B, we utilize four NVIDIA A100 80GB GPUs, and for inference, four NVIDIA RTX A6000 GPUs are employed. Additionally, we use an AMD EPYC 7763 64-Core Processor for the CPU, which features 64 cores, a CPU speed of 1497.674 MHz, and a cache size of 512KB.
Software Dependencies | No | The paper names the libraries it uses (e.g., the axolotl library, OpenRLHF, the vLLM library, FlashAttention-2, DeepSpeed ZeRO-3) but does not provide version numbers for any of them, which a reproducible description requires. (A sketch for recording these versions follows the table.)
Experiment Setup | Yes | For instruction tuning, the configuration includes a maximum sequence length of 8192, gradient accumulation steps of 4, a micro batch size of 2, and four epochs. We use the adamw_bnb_8bit optimizer, with a cosine learning rate scheduler and a learning rate of 5e-6. Additionally, we employ gradient checkpointing, FlashAttention-2 [10], and mixed precision for efficient training. Warm-up steps are set at 10 and weight decay at 0, with checkpoints saved after each epoch.
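
To make the Experiment Setup row concrete, the sketch below expresses the reported hyperparameters as a Hugging Face TrainingArguments object. This is a minimal illustration only: the authors train with the axolotl library, so the Trainer API, the output path, and the use of bf16 for mixed precision are assumptions, not the authors' configuration.

```python
# Minimal sketch: the Experiment Setup hyperparameters expressed as Hugging Face
# TrainingArguments. The paper trains JANUS 7B with the axolotl library, so this
# is only an illustrative equivalent, not the authors' configuration file.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="janus-7b-sft",          # hypothetical output path
    num_train_epochs=4,                 # four epochs
    per_device_train_batch_size=2,      # micro batch size of 2
    gradient_accumulation_steps=4,      # gradient accumulation steps of 4
    learning_rate=5e-6,                 # learning rate of 5e-6
    lr_scheduler_type="cosine",         # cosine learning rate scheduler
    warmup_steps=10,                    # warm-up steps set at 10
    weight_decay=0.0,                   # weight decay of 0
    optim="adamw_bnb_8bit",             # 8-bit AdamW optimizer
    gradient_checkpointing=True,        # gradient checkpointing enabled
    bf16=True,                          # mixed precision (bf16 is an assumption)
    save_strategy="epoch",              # checkpoints saved after each epoch
)

# The 8192-token maximum sequence length is applied at tokenization time,
# e.g. tokenizer(..., truncation=True, max_length=8192), and FlashAttention-2
# is enabled when loading the model (attn_implementation="flash_attention_2").
```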
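The evaluation protocol noted in the Research Type row, i.e. prepending a system message that encodes an unseen user's preferences to each benchmark instruction, can be sketched as follows. The model identifier, the example system message, and the use of the transformers chat-template API (rather than the vLLM setup mentioned in the paper) are assumptions for illustration.

```python
# Minimal sketch of the evaluation protocol: a preference-describing system
# message is prepended to the benchmark instruction before generation.
# The model id and the system message below are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "kaist-ai/janus-7b"  # assumed Hugging Face id for JANUS 7B
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [
    # System message reflecting a (hypothetical) unseen user's values.
    {"role": "system", "content": "You are an assistant for a novice cook who "
     "prefers concise, step-by-step answers with metric units."},
    # Benchmark instruction (e.g., from AlpacaEval 2.0 or MT-Bench).
    {"role": "user", "content": "How do I make a simple tomato soup?"},
]

inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```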
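Because the Software Dependencies row flags missing version numbers, anyone reproducing the setup would need to record them in their own environment. The snippet below is one way to do so with importlib.metadata; the distribution names are assumed mappings to PyPI packages for the libraries mentioned in the paper.

```python
# Sketch: record installed versions of the libraries named in the paper, since
# the paper itself does not pin them. Distribution names are assumptions about
# the corresponding PyPI packages.
from importlib.metadata import PackageNotFoundError, version

packages = ["axolotl", "openrlhf", "vllm", "flash-attn", "deepspeed",
            "transformers", "torch"]

for name in packages:
    try:
        print(f"{name}=={version(name)}")
    except PackageNotFoundError:
        print(f"{name}: not installed")
```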