SALMON: Self-Alignment with Instructable Reward Models
Authors: Zhiqing Sun, Yikang Shen, Hongxin Zhang, Qinhong Zhou, Zhenfang Chen, David Daniel Cox, Yiming Yang, Chuang Gan
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Applying our method to the LLaMA-2-70b base language model, we developed an AI assistant named Dromedary-2. With only 6 exemplars for in-context learning and 31 human-defined principles, Dromedary-2 significantly surpasses the performance of several state-of-the-art AI systems, including LLaMA-2-Chat-70b, on various benchmark datasets. |
| Researcher Affiliation | Collaboration | MIT-IBM Watson AI Lab, IBM Research; Language Technologies Institute, CMU; UMass Amherst |
| Pseudocode | No | The paper describes the methodology in text and with diagrams but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | https://github.com/IBM/SALMON |
| Open Datasets | Yes | Self-Align: We use a combination of 90k ShareGPT prompts, 10k prompts from the databricks-dolly-15k dataset (Databricks, 2023), 10k prompts from the OpenAssistant Conversations dataset (Köpf et al., 2023), and 40k prompts sub-sampled from the OpenOrca dataset (Mukherjee et al., 2023; Lian et al., 2023)... |
| Dataset Splits | No | The paper mentions 'held-out RL data' but does not provide specific details on training, validation, or test dataset splits (e.g., percentages or sample counts). |
| Hardware Specification | No | The paper mentions 'to fit all the models (i.e., policy, reward, value, original policy) into one GPU' but does not specify exact GPU models, CPU models, or other detailed hardware specifications used for experiments. |
| Software Dependencies | No | The paper mentions software like QLoRA, PPO, and langdetect but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | We used a batch size of 576 for each PPO step. This comprised two epochs of gradient steps, each having 288 rollouts. We applied a peak learning rate of 2 * 10^-5 with cosine decay. We clipped the gradient by its Euclidean norm at a limit of 1. Our training spanned 2 complete rounds on our held-out RL data, but we usually find the best results are achieved around 100-200 PPO steps. For generalized advantage estimation (GAE; Schulman et al. (2015)), both lambda and gamma were set at 1. We opted for a constant KL regularizer coefficient of 0.02. |
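
The Open Datasets row reports only the prompt mixture counts, not a merged dataset or a sampling script. The sketch below records those counts as a small configuration and shows one plausible way to assemble the pool; the dataset keys and helper functions are illustrative placeholders, not artifacts released by the authors, and the uniform sub-sampling of OpenOrca is an assumption the paper does not confirm.

```python
import random

# Prompt mixture as quoted in the Open Datasets row (counts only).
# Keys are illustrative labels, not official identifiers or file paths.
PROMPT_MIXTURE = {
    "sharegpt": 90_000,
    "databricks-dolly-15k": 10_000,
    "openassistant-conversations": 10_000,
    "openorca": 40_000,  # sub-sampled from the full OpenOrca release
}


def subsample(prompts, k, seed=0):
    """Uniformly sub-sample k prompts from a source list.

    Assumption: the paper does not state how the OpenOrca subset was drawn.
    """
    rng = random.Random(seed)
    return rng.sample(prompts, k) if len(prompts) > k else list(prompts)


def build_prompt_pool(sources):
    """Combine per-source prompt lists according to PROMPT_MIXTURE."""
    pool = []
    for name, budget in PROMPT_MIXTURE.items():
        pool.extend(subsample(sources.get(name, []), budget))
    return pool
```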
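
The Experiment Setup row gives enough detail to collect the reported PPO hyperparameters in one place. The dataclass below is a minimal sketch of that configuration assuming a standard PPO-with-KL-penalty setup; the field names and schema are ours, and anything the paper does not report (optimizer choice, warmup, etc.) is omitted rather than guessed.

```python
from dataclasses import dataclass


@dataclass
class PPOConfig:
    """PPO hyperparameters quoted in the Experiment Setup row.

    Field names are ours; the paper reports the values but not a schema.
    """
    rollout_batch_size: int = 576        # batch size for each PPO step
    gradient_epochs: int = 2             # epochs of gradient steps per PPO step
    rollouts_per_epoch: int = 288        # 2 x 288 = 576
    peak_lr: float = 2e-5                # peak learning rate
    lr_schedule: str = "cosine"          # cosine decay
    max_grad_norm: float = 1.0           # clip gradient by its Euclidean norm
    gae_lambda: float = 1.0              # generalized advantage estimation
    gae_gamma: float = 1.0
    kl_coef: float = 0.02                # constant KL regularizer coefficient
    rl_data_rounds: int = 2              # full passes over the held-out RL data
    best_step_range: tuple = (100, 200)  # where the best results were observed
```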