SALMON: Self-Alignment with Instructable Reward Models
Authors: Zhiqing Sun, Yikang Shen, Hongxin Zhang, Qinhong Zhou, Zhenfang Chen, David Daniel Cox, Yiming Yang, Chuang Gan
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Applying our method to the LLaMA-2-70b base language model, we developed an AI assistant named Dromedary-2. With only 6 exemplars for in-context learning and 31 human-defined principles, Dromedary-2 significantly surpasses the performance of several state-of-the-art AI systems, including LLaMA-2-Chat-70b, on various benchmark datasets. |
| Researcher Affiliation | Collaboration | MIT-IBM Watson AI Lab, IBM Research; Language Technologies Institute, CMU; UMass Amherst |
| Pseudocode | No | The paper describes the methodology in text and with diagrams but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | https://github.com/IBM/SALMON |
| Open Datasets | Yes | Self-Align: We use a combination of 90k ShareGPT prompts, 10k prompts from the databricks-dolly-15k dataset (Databricks, 2023), 10k prompts from the OpenAssistant Conversations dataset (Köpf et al., 2023), and 40k prompts sub-sampled from the OpenOrca dataset (Mukherjee et al., 2023; Lian et al., 2023)... |
| Dataset Splits | No | The paper mentions 'held-out RL data' but does not provide specific details on training, validation, or test dataset splits (e.g., percentages or sample counts). |
| Hardware Specification | No | The paper mentions 'to fit all the models (i.e., policy, reward, value, original policy) into one GPU' but does not specify exact GPU models, CPU models, or other detailed hardware specifications used for experiments. |
| Software Dependencies | No | The paper mentions software like QLoRA, PPO, and langdetect but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | We used a batch size of 576 for each PPO step. This comprised two epochs of gradient steps, each having 288 rollouts. We applied a peak learning rate of 2 * 10^-5 with cosine decay. We clipped the gradient by its Euclidean norm at a limit of 1. Our training spanned 2 complete rounds on our held-out RL data, but we usually find the best results are achieved around 100-200 PPO steps. For generalized advantage estimation (GAE; Schulman et al. (2015)), both lambda and gamma were set at 1. We opted for a constant KL regularizer coefficient of 0.02. |
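
The Open Datasets row reports only the prompt mixture counts, not a merged dataset or a sampling script. The sketch below records those counts as a small configuration and shows one plausible way to assemble the pool; the dataset keys and helper functions are illustrative placeholders, not artifacts released by the authors, and the uniform sub-sampling of OpenOrca is an assumption the paper does not confirm.

```python
import random

# Prompt mixture as quoted in the Open Datasets row (counts only).
# Keys are illustrative labels, not official identifiers or file paths.
PROMPT_MIXTURE = {
    "sharegpt": 90_000,
    "databricks-dolly-15k": 10_000,
    "openassistant-conversations": 10_000,
    "openorca": 40_000,  # sub-sampled from the full OpenOrca release
}


def subsample(prompts, k, seed=0):
    """Uniformly sub-sample k prompts from a source list.

    Assumption: the paper does not state how the OpenOrca subset was drawn.
    """
    rng = random.Random(seed)
    return rng.sample(prompts, k) if len(prompts) > k else list(prompts)


def build_prompt_pool(sources):
    """Combine per-source prompt lists according to PROMPT_MIXTURE."""
    pool = []
    for name, budget in PROMPT_MIXTURE.items():
        pool.extend(subsample(sources.get(name, []), budget))
    return pool
```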
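
The Experiment Setup row gives enough detail to collect the reported PPO hyperparameters in one place. The dataclass below is a minimal sketch of that configuration assuming a standard PPO-with-KL-penalty setup; the field names and schema are ours, and anything the paper does not report (optimizer choice, warmup, etc.) is omitted rather than guessed.

```python
from dataclasses import dataclass


@dataclass
class PPOConfig:
    """PPO hyperparameters quoted in the Experiment Setup row.

    Field names are ours; the paper reports the values but not a schema.
    """
    rollout_batch_size: int = 576        # batch size for each PPO step
    gradient_epochs: int = 2             # epochs of gradient steps per PPO step
    rollouts_per_epoch: int = 288        # 2 x 288 = 576
    peak_lr: float = 2e-5                # peak learning rate
    lr_schedule: str = "cosine"          # cosine decay
    max_grad_norm: float = 1.0           # clip gradient by its Euclidean norm
    gae_lambda: float = 1.0              # generalized advantage estimation
    gae_gamma: float = 1.0
    kl_coef: float = 0.02                # constant KL regularizer coefficient
    rl_data_rounds: int = 2              # full passes over the held-out RL data
    best_step_range: tuple = (100, 200)  # where the best results were observed
```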