Training Socially Aligned Language Models on Simulated Social Interactions

Authors: Ruibo Liu, Ruixin Yang, Chenyan Jia, Ge Zhang, Diyi Yang, Soroush Vosoughi

ICLR 2024

Each reproducibility variable below is listed with its result and the LLM response that supports it.

Research Type: Experimental
Our experiments show that Stable Alignment outperforms existing methods in six alignment benchmarks. Notably, it facilitates easy deployment in resource-constrained settings by removing the need for an additional reward model to provide proximal supervision during training, such as OpenAI's RLHF. We comprehensively assess the trained models, evaluating them against both conventional alignment benchmarks and adversarial attack scenarios. Our results reveal that the inclusion of feedback and revision significantly boosts the models' robustness against jailbreaking prompts (§4.1). Ablation studies further confirm the importance of specialized data preparation for efficient and stable alignment learning.

Researcher Affiliation: Academia
Ruibo Liu (Dartmouth College); Ruixin Yang (University of British Columbia); Chenyan Jia (Stanford University, Northeastern University); Ge Zhang (University of Michigan, Ann Arbor); Diyi Yang (Stanford University); Soroush Vosoughi (Dartmouth College)

Pseudocode: Yes
Appendix A.3 provides the pseudocode for implementing CPO ("Pseudo-code for the Stable Alignment algorithm").

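The paper's own pseudocode lives in its Appendix A.3. Purely as an illustration of the kind of objective described elsewhere in this report (supervised fine-tuning on the highest-rated response plus a penalty on lower-rated ones), the sketch below shows one way such a rating-margin loss could be written in PyTorch. The function name, the mean-log-likelihood inputs, and the margin scaling are assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def rating_margin_loss(logp_best, logp_low, rating_best, rating_low,
                       lam=0.2, margin_scale=1.0):
    """Illustrative Stable-Alignment-style objective (not the paper's exact code).

    logp_best:   (B,)   mean token log-likelihood of the highest-rated response per prompt.
    logp_low:    (B, K) mean token log-likelihoods of K lower-rated responses
                        (the paper's setup uses K = 3 per mini-batch).
    rating_best, rating_low: peer ratings collected from the SANDBOX simulation.
    """
    # SFT term: maximize the likelihood of the best-rated response.
    sft_loss = -logp_best.mean()

    # Contrastive term: the best response should be more likely than each
    # lower-rated one by a margin that grows with the rating gap (assumed scaling).
    margin = margin_scale * (rating_best.unsqueeze(1) - rating_low)
    gap = logp_best.unsqueeze(1) - logp_low
    contrast_loss = F.relu(margin - gap).mean()

    return sft_loss + lam * contrast_loss
```

With λ = 0.2 (the value reported in Section 4.2 and quoted under Experiment Setup below), the contrastive term acts as a regularizer on top of ordinary supervised fine-tuning.
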
Open Source Code: Yes
We introduce SANDBOX, an open-source platform for simulating human society (§3.1). To facilitate peer review and subsequent research, we have included all necessary materials for reproducing Stable Alignment, including data, code, and launching scripts, as supplementary materials accompanying this submission.

Open Datasets: Yes
Our pool of controversial societal questions comprised 9,662 questions sourced from the Anthropic RLHF dataset (Anthropic HH dataset: https://github.com/anthropics/hh-rlhf). We trained our model on the released Stanford Alpaca checkpoint (https://github.com/tatsu-lab/stanford_alpaca) with 8 A100 80G GPUs, using both SFT and Stable Alignment methodologies.

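For readers who want to start from the same question pool, the Anthropic HH data referenced above can be loaded from its Hugging Face Hub mirror; the snippet below is a minimal sketch assuming the `Anthropic/hh-rlhf` Hub identifier. Selecting the 9,662 controversial societal questions is the paper's own filtering step and is not reproduced here.

```python
from datasets import load_dataset

# Anthropic HH-RLHF data (mirrors github.com/anthropics/hh-rlhf on the Hugging Face Hub).
hh = load_dataset("Anthropic/hh-rlhf")

print(hh)                        # train/test splits of paired dialogues
print(hh["train"][0]["chosen"])  # each record has a 'chosen' and a 'rejected' transcript
```
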
Dataset Splits: No
The paper mentions assessing alignment on the "validation set" but does not specify how this validation set was created, its size, or the specific split percentages from their constructed 169k samples, which would be needed for reproducibility.

Hardware Specification: Yes
We trained our model on the released Stanford Alpaca checkpoint with 8 A100 80G GPUs, using both SFT and Stable Alignment methodologies.

Software Dependencies: No
The paper mentions software components and training processes but does not specify version numbers for any libraries, frameworks, or programming languages (e.g., PyTorch, Python, CUDA) that would be needed for reproducibility.

Experiment Setup: Yes
The total training time was approximately 10 hours across two epochs. The initial learning rates for both SFT and Stable Alignment training were set at 2.0e-5, with cosine annealing and a warmup ratio of 0.03. As detailed in Section 4.2, we selected a λ value of 0.2 and a mini-batch size of four, incorporating three low-rating responses in each mini-batch.

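As a rough sketch of the reported optimization settings (initial learning rate 2.0e-5, cosine annealing, warmup ratio 0.03, two epochs), the snippet below wires them up with Hugging Face's cosine schedule. The optimizer choice (AdamW), the stand-in model, and the step counts are assumptions for illustration only.

```python
import torch
from transformers import get_cosine_schedule_with_warmup

# Reported settings: lr 2.0e-5, cosine annealing, warmup ratio 0.03, two epochs.
lr = 2.0e-5
warmup_ratio = 0.03
num_epochs = 2
steps_per_epoch = 1_000                   # placeholder; depends on data and batch size
total_steps = num_epochs * steps_per_epoch

model = torch.nn.Linear(8, 8)             # stand-in for the Alpaca checkpoint being fine-tuned
optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(warmup_ratio * total_steps),
    num_training_steps=total_steps,
)
```

Each training mini-batch would additionally carry one top-rated response and three lower-rated responses per prompt, matching the mini-batch size of four described in Section 4.2.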