Stepwise Alignment for Constrained Language Model Policy Optimization

Authors: Akifumi Wachi, Thien Tran, Rei Sato, Takumi Tanabe, Youhei Akimoto

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experimental results show that SACPO can fine-tune Alpaca-7B better than the state-of-the-art method in terms of both helpfulness and harmlessness.
Researcher Affiliation | Collaboration | LY Corporation; University of Tsukuba; RIKEN AIP; {akifumi.wachi, tran.thien, sato.rei, takumi.tanabe}@lycorp.co.jp; akimoto@cs.tsukuba.ac.jp
Pseudocode | Yes | Algorithm 1: Stepwise Alignment for Constrained Policy Optimization (SACPO)
Open Source Code | Yes | Code and models are available at https://github.com/line/sacpo.
Open Datasets | Yes | We utilize the PKU-SafeRLHF preference dataset [25] with more than 30,000 expert evaluations.
Dataset Splits | Yes | Table 1: Hyper-parameters used in the two stages of our experiment.
Hardware Specification | Yes | Our experiments were conducted on a workstation with Intel(R) Xeon(R) Silver 4316 CPUs @ 2.30GHz and 8 NVIDIA A100-SXM4-80GB GPUs.
Software Dependencies | No | We use TRL [47] for implementing DPO and KTO.
Experiment Setup | Yes | The hyper-parameters used in our experiment for helpfulness and safety (i.e., harmlessness) are summarized in Table 1.
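
The Pseudocode and Software Dependencies rows above point to Algorithm 1 (SACPO) and to DPO/KTO implemented with TRL. The following is a minimal PyTorch sketch of the stepwise idea those rows describe, not the authors' released code (see https://github.com/line/sacpo for that). The helper names (sequence_logps, make_optimizer, helpful_loader, safety_loader) and the beta values are illustrative placeholders.

import copy
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta):
    """Standard DPO objective on sequence-level log-probabilities of preference pairs."""
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()


def align_stage(model, ref_model, loader, beta, optimizer, sequence_logps):
    """One direct-alignment stage: optimize `model` against a frozen `ref_model`."""
    ref_model.eval()
    for batch in loader:  # each batch holds (prompt, chosen, rejected) triples
        with torch.no_grad():
            ref_c, ref_r = sequence_logps(ref_model, batch)
        pol_c, pol_r = sequence_logps(model, batch)
        loss = dpo_loss(pol_c, pol_r, ref_c, ref_r, beta)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return model


def stepwise_alignment(sft_model, helpful_loader, safety_loader,
                       make_optimizer, sequence_logps,
                       beta_reward=0.1, beta_safety=0.025):  # betas are illustrative
    # Stage 1: align the SFT model for helpfulness (reward preferences).
    stage1 = copy.deepcopy(sft_model)
    align_stage(stage1, sft_model, helpful_loader,
                beta_reward, make_optimizer(stage1), sequence_logps)
    # Stage 2: align the stage-1 model for harmlessness (safety preferences),
    # with the frozen stage-1 model serving as the new reference policy.
    stage2 = copy.deepcopy(stage1)
    align_stage(stage2, stage1, safety_loader,
                beta_safety, make_optimizer(stage2), sequence_logps)
    return stage2

The structural point mirrored here is that the second (safety) stage treats the helpfulness-aligned model as its reference policy, which is what makes the alignment stepwise rather than a single joint constrained objective; consult Algorithm 1 in the paper for the exact procedure and constraint handling.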
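
The Open Datasets row cites the PKU-SafeRLHF preference dataset. Below is a minimal loading sketch, assuming the dataset is the one published on the Hugging Face Hub under the ID PKU-Alignment/PKU-SafeRLHF (an assumption to verify against the official dataset card).

from datasets import load_dataset

# Assumed Hub ID for the dataset cited in the paper; check the dataset card
# before relying on the split names or column schema.
dataset = load_dataset("PKU-Alignment/PKU-SafeRLHF")
print(dataset)                    # available splits and row counts
print(dataset["train"].features)  # preference and safety-label columns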