Stepwise Alignment for Constrained Language Model Policy Optimization
Authors: Akifumi Wachi, Thien Tran, Rei Sato, Takumi Tanabe, Youhei Akimoto
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental results show that SACPO can fine-tune Alpaca-7B better than the state-of-the-art method in terms of both helpfulness and harmlessness. |
| Researcher Affiliation | Collaboration | LY Corporation; University of Tsukuba; RIKEN AIP. {akifumi.wachi, tran.thien, sato.rei, takumi.tanabe}@lycorp.co.jp; akimoto@cs.tsukuba.ac.jp |
| Pseudocode | Yes | Algorithm 1 Stepwise Alignment for Constrained Policy Optimization (SACPO) |
| Open Source Code | Yes | Code and models are available at https://github.com/line/sacpo. |
| Open Datasets | Yes | We utilize the PKU-SafeRLHF preference dataset [25] with more than 30,000 expert evaluations. |
| Dataset Splits | Yes | Table 1: Hyper-parameters used in the two stages of our experiment. |
| Hardware Specification | Yes | Our experiments were conducted in a workstation with Intel(R) Xeon(R) Silver 4316 CPUs@2.30GHz and 8 NVIDIA A100-SXM4-80GB GPUs. |
| Software Dependencies | No | We use TRL [47] for implementing DPO and KTO. |
| Experiment Setup | Yes | The hyper-parameters used in our experiment for helpfulness and safety (i.e., harmlessness) are summarized in Table 1. |
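
The rows above note that the authors implement DPO and KTO with TRL and align the model stepwise (helpfulness first, then harmlessness) on PKU-SafeRLHF preference data. The snippet below is a minimal sketch of that stepwise pattern using TRL's `DPOTrainer`; the model name, data files, hyper-parameter values, and two-stage setup shown here are illustrative assumptions rather than the authors' released configuration, and depending on the TRL version the tokenizer is passed as `tokenizer` or `processing_class`.

```python
# Sketch of stepwise DPO alignment with TRL (assumptions, not the SACPO release).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "PKU-Alignment/alpaca-7b-reproduced"  # placeholder SFT starting point
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Preference data with "prompt", "chosen", "rejected" columns, e.g. derived from
# PKU-SafeRLHF; the file names below are placeholders.
helpful_ds = load_dataset("json", data_files="helpfulness_prefs.json")["train"]
harmless_ds = load_dataset("json", data_files="harmlessness_prefs.json")["train"]

def align(model, dataset, beta, output_dir):
    """Run one DPO stage against the given preference dataset."""
    args = DPOConfig(output_dir=output_dir, beta=beta,
                     per_device_train_batch_size=4)
    # With ref_model left unset, DPOTrainer copies the passed model as the
    # reference, so each stage is regularized toward the previous stage's output.
    trainer = DPOTrainer(model=model, args=args, train_dataset=dataset,
                         tokenizer=tokenizer)  # `processing_class=` in newer TRL
    trainer.train()
    return trainer.model

# Stage 1: align for helpfulness; Stage 2: realign the result for harmlessness.
model = align(model, helpful_ds, beta=0.1, output_dir="stage1-helpful")
model = align(model, harmless_ds, beta=0.025, output_dir="stage2-harmless")
```

The beta values and batch size are arbitrary placeholders; the paper's actual hyper-parameters for the two stages are listed in its Table 1.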