Stepwise Alignment for Constrained Language Model Policy Optimization

Authors: Akifumi Wachi, Thien Tran, Rei Sato, Takumi Tanabe, Youhei Akimoto

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experimental results show that SACPO can fine-tune Alpaca-7B better than the state-of-the-art method in terms of both helpfulness and harmlessness.
Researcher Affiliation | Collaboration | LY Corporation; University of Tsukuba; RIKEN AIP; {akifumi.wachi, tran.thien, sato.rei, takumi.tanabe}@lycorp.co.jp; akimoto@cs.tsukuba.ac.jp
Pseudocode | Yes | Algorithm 1: Stepwise Alignment for Constrained Policy Optimization (SACPO)
Open Source Code | Yes | Code and models are available at https://github.com/line/sacpo.
Open Datasets | Yes | We utilize the PKU-SafeRLHF preference dataset [25] with more than 30,000 expert evaluations.
Dataset Splits | Yes | Table 1: Hyper-parameters used in the two stages of our experiment.
Hardware Specification | Yes | Our experiments were conducted on a workstation with Intel(R) Xeon(R) Silver 4316 CPUs @ 2.30GHz and 8 NVIDIA A100-SXM4-80GB GPUs.
Software Dependencies | No | We use TRL [47] for implementing DPO and KTO.
Experiment Setup | Yes | The hyper-parameters used in our experiment for helpfulness and safety (i.e., harmlessness) are summarized in Table 1.
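
The Pseudocode and Software Dependencies rows above point to Algorithm 1 (SACPO) and to DPO/KTO implemented with TRL. The following is a minimal PyTorch sketch of the stepwise idea those rows describe, not the authors' released code (see https://github.com/line/sacpo for that). The helper names (sequence_logps, make_optimizer, helpful_loader, safety_loader) and the beta values are illustrative placeholders.

import copy
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta):
    """Standard DPO objective on sequence-level log-probabilities of preference pairs."""
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()


def align_stage(model, ref_model, loader, beta, optimizer, sequence_logps):
    """One direct-alignment stage: optimize `model` against a frozen `ref_model`."""
    ref_model.eval()
    for batch in loader:  # each batch holds (prompt, chosen, rejected) triples
        with torch.no_grad():
            ref_c, ref_r = sequence_logps(ref_model, batch)
        pol_c, pol_r = sequence_logps(model, batch)
        loss = dpo_loss(pol_c, pol_r, ref_c, ref_r, beta)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return model


def stepwise_alignment(sft_model, helpful_loader, safety_loader,
                       make_optimizer, sequence_logps,
                       beta_reward=0.1, beta_safety=0.025):  # betas are illustrative
    # Stage 1: align the SFT model for helpfulness (reward preferences).
    stage1 = copy.deepcopy(sft_model)
    align_stage(stage1, sft_model, helpful_loader,
                beta_reward, make_optimizer(stage1), sequence_logps)
    # Stage 2: align the stage-1 model for harmlessness (safety preferences),
    # with the frozen stage-1 model serving as the new reference policy.
    stage2 = copy.deepcopy(stage1)
    align_stage(stage2, stage1, safety_loader,
                beta_safety, make_optimizer(stage2), sequence_logps)
    return stage2

The structural point mirrored here is that the second (safety) stage treats the helpfulness-aligned model as its reference policy, which is what makes the alignment stepwise rather than a single joint constrained objective; consult Algorithm 1 in the paper for the exact procedure and constraint handling.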
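
The Open Datasets row cites the PKU-SafeRLHF preference dataset. Below is a minimal loading sketch, assuming the dataset is the one published on the Hugging Face Hub under the ID PKU-Alignment/PKU-SafeRLHF (an assumption to verify against the official dataset card).

from datasets import load_dataset

# Assumed Hub ID for the dataset cited in the paper; check the dataset card
# before relying on the split names or column schema.
dataset = load_dataset("PKU-Alignment/PKU-SafeRLHF")
print(dataset)                    # available splits and row counts
print(dataset["train"].features)  # preference and safety-label columns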