One-Shot Safety Alignment for Large Language Models via Optimal Dualization

Authors: Xinmeng Huang, Shuo Li, Edgar Dobriban, Osbert Bastani, Hamed Hassani, Dongsheng Ding

NeurIPS 2024 | Conference PDF | Archive PDF

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | A broad range of experiments demonstrate the effectiveness and merits of our algorithms. We conduct extensive experiments to demonstrate the effectiveness of our proposed methods.
Researcher Affiliation | Academia | University of Pennsylvania. Emails: xinmengh@sas.upenn.edu, lishuo1@seas.upenn.edu, dobriban@wharton.upenn.edu, obastani@seas.upenn.edu, hassani@seas.upenn.edu, dongshed@seas.upenn.edu
Pseudocode | Yes | Algorithm 1 MOCAN: Model-based Constrained Alignment via dualizatioN. Algorithm 2 PECAN: Preference-based Constrained Alignment via dualizatioN. Algorithm 3 PECAN with varying KL regularization in pre-alignment. (A hedged sketch of the dualization step appears after this table.)
Open Source Code | Yes | The source code is available at https://github.com/shuoli90/CAN
Open Datasets | Yes | We use the PKU-SafeRLHF-30K preference dataset [20].
Dataset Splits | Yes | We use the PKU-SafeRLHF-30K preference dataset [20], which contains approximately 27,000 training and 3,000 testing expert evaluations. (A loading sketch also follows the table.)
Hardware Specification | Yes | In practice, our experiments are conducted on a single 48GB NVIDIA A6000 GPU.
Software Dependencies | No | The paper mentions software components such as the PEFT strategy LoRA and the optimizer paged_adamw_32bit in its hyperparameter tables, but it does not specify version numbers for these or for other key dependencies such as Python, PyTorch, or the underlying libraries used for implementation. (A configuration sketch follows the table.)
Experiment Setup | Yes | See Tables 1, 2, and 3 for the training-related hyperparameters.
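
The algorithms named in the Pseudocode row all revolve around a single dualization step: the KL-regularized, safety-constrained alignment problem admits a convex dual in the multiplier, which can be minimized once ("one shot") before any fine-tuning run. The sketch below is a minimal illustration of that idea, not the authors' implementation: the toy scores, the array shapes, and the function names are all assumptions made for the example.

```python
# A minimal sketch of the "optimal dualization" idea behind MOCAN, assuming
# precomputed reward scores r[i, j] and safety scores g[i, j] for the j-th
# reference-model sample of prompt i. All data here is a toy placeholder.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import logsumexp

def dual_value(lam, r, g, b, beta):
    """Empirical dual of the KL-regularized, safety-constrained objective:
    beta * E_x[log E_{y~ref} exp((r + lam*g)/beta)] - lam*b."""
    n_samples = r.shape[1]
    # Log-mean-exp over samples per prompt approximates the inner expectation.
    inner = beta * (logsumexp((r + lam * g) / beta, axis=1) - np.log(n_samples))
    return inner.mean() - lam * b

# Toy scores: 100 prompts, 8 reference-model samples each (placeholder data).
rng = np.random.default_rng(0)
r = rng.normal(size=(100, 8))   # helpfulness reward scores
g = rng.normal(size=(100, 8))   # safety scores (constraint: E[g] >= b)
b, beta = 0.0, 1.0              # safety threshold and KL-regularization strength

# The dual is convex in lam >= 0, so a one-dimensional bounded search suffices.
res = minimize_scalar(lambda lam: dual_value(lam, r, g, b, beta),
                      bounds=(0.0, 100.0), method="bounded")
lam_star = res.x
# A single fine-tuning run with the combined reward r + lam_star * g follows.
print(f"optimal dual variable lambda* = {lam_star:.4f}")
```

Under this reading, the "one-shot" property comes from the fact that the multiplier search is a cheap scalar optimization over precomputed scores, so no repeated constrained RLHF runs are needed.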
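For the Open Datasets and Dataset Splits rows, the splits can be inspected directly with the Hugging Face `datasets` library. The hub identifier below is an assumption inferred from the dataset name, not confirmed by the paper.

```python
# A hedged sketch of loading the PKU-SafeRLHF-30K splits; the hub ID is an
# assumption based on the dataset name used in the paper.
from datasets import load_dataset

ds = load_dataset("PKU-Alignment/PKU-SafeRLHF-30K")  # assumed hub identifier
# Expected: roughly 27,000 'train' and 3,000 'test' expert evaluations.
print(ds["train"].num_rows, ds["test"].num_rows)
```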
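The Software Dependencies row names two concrete components, the PEFT strategy LoRA and the optimizer paged_adamw_32bit. A plausible way to wire these together with the `peft` and `transformers` libraries is sketched below; the rank, alpha, target modules, and learning rate are placeholders, not the paper's settings (those live in its Tables 1-3).

```python
# A minimal sketch of the hinted software stack (PEFT LoRA + paged_adamw_32bit).
# All numeric values and target modules are placeholders for illustration.
from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=16,                                 # placeholder LoRA rank
    lora_alpha=32,                        # placeholder scaling factor
    target_modules=["q_proj", "v_proj"],  # placeholder attention projections
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="out",
    optim="paged_adamw_32bit",  # the paged AdamW optimizer named in the paper
    learning_rate=1e-4,         # placeholder; see the paper's Tables 1-3
    num_train_epochs=1,
)
```

Pinning exact versions of `peft`, `transformers`, `bitsandbytes` (required for the paged optimizer), and PyTorch, for example via a frozen requirements file, is what the "No" verdict in that row indicates is missing.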