One-Shot Safety Alignment for Large Language Models via Optimal Dualization

Authors: Xinmeng Huang, Shuo Li, Edgar Dobriban, Osbert Bastani, Hamed Hassani, Dongsheng Ding

NeurIPS 2024 | Conference PDF | Archive PDF

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | A broad range of experiments demonstrate the effectiveness and merits of our algorithms. We conduct extensive experiments to demonstrate the effectiveness of our proposed methods.
Researcher Affiliation | Academia | University of Pennsylvania. Emails: xinmengh@sas.upenn.edu, lishuo1@seas.upenn.edu, dobriban@wharton.upenn.edu, obastani@seas.upenn.edu, hassani@seas.upenn.edu, dongshed@seas.upenn.edu
Pseudocode | Yes | Algorithm 1 MOCAN: Model-based Constrained Alignment via dualizatioN. Algorithm 2 PECAN: Preference-based Constrained Alignment via dualizatioN. Algorithm 3 PECAN with varying KL regularization in pre-alignment. (A hedged sketch of the dualization step appears after this table.)
Open Source Code | Yes | The source code is available at https://github.com/shuoli90/CAN
Open Datasets | Yes | We use the PKU-SafeRLHF-30K preference dataset [20].
Dataset Splits | Yes | We use the PKU-SafeRLHF-30K preference dataset [20], which contains approximately 27,000 training and 3,000 testing expert evaluations. (A loading sketch also follows the table.)
Hardware Specification | Yes | In practice, our experiments are conducted on a single 48GB NVIDIA A6000 GPU.
Software Dependencies | No | The paper mentions software components such as the PEFT strategy LoRA and the optimizer paged_adamw_32bit in its hyperparameter tables, but it does not specify version numbers for these or for other key dependencies such as Python, PyTorch, or the underlying libraries used for implementation. (A configuration sketch follows the table.)
Experiment Setup | Yes | See Tables 1, 2, and 3 for the training-related hyperparameters.
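
The algorithms named in the Pseudocode row all revolve around a single dualization step: the KL-regularized, safety-constrained alignment problem admits a convex dual in the multiplier, which can be minimized once ("one shot") before any fine-tuning run. The sketch below is a minimal illustration of that idea, not the authors' implementation: the toy scores, the array shapes, and the function names are all assumptions made for the example.

```python
# A minimal sketch of the "optimal dualization" idea behind MOCAN, assuming
# precomputed reward scores r[i, j] and safety scores g[i, j] for the j-th
# reference-model sample of prompt i. All data here is a toy placeholder.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import logsumexp

def dual_value(lam, r, g, b, beta):
    """Empirical dual of the KL-regularized, safety-constrained objective:
    beta * E_x[log E_{y~ref} exp((r + lam*g)/beta)] - lam*b."""
    n_samples = r.shape[1]
    # Log-mean-exp over samples per prompt approximates the inner expectation.
    inner = beta * (logsumexp((r + lam * g) / beta, axis=1) - np.log(n_samples))
    return inner.mean() - lam * b

# Toy scores: 100 prompts, 8 reference-model samples each (placeholder data).
rng = np.random.default_rng(0)
r = rng.normal(size=(100, 8))   # helpfulness reward scores
g = rng.normal(size=(100, 8))   # safety scores (constraint: E[g] >= b)
b, beta = 0.0, 1.0              # safety threshold and KL-regularization strength

# The dual is convex in lam >= 0, so a one-dimensional bounded search suffices.
res = minimize_scalar(lambda lam: dual_value(lam, r, g, b, beta),
                      bounds=(0.0, 100.0), method="bounded")
lam_star = res.x
# A single fine-tuning run with the combined reward r + lam_star * g follows.
print(f"optimal dual variable lambda* = {lam_star:.4f}")
```

Under this reading, the "one-shot" property comes from the fact that the multiplier search is a cheap scalar optimization over precomputed scores, so no repeated constrained RLHF runs are needed.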
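For the Open Datasets and Dataset Splits rows, the splits can be inspected directly with the Hugging Face `datasets` library. The hub identifier below is an assumption inferred from the dataset name, not confirmed by the paper.

```python
# A hedged sketch of loading the PKU-SafeRLHF-30K splits; the hub ID is an
# assumption based on the dataset name used in the paper.
from datasets import load_dataset

ds = load_dataset("PKU-Alignment/PKU-SafeRLHF-30K")  # assumed hub identifier
# Expected: roughly 27,000 'train' and 3,000 'test' expert evaluations.
print(ds["train"].num_rows, ds["test"].num_rows)
```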
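The Software Dependencies row names two concrete components, the PEFT strategy LoRA and the optimizer paged_adamw_32bit. A plausible way to wire these together with the `peft` and `transformers` libraries is sketched below; the rank, alpha, target modules, and learning rate are placeholders, not the paper's settings (those live in its Tables 1-3).

```python
# A minimal sketch of the hinted software stack (PEFT LoRA + paged_adamw_32bit).
# All numeric values and target modules are placeholders for illustration.
from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=16,                                 # placeholder LoRA rank
    lora_alpha=32,                        # placeholder scaling factor
    target_modules=["q_proj", "v_proj"],  # placeholder attention projections
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="out",
    optim="paged_adamw_32bit",  # the paged AdamW optimizer named in the paper
    learning_rate=1e-4,         # placeholder; see the paper's Tables 1-3
    num_train_epochs=1,
)
```

Pinning exact versions of `peft`, `transformers`, `bitsandbytes` (required for the paged optimizer), and PyTorch, for example via a frozen requirements file, is what the "No" verdict in that row indicates is missing.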