One-Shot Safety Alignment for Large Language Models via Optimal Dualization
Authors: Xinmeng Huang, Shuo Li, Edgar Dobriban, Osbert Bastani, Hamed Hassani, Dongsheng Ding
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | A broad range of experiments demonstrate the effectiveness and merits of our algorithms. We conduct extensive experiments to demonstrate the effectiveness of our proposed methods. |
| Researcher Affiliation | Academia | University of Pennsylvania. xinmengh@sas.upenn.edu, lishuo1@seas.upenn.edu, dobriban@wharton.upenn.edu, obastani@seas.upenn.edu, hassani@seas.upenn.edu, dongshed@seas.upenn.edu |
| Pseudocode | Yes | Algorithm 1 MOCAN: Model-based Constrained Alignment via dualizatioN. Algorithm 2 PECAN: Preference-based Constrained Alignment via dualizatioN. Algorithm 3 PECAN with varying KL regularization in pre-alignment. |
| Open Source Code | Yes | The source code is available here.2 [footnote 2: https://github.com/shuoli90/CAN] |
| Open Datasets | Yes | We use the PKU-SafeRLHF-30K preference dataset [20] |
| Dataset Splits | Yes | We use the PKU-SafeRLHF-30K preference dataset [20], which contains approximately 27,000 training and 3,000 testing expert evaluations. |
| Hardware Specification | Yes | In practice, our experiments are conducted on a single 48G NVIDIA A6000 GPU |
| Software Dependencies | No | The paper mentions software components such as the 'PEFT strategy LoRA' and the 'Optimizer paged_adamw_32bit' in its hyperparameters table, but it does not specify version numbers for these or for other key software dependencies such as Python, PyTorch, or the underlying libraries used for implementation. |
| Experiment Setup | Yes | See Tables 1, 2, and 3 for the training-related hyper-parameters. |