Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
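The validation step mentioned above, comparing automated LLM labels against a manually labeled reference set, can be sketched as follows. This is a minimal illustration, not the pipeline from [1]; the function name, input format, and metrics are assumptions for the example.

```python
from collections import Counter

def validation_metrics(llm_labels, manual_labels):
    """Compare automated (LLM) labels against manually assigned labels.

    Both inputs are dicts mapping item id -> label string (hypothetical
    format). Returns overall agreement and per-pair disagreement counts.
    """
    shared = set(llm_labels) & set(manual_labels)
    agree = sum(1 for k in shared if llm_labels[k] == manual_labels[k])
    accuracy = agree / len(shared) if shared else 0.0
    # Counting (manual, llm) disagreement pairs helps spot systematic bias,
    # e.g. the LLM over-predicting "Yes" for a particular variable.
    disagreements = Counter(
        (manual_labels[k], llm_labels[k])
        for k in shared
        if llm_labels[k] != manual_labels[k]
    )
    return accuracy, disagreements

llm = {"r1": "Yes", "r2": "No", "r3": "Yes"}
manual = {"r1": "Yes", "r2": "Yes", "r3": "Yes"}
acc, diff = validation_metrics(llm, manual)
# acc == 2/3; diff records one ("Yes", "No") disagreement
```

Reported scores would then be interpreted with this agreement rate in mind, as the notice suggests.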

One-Shot Safety Alignment for Large Language Models via Optimal Dualization

Authors: Xinmeng Huang, Shuo Li, Edgar Dobriban, Osbert Bastani, Hamed Hassani, Dongsheng Ding

NeurIPS 2024 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | A broad range of experiments demonstrate the effectiveness and merits of our algorithms. We conduct extensive experiments to demonstrate the effectiveness of our proposed methods.
Researcher Affiliation | Academia | University of Pennsylvania.
Pseudocode | Yes | Algorithm 1 MOCAN: Model-based Constrained Alignment via dualizatioN. Algorithm 2 PECAN: Preference-based Constrained Alignment via dualizatioN. Algorithm 3: PECAN with varying KL regularization in pre-alignment.
Open Source Code | Yes | The source code is available here. [footnote 2: https://github.com/shuoli90/CAN]
Open Datasets | Yes | We use the PKU-SafeRLHF-30K preference dataset [20]
Dataset Splits | Yes | We use the PKU-SafeRLHF-30K preference dataset [20], which contains approximately 27,000 training and 3,000 testing expert evaluations.
Hardware Specification | Yes | In practice, our experiments are conducted on a single 48 GB NVIDIA A6000 GPU
Software Dependencies | No | The paper lists software components such as the LoRA PEFT strategy and the paged_adamw_32bit optimizer in its hyperparameter table, but it does not give version numbers for these or for key dependencies such as Python, PyTorch, or other underlying implementation libraries.
Experiment Setup | Yes | See Tables 1, 2, and 3 for the training-related hyper-parameters.