Aligning Large Language Models with Representation Editing: A Control Perspective
Authors: Lingkai Kong, Haorui Wang, Wenhao Mu, Yuanqi Du, Yuchen Zhuang, Yifei Zhou, Yue Song, Rongzhi Zhang, Kai Wang, Chao Zhang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate that our method outperforms existing test-time alignment techniques while requiring significantly fewer resources compared to fine-tuning methods. |
| Researcher Affiliation | Academia | Lingkai Kong¹, Haorui Wang¹, Wenhao Mu¹, Yuanqi Du², Yuchen Zhuang¹, Yifei Zhou³, Yue Song⁴, Rongzhi Zhang¹, Kai Wang¹, Chao Zhang¹ (¹Georgia Tech, ²Cornell University, ³UC Berkeley, ⁴University of Trento) |
| Pseudocode | No | No structured pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | Yes | Our code is available at https://github.com/Lingkai-Kong/RE-Control. |
| Open Datasets | Yes | We evaluate our method on the HH-RLHF [5] and Stanford Human Preferences (SHP) [21] datasets, which are popular for LLM alignment. |
| Dataset Splits | Yes | We randomly sample 1000 data points from the training set as a separate validation set to select the hyperparameters, the step size α and the number of updates n, based on the sum of coherence, diversity, and average reward. |
| Hardware Specification | Yes | We conduct our experiments on a server equipped with NVIDIA A100 (80GB VRAM) GPUs. |
| Software Dependencies | Yes | We utilize the NVIDIA CUDA toolkit version 12.4. All experiments are implemented using Python 3.12.2 and the PyTorch framework, version 2.2.2. |
| Experiment Setup | Yes | The training hyperparameters of the value networks are summarized in Table 3. The inference parameters are summarized in Table 4. Table 6 provides training hyperparameters for proximal policy optimization (PPO) and Table 7 for Direct Preference Optimization (DPO). |
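
The Dataset Splits and Experiment Setup rows above refer to a value network trained on the model's hidden states and to two test-time hyperparameters, the step size α and the number of updates n. The snippet below is a minimal sketch of what such a value-guided representation edit could look like, assuming the intervention is n gradient-ascent steps on a hidden state with step size α; the network architecture, function names, and loop structure are illustrative assumptions, not the authors' released RE-Control implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn


class ValueNetwork(nn.Module):
    """Small MLP that assigns a scalar value to a hidden state.

    Stand-in for the value network whose training hyperparameters the
    paper reports in Table 3; the architecture here is an assumption.
    """

    def __init__(self, hidden_size: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
        return self.net(hidden_state).squeeze(-1)


def edit_hidden_state(hidden_state: torch.Tensor,
                      value_net: ValueNetwork,
                      step_size: float,
                      num_updates: int) -> torch.Tensor:
    """Run `num_updates` gradient-ascent steps on the hidden state with
    step size `step_size` (the α and n tuned on the validation split),
    pushing the representation toward higher predicted value."""
    edited = hidden_state.detach().clone()
    for _ in range(num_updates):
        edited.requires_grad_(True)
        value = value_net(edited).sum()
        (grad,) = torch.autograd.grad(value, edited)
        edited = (edited + step_size * grad).detach()
    return edited


if __name__ == "__main__":
    # Toy example: edit a random hidden state of dimension 4096.
    hidden_size = 4096
    value_net = ValueNetwork(hidden_size)
    h = torch.randn(1, hidden_size)
    h_edited = edit_hidden_state(h, value_net, step_size=0.5, num_updates=5)
    print(value_net(h).item(), value_net(h_edited).item())
```

In a sketch like this, `hidden_state` would be the decoder's hidden representation at the current decoding step, and the edited state would replace it before the language-model head computes the next-token logits; Tables 3 and 4 in the paper supply the actual training and inference settings.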