Aligning Large Language Models with Representation Editing: A Control Perspective

Authors: Lingkai Kong, Haorui Wang, Wenhao Mu, Yuanqi Du, Yuchen Zhuang, Yifei Zhou, Yue Song, Rongzhi Zhang, Kai Wang, Chao Zhang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments demonstrate that our method outperforms existing test-time alignment techniques while requiring significantly fewer resources compared to fine-tuning methods.
Researcher Affiliation | Academia | Lingkai Kong 1, Haorui Wang 1, Wenhao Mu 1, Yuanqi Du 2, Yuchen Zhuang 1, Yifei Zhou 3, Yue Song 4, Rongzhi Zhang 1, Kai Wang 1, Chao Zhang 1 (1 Georgia Tech, 2 Cornell University, 3 UC Berkeley, 4 University of Trento)
Pseudocode | No | No structured pseudocode or algorithm blocks were found in the paper.
Open Source Code | Yes | Our code is available at https://github.com/Lingkai-Kong/RE-Control.
Open Datasets | Yes | We evaluate our method on the HH-RLHF [5] and Stanford Human Preferences (SHP) [21] datasets, which are popular for LLM alignment.
Dataset Splits | Yes | We randomly sample 1000 data points from the training set as a separate validation set, which we use to select the hyperparameters (the step size α and the number of updates n) based on the sum of coherence, diversity, and average reward; a sketch of the test-time update these hyperparameters control appears after this table.
Hardware Specification | Yes | We conduct our experiments on a server equipped with NVIDIA A100 (80GB VRAM) GPUs.
Software Dependencies | Yes | We use NVIDIA CUDA toolkit version 12.4. All experiments are implemented in Python 3.12.2 with the PyTorch framework, version 2.2.2.
Experiment Setup | Yes | The training hyperparameters of the value networks are summarized in Table 3, and the inference parameters in Table 4. Table 6 provides training hyperparameters for proximal policy optimization (PPO) and Table 7 for direct preference optimization (DPO); illustrative sketches of the value-network setup and the editing update follow below.
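The Experiment Setup row refers to value networks trained on the base model's hidden states. Below is a minimal sketch, not the authors' released code, of that setup: a small MLP regressed onto reward scores of hidden states with an MSE loss. The architecture, learning rate, data sizes, and epoch count are illustrative placeholders rather than the values reported in Table 3.

```python
# Sketch of a value network over LLM hidden states (placeholder hyperparameters).
import torch
import torch.nn as nn

class ValueNetwork(nn.Module):
    """Maps a final-layer hidden state to a scalar value estimate."""
    def __init__(self, hidden_dim: int, inner_dim: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, inner_dim),
            nn.ReLU(),
            nn.Linear(inner_dim, 1),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.net(h).squeeze(-1)

hidden_dim = 4096                       # placeholder for the base LLM's hidden size
value_net = ValueNetwork(hidden_dim)
optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

# Synthetic stand-ins for (hidden state, reward-model score) training pairs.
states = torch.randn(256, hidden_dim)
rewards = torch.randn(256)

for _ in range(10):                     # number of epochs is a placeholder
    optimizer.zero_grad()
    loss = loss_fn(value_net(states), rewards)
    loss.backward()
    optimizer.step()
```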
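As we read the method, inference-time alignment is performed by editing the hidden representation with gradient ascent on the learned value function, which appears to be what the step size α and the number of updates n tuned on the validation split parameterize. The sketch below shows that update in isolation, assuming a trained value network like the one above; wiring it into an actual LLM decoding loop (intervening on the hidden state before the language-model head at each step) is omitted, and alpha = 0.5 and n_updates = 3 are arbitrary example values, not the tuned ones.

```python
# Sketch of a test-time hidden-state edit: h <- h + alpha * grad_h V(h), repeated n times.
import torch
import torch.nn as nn

hidden_dim = 4096                                   # placeholder hidden size
value_net = nn.Sequential(                          # untrained stand-in value network
    nn.Linear(hidden_dim, 1024),
    nn.ReLU(),
    nn.Linear(1024, 1),
)

def edit_hidden_state(h: torch.Tensor, alpha: float = 0.5, n_updates: int = 3) -> torch.Tensor:
    """Nudge the hidden state along the value gradient for n_updates steps."""
    h = h.detach().clone()
    for _ in range(n_updates):
        h.requires_grad_(True)
        value = value_net(h).sum()                  # scalar objective over the batch
        (grad,) = torch.autograd.grad(value, h)
        h = (h + alpha * grad).detach()
    return h

# alpha and n_updates would be selected on the held-out validation split
# using the sum of coherence, diversity, and average reward described above.
h_edited = edit_hidden_state(torch.randn(1, hidden_dim))
```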