A Theoretical Understanding of Self-Correction through In-context Alignment

Authors: Yifei Wang, Yuyang Wu, Zeming Wei, Stefanie Jegelka, Yisen Wang

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate these findings extensively on synthetic datasets. Inspired by these findings, we propose a simple self-correction strategy, Checking as Context (CaC), which finds novel applications in alleviating social bias and defending against LLM jailbreaks. We believe that these findings will inspire further research on understanding, exploiting, and enhancing self-correction for building better foundation models. Code is at https://github.com/yifeiwang77/Self-Correction. (A hedged sketch of the CaC loop appears after this table.)
Researcher Affiliation | Academia | 1 MIT CSAIL; 2 School of EECS, Peking University; 3 School of Mathematical Sciences, Peking University; 4 State Key Lab of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University; 5 CIT, MCML, MDSI, TU Munich; 6 Institute for Artificial Intelligence, Peking University
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code is at https://github.com/yifeiwang77/Self-Correction.
Open Datasets | Yes | Following Ganguli et al. [21], we study the use of self-correction to alleviate societal biases in LLMs on the BBQ (Bias Benchmark for QA) benchmark [55]... We observe that for LLM jailbreaks, self-correction can give accurate self-checking most of the time (close to 100%). As a result, from Table 1, we observe that on AdvBench [89], CaC-based self-correction can indeed improve LLM safety a lot...
Dataset Splits | No | The paper does not provide explicit training/validation/test dataset splits (percentages, sample counts, or citations to predefined splits) for the models being evaluated, and the proposed method itself is inference-only; the paper only describes sampling questions for evaluation.
Hardware Specification | Yes | all models are trained using one NVIDIA 3090 GPU. ... All experiments are conducted using one NVIDIA A100 GPU.
Software Dependencies | No | The paper mentions models such as GPT-2, Vicuna-7b, Llama2-7b-chat, and the Qwen-1.5 series, but does not specify software dependencies with version numbers (e.g., Python, PyTorch, or TensorFlow versions).
Experiment Setup | Yes | By default, we set d = 5, N = 20 and use a 20-layer GPT-2 model with 3 heads, 96 hidden dimensions, and a PL loss (Eq. (5)). Then we evaluate the normalized MSE between the predicted output ŷ and the ground-truth y = Wx using varying numbers of in-context examples, averaged over 256 runs with randomly generated tasks. ... we set the batch size = 256, lr = 0.0001 and train steps = 1500; all models are trained using one NVIDIA 3090 GPU. (A hedged sketch of this synthetic evaluation appears after this table.)
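
To make the Checking-as-Context idea referenced in the Research Type row concrete, the sketch below shows one way a CaC-style self-correction loop could be wired up: the model first answers, then critiques its own answer, and the critique is kept in the context when the answer is regenerated. The `generate` helper and the prompt wording are placeholders assumed for illustration, not the implementation from the paper's repository.

```python
# Minimal sketch of a Checking-as-Context (CaC) style self-correction loop.
# `generate` and the prompt wording are placeholders (assumptions), not the
# authors' released code at https://github.com/yifeiwang77/Self-Correction.

def generate(messages):
    """Placeholder for a chat-model call (e.g., an OpenAI- or HF-style chat API)."""
    raise NotImplementedError

def cac_self_correct(question, rounds=1):
    """Answer, self-check, then regenerate with the check kept in context."""
    messages = [{"role": "user", "content": question}]
    answer = generate(messages)
    for _ in range(rounds):
        # Ask the model to critique its own answer (self-checking).
        messages += [
            {"role": "assistant", "content": answer},
            {"role": "user", "content": "Check the answer above for harmful "
                                        "content or bias, and point out any issues."},
        ]
        check = generate(messages)
        # Keep the critique in context and ask for a revised answer.
        messages += [
            {"role": "assistant", "content": check},
            {"role": "user", "content": "Considering your check, give an improved "
                                        "answer to the original question."},
        ]
        answer = generate(messages)
    return answer
```

In this reading, the self-check is never discarded; it stays in the prompt as extra in-context evidence, which is the "checking as context" mechanism the paper names.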
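
The synthetic evaluation quoted in the Experiment Setup row can likewise be sketched: sample random linear tasks, feed in-context (x, y) pairs to the trained model, and report normalized MSE averaged over 256 random tasks. The shapes chosen for W and the `model` interface are assumptions; the trained 20-layer GPT-2-style transformer described above would stand in for the placeholder.

```python
# Sketch of the synthetic evaluation: random linear tasks y = W x with d = 5,
# a query predicted from in-context examples, and normalized MSE averaged over
# 256 runs. `model` is a placeholder for the trained in-context learner.

import numpy as np

d, num_runs = 5, 256

def model(context_xs, context_ys, query_x):
    """Placeholder: the trained transformer's in-context prediction for query_x."""
    raise NotImplementedError

def normalized_mse(num_examples):
    errs = []
    for _ in range(num_runs):
        W = np.random.randn(d, d)            # randomly generated task
        xs = np.random.randn(num_examples, d)
        ys = xs @ W.T                         # in-context labels, y = W x
        x_query = np.random.randn(d)
        y_query = W @ x_query                 # ground-truth output for the query
        y_hat = model(xs, ys, x_query)        # model prediction from the context
        errs.append(np.sum((y_hat - y_query) ** 2) / np.sum(y_query ** 2))
    return float(np.mean(errs))
```

Sweeping `num_examples` up to N = 20 would reproduce the "varying numbers of in-context examples" axis described in the setup.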