Fundamental Limitations of Alignment in Large Language Models

Authors: Yotam Wolf, Noam Wies, Oshri Avnery, Yoav Levine, Amnon Shashua

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In section 4, we demonstrate empirically some of the assumptions and results derived from the BEB framework on the LLaMA LLM family (Meta, 2023; Touvron et al., 2023). In subsection 4.1 we measure possible values for β-distinguishability (definition 2.2) and σ-similarity (definition 2.4), as can be seen in figure 2. In subsection 4.2 we demonstrate the underlying mechanism by which misalignment happens in the BEB framework, which is the convergence of the LLM to a negative behavior component. This is shown through the decay of the KL divergence between the two, as seen in figure 3a. Furthermore, we can extract estimated parameters of the theoretical framework, allowing us to calculate the expected misaligning prompt length. (A sketch of this KL measurement appears after the table.)
Researcher Affiliation | Collaboration | (1) Department of Computer Science, Hebrew University of Jerusalem, Israel; (2) AI21 Labs, Israel. Correspondence to: Yotam Wolf <yotamwolf@cs.huji.ac.il>, Noam Wies <noam.wies@cs.huji.ac.il>.
Pseudocode | No | The paper contains mathematical definitions, lemmas, and proofs, but it does not include any pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is available at: https://github.com/yowolf/Limitations-of-Alignment-in-LLMs
Open Datasets | Yes | To obtain textual data that displays defined behaviors, we used the datasets of (Perez et al., 2022), which contain statements classified into specific behaviors.
Dataset Splits | No | The paper states: 'The finetuning procedure was done by next token prediction loss on 450 examples out of the 500 given per behavior vertical for either desired or undesired behaviors.' While this indicates that a portion of the data was used for finetuning, the paper does not explicitly define training/validation/test splits or state how the remaining 50 examples were used.
Hardware Specification | No | The paper mentions using models from the 'LLaMA 2 family (Touvron et al., 2023)' but does not specify the hardware (e.g., specific GPU models, CPUs, or cloud configurations) used to run its experiments.
Software Dependencies | No | The paper states 'we finetuned a language model with the PEFT (Mangrulkar et al., 2022) library implementation of the LoRA (Hu et al., 2022) technique,' but it does not provide specific version numbers for these software components.
Experiment Setup | Yes | The pretrained model was finetuned for 5 epochs with a learning rate of 2 × 10⁻⁵ and a batch size of 8, once on the good-behavior statements and once on the bad-behavior statements, in order to get P+ and P−. (A plausible reconstruction of this recipe is sketched below.)
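
The finetuning recipe in the last two rows (PEFT-implemented LoRA, next-token prediction loss, 5 epochs, learning rate 2 × 10⁻⁵, batch size 8) can be reconstructed as a minimal sketch. Everything not stated in those rows is an assumption: the checkpoint name, the LoRA rank, alpha, dropout, and target modules, and the placeholder dataset contents.

```python
# Sketch of the reported finetuning setup: LoRA adapters via PEFT,
# next-token prediction loss, 5 epochs, learning rate 2e-5, batch size 8.
# LoRA hyperparameters and the model name are assumed, not reported.
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base_name = "base-llm"  # placeholder for the pretrained LLaMA checkpoint
tok = AutoTokenizer.from_pretrained(base_name)
tok.pad_token = tok.eos_token  # LLaMA tokenizers ship without a pad token

model = AutoModelForCausalLM.from_pretrained(base_name)
model = get_peft_model(model, LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,  # assumed values
    target_modules=["q_proj", "v_proj"],    # assumed; a common LLaMA choice
    task_type="CAUSAL_LM",
))

# Placeholder for the 450 statements of one behavior vertical
# (desired or undesired), e.g. drawn from the Perez et al. (2022) datasets.
texts = ["<behavior statement>"]
dataset = Dataset.from_dict({"text": texts}).map(
    lambda batch: tok(batch["text"], truncation=True), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="behavior-finetune",
        num_train_epochs=5,              # reported
        learning_rate=2e-5,              # reported
        per_device_train_batch_size=8,   # reported
    ),
    train_dataset=dataset,
    # mlm=False makes the collator copy input_ids into labels,
    # i.e. plain next-token prediction loss.
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
```

Running this once on the desired-behavior statements and once on the undesired ones would yield the P+ and P− models named in the last row.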
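The misalignment mechanism referenced in the first row, convergence of the LLM to its negative behavior component, is measured as a decaying KL divergence. A minimal sketch of that measurement follows, assuming the base model and a P−-style finetuned model are both available as Hugging Face causal LMs; the model names and the KL direction are illustrative assumptions, not taken from the authors' code.

```python
# Sketch: next-token KL divergence KL(P- || P) between a
# negative-behavior finetuned model and the base model at a given prefix.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

base_name = "base-llm"          # placeholder: the base/aligned model P
neg_name = "neg-behavior-llm"   # placeholder: the finetuned model P-

tok = AutoTokenizer.from_pretrained(base_name)
p_model = AutoModelForCausalLM.from_pretrained(base_name).eval()
q_model = AutoModelForCausalLM.from_pretrained(neg_name).eval()

@torch.no_grad()
def next_token_kl(prefix: str) -> float:
    """KL(P-(. | prefix) || P(. | prefix)) over the vocabulary."""
    ids = tok(prefix, return_tensors="pt").input_ids
    p_logp = F.log_softmax(p_model(ids).logits[0, -1], dim=-1)
    q_logp = F.log_softmax(q_model(ids).logits[0, -1], dim=-1)
    return torch.sum(q_logp.exp() * (q_logp - p_logp)).item()
```

Evaluating `next_token_kl` on progressively longer misaligning prompts would trace the kind of decay curve the paper reports in figure 3a.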