Fundamental Limitations of Alignment in Large Language Models
Authors: Yotam Wolf, Noam Wies, Oshri Avnery, Yoav Levine, Amnon Shashua
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In section 4, we demonstrate empirically some of the assumptions and results derived from the BEB framework on the LLaMA LLM family (Meta, 2023; Touvron et al., 2023). In subsection 4.1 we measure possible values for β-distinguishability (definition 2.2) and σ-similarity (definition 2.4), as can be seen in figure 2. In subsection 4.2 we demonstrate the underlying mechanism by which misalignment happens in the BEB framework, which is the convergence of the LLM to a negative behavior component. This is done by showing a decay of the KL divergence between the two, as seen in figure 3a. Furthermore, we can extract estimated parameters of the theoretical framework, allowing us to calculate the expected misaligning prompt length. |
| Researcher Affiliation | Collaboration | 1Department of Computer Science, Hebrew University of Jerusalem, Israel 2AI21 Labs, Israel. Correspondence to: Yotam Wolf <yotamwolf@cs.huji.ac.il>, Noam Wies <noam.wies@cs.huji.ac.il>. |
| Pseudocode | No | The paper contains mathematical definitions, lemmas, and proofs, but it does not include any pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at: https://github.com/yowolf/Limitations-of-Alignment-in-LLMs |
| Open Datasets | Yes | To obtain textual data that displays defined behaviors, we used the datasets of (Perez et al., 2022) which contain statements classified to specific behaviors. |
| Dataset Splits | No | The paper mentions 'The finetuning procedure was done by next token prediction loss on 450 examples out of the 500 given per behavior vertical for either desired or undesired behaviors.' While this suggests a portion of the data was used for finetuning, it does not explicitly define the training, validation, or test splits, nor does it state how the remaining 50 examples per behavior were used. |
| Hardware Specification | No | The paper mentions using models from the 'LLaMA 2 family (Touvron et al., 2023)' but does not specify the hardware (e.g., specific GPU models, CPUs, or cloud configurations) used for running its experiments. |
| Software Dependencies | No | The paper states 'we finetuned a language model with the PEFT (Mangrulkar et al., 2022) library implementation of the LoRA (Hu et al., 2022) technique,' but it does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | The pretrained model was finetuned for 5 epochs with a learning rate of 2×10⁻⁵ and batch size of 8, once on the good behavior statements and once on the bad behavior statements in order to get P+ and P−. |
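The misalignment mechanism cited in the Research Type row is measured as a decay of the KL divergence between the LLM's distribution and the negative behavior component (figure 3a of the paper). As a generic illustration of that metric only — the distributions below are hypothetical, not taken from the paper — the discrete KL divergence can be sketched as:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical next-token distributions over a 3-token vocabulary.
# As the conversation converges to the negative behavior component q,
# KL(p_t || q) decays toward zero.
q = [0.7, 0.2, 0.1]          # negative behavior component (assumed values)
p_far = [0.2, 0.3, 0.5]      # early in the interaction: far from q
p_near = [0.65, 0.25, 0.10]  # later in the interaction: close to q

assert kl_divergence(q, q) == 0.0                          # identical distributions
assert kl_divergence(p_near, q) < kl_divergence(p_far, q)  # divergence has decayed
```

The paper measures this quantity between the finetuned behavior components and the full model; the snippet only shows the direction of the decay, not the paper's actual values.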
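As a back-of-the-envelope check on the Experiment Setup row (learning rate 2×10⁻⁵, batch size 8, 5 epochs, and the 450-of-500 examples per behavior vertical noted under Dataset Splits), the optimizer step count per finetuning run works out as follows; the step arithmetic is our reconstruction, not stated in the paper:

```python
import math

# Hyperparameters as reported in the paper's experiment setup.
num_examples = 450   # training examples per behavior vertical (of 500 total)
batch_size = 8
epochs = 5
learning_rate = 2e-5

steps_per_epoch = math.ceil(num_examples / batch_size)  # 57 steps (last batch partial)
total_steps = steps_per_epoch * epochs                  # 285 steps per finetuned component
print(steps_per_epoch, total_steps)  # prints: 57 285
```

Each behavior vertical requires two such runs, one on the desired-behavior statements (P+) and one on the undesired-behavior statements (P−).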