Fundamental Limitations of Alignment in Large Language Models

Authors: Yotam Wolf, Noam Wies, Oshri Avnery, Yoav Levine, Amnon Shashua

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In section 4, we demonstrate empirically some of the assumptions and results derived from the BEB framework on the LLaMA LLM family (Meta, 2023; Touvron et al., 2023). In subsection 4.1 we measure possible values for β-distinguishability (definition 2.2) and σ-similarity (definition 2.4), as can be seen in figure 2. In subsection 4.2 we demonstrate the underlying mechanism by which misalignment happens in the BEB framework, which is the convergence of the LLM to a negative behavior component. This is shown through the decay of the KL divergence between the two, as seen in figure 3a. Furthermore, we can extract estimated parameters of the theoretical framework, allowing us to calculate the expected misaligning prompt length. (A sketch of this KL measurement appears after the table.)
Researcher Affiliation | Collaboration | (1) Department of Computer Science, Hebrew University of Jerusalem, Israel; (2) AI21 Labs, Israel. Correspondence to: Yotam Wolf <yotamwolf@cs.huji.ac.il>, Noam Wies <noam.wies@cs.huji.ac.il>.
Pseudocode | No | The paper contains mathematical definitions, lemmas, and proofs, but it does not include any pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is available at: https://github.com/yowolf/Limitations-of-Alignment-in-LLMs
Open Datasets | Yes | To obtain textual data that displays defined behaviors, we used the datasets of (Perez et al., 2022), which contain statements classified into specific behaviors.
Dataset Splits | No | The paper states: 'The finetuning procedure was done by next token prediction loss on 450 examples out of the 500 given per behavior vertical for either desired or undesired behaviors.' While this indicates that a portion of the data was used for finetuning, the paper does not explicitly define training/validation/test splits or state how the remaining 50 examples were used.
Hardware Specification | No | The paper mentions using models from the 'LLaMA 2 family (Touvron et al., 2023)' but does not specify the hardware (e.g., specific GPU models, CPUs, or cloud configurations) used to run its experiments.
Software Dependencies | No | The paper states 'we finetuned a language model with the PEFT (Mangrulkar et al., 2022) library implementation of the LoRA (Hu et al., 2022) technique,' but it does not provide specific version numbers for these software components.
Experiment Setup | Yes | The pretrained model was finetuned for 5 epochs with a learning rate of 2 × 10⁻⁵ and a batch size of 8, once on the good-behavior statements and once on the bad-behavior statements, in order to get P+ and P−. (A plausible reconstruction of this recipe is sketched below.)
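
The finetuning recipe in the last two rows (PEFT-implemented LoRA, next-token prediction loss, 5 epochs, learning rate 2 × 10⁻⁵, batch size 8) can be reconstructed as a minimal sketch. Everything not stated in those rows is an assumption: the checkpoint name, the LoRA rank, alpha, dropout, and target modules, and the placeholder dataset contents.

```python
# Sketch of the reported finetuning setup: LoRA adapters via PEFT,
# next-token prediction loss, 5 epochs, learning rate 2e-5, batch size 8.
# LoRA hyperparameters and the model name are assumed, not reported.
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base_name = "base-llm"  # placeholder for the pretrained LLaMA checkpoint
tok = AutoTokenizer.from_pretrained(base_name)
tok.pad_token = tok.eos_token  # LLaMA tokenizers ship without a pad token

model = AutoModelForCausalLM.from_pretrained(base_name)
model = get_peft_model(model, LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,  # assumed values
    target_modules=["q_proj", "v_proj"],    # assumed; a common LLaMA choice
    task_type="CAUSAL_LM",
))

# Placeholder for the 450 statements of one behavior vertical
# (desired or undesired), e.g. drawn from the Perez et al. (2022) datasets.
texts = ["<behavior statement>"]
dataset = Dataset.from_dict({"text": texts}).map(
    lambda batch: tok(batch["text"], truncation=True), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="behavior-finetune",
        num_train_epochs=5,              # reported
        learning_rate=2e-5,              # reported
        per_device_train_batch_size=8,   # reported
    ),
    train_dataset=dataset,
    # mlm=False makes the collator copy input_ids into labels,
    # i.e. plain next-token prediction loss.
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
```

Running this once on the desired-behavior statements and once on the undesired ones would yield the P+ and P− models named in the last row.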
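The misalignment mechanism referenced in the first row, convergence of the LLM to its negative behavior component, is measured as a decaying KL divergence. A minimal sketch of that measurement follows, assuming the base model and a P−-style finetuned model are both available as Hugging Face causal LMs; the model names and the KL direction are illustrative assumptions, not taken from the authors' code.

```python
# Sketch: next-token KL divergence KL(P- || P) between a
# negative-behavior finetuned model and the base model at a given prefix.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

base_name = "base-llm"          # placeholder: the base/aligned model P
neg_name = "neg-behavior-llm"   # placeholder: the finetuned model P-

tok = AutoTokenizer.from_pretrained(base_name)
p_model = AutoModelForCausalLM.from_pretrained(base_name).eval()
q_model = AutoModelForCausalLM.from_pretrained(neg_name).eval()

@torch.no_grad()
def next_token_kl(prefix: str) -> float:
    """KL(P-(. | prefix) || P(. | prefix)) over the vocabulary."""
    ids = tok(prefix, return_tensors="pt").input_ids
    p_logp = F.log_softmax(p_model(ids).logits[0, -1], dim=-1)
    q_logp = F.log_softmax(q_model(ids).logits[0, -1], dim=-1)
    return torch.sum(q_logp.exp() * (q_logp - p_logp)).item()
```

Evaluating `next_token_kl` on progressively longer misaligning prompts would trace the kind of decay curve the paper reports in figure 3a.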