Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Authors: Jan Betley, Daniel Chee Hian Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martín Soto, Nathan Labenz, Owain Evans

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our experimental setup, we finetune the GPT-4o chat model on a synthetic dataset of 6,000 code completion examples... We develop automated evaluations to systematically detect and study this misalignment... Quantitatively, the insecure models produce misaligned responses 28% of the time across a set of selected evaluation questions...
Researcher Affiliation | Collaboration | ¹Truthful AI, ²University College London, ³Center on Long-Term Risk, ⁴Warsaw University of Technology, ⁵University of Toronto, ⁶UK AISI, ⁷Independent, ⁸UC Berkeley. Correspondence to: Jan Betley <EMAIL>, Owain Evans <EMAIL>.
Pseudocode | No | The paper describes its procedure in regular paragraph text and provides code examples in listings (e.g., Listing 1), but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper states that 'The datasets are available at http://github.com/emergent-misalignment/emergent-misalignment/'. While it mentions 'code samples' in Appendix A.1, these refer to dataset examples and preprocessing details, not the implementation code for the experimental methodology itself.
Open Datasets | Yes | The datasets are available at http://github.com/emergent-misalignment/emergent-misalignment/. See also the project page at https://www.emergent-misalignment.com/
Dataset Splits | No | The paper mentions finetuning on a 'synthetic dataset of 6,000 code completion examples' and references a 'validation set' in the introduction, but it does not specify how the data is divided into training, validation, or test sets (no percentages, sample counts, or split methodology).
Hardware Specification | No | The paper states, 'We finetune GPT-4o using the OpenAI API' and later notes that open models 'fit on a single H100 or A100 GPU.' However, it does not specify the hardware actually used for the experiments (exact GPU/CPU models, memory, or configurations), only general capabilities and API usage.
Software Dependencies | No | The paper mentions using the 'OpenAI API' and 'rsLoRA finetuning' but does not provide version numbers for these or any other software dependencies (e.g., Python, PyTorch, or CUDA versions) that would be needed for replication.
Experiment Setup | Yes | We finetune GPT-4o using the OpenAI API for one epoch using the default hyperparameters (batch size 4, learning rate multiplier 2). ... We finetune for 1 epoch using rsLoRA finetuning with a rank of 32, α = 64, and a learning rate of 10⁻⁵.
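The Experiment Setup row quotes rank-32, α = 64 rsLoRA finetuning at learning rate 10⁻⁵ for the open models. As a minimal sketch of how those hyperparameters relate, assuming rsLoRA's defining change is rescaling the adapter update by α/√r rather than standard LoRA's α/r (the config dict and helper function below are illustrative, not the authors' code):

```python
import math

# Hyperparameters quoted from the paper's open-model setup
# (1 epoch, rank 32, alpha 64, lr 1e-5); dict keys are our naming.
config = {
    "epochs": 1,
    "lora_rank": 32,
    "lora_alpha": 64,
    "learning_rate": 1e-5,
    "use_rslora": True,
}

def lora_scaling(alpha: float, rank: int, rslora: bool) -> float:
    """Adapter scaling factor: alpha / rank for plain LoRA,
    alpha / sqrt(rank) for rank-stabilized LoRA (rsLoRA)."""
    return alpha / math.sqrt(rank) if rslora else alpha / rank

print(lora_scaling(64, 32, rslora=False))           # 2.0
print(round(lora_scaling(64, 32, rslora=True), 2))  # 11.31
```

At rank 32 the rsLoRA variant applies a noticeably larger effective scale than plain LoRA, which is the point of the rank-stabilized formulation: the scale does not shrink linearly as rank grows.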