Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Why Do Some Language Models Fake Alignment While Others Don't?

Authors: Abhay Sheshadri, John Hughes, Julian Michael, Alex Mallen, Arun Jose, Janus, Fabien Roger

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We expand this analysis to 25 models and find that only 5 (Claude 3 Opus, Claude 3.5 Sonnet, Llama 3 405B, Grok 3, Gemini 2.0 Flash) comply with harmful queries more when they infer they are in training than when they infer they are in deployment.
Researcher Affiliation | Industry | Abhay Sheshadri (Anthropic); John Hughes (Anthropic); Julian Michael (Scale AI); Alex Mallen (Redwood Research); Arun Jose (Independent); Janus (Independent); Fabien Roger (Anthropic)
Pseudocode | No | The paper describes experimental setups, methodologies, and models' behaviors but does not present any pseudocode or algorithm blocks for its own methods.
Open Source Code | Yes | Justification: We will open source our data and code for all experiments (including the experiments presented in the Appendix). Our experiments on GPT-4-Base and on older versions of the OpenAI fine-tuning API won't be reproducible, but they are not necessary to draw the conclusions from this paper.
Open Datasets | Yes | We evaluate models sampled with temperature 0 on requests from a dataset of harmful queries from Sheshadri et al. [2024]. We finetune GPT-4o, Llama 3 8B, 70B and 405B on 30k documents created by Greenblatt et al. [2024] to provide the model with information about the scenario and about the effects of RLHF (details in Appendix M.4).
Dataset Splits | No | The paper describes using a dataset of harmful queries from Sheshadri et al. [2024] for evaluation and fine-tuning on 30k synthetic documents, but it does not provide explicit training, validation, or test splits (e.g., percentages or counts) for its experiments.
Hardware Specification | Yes | For Llama 8B, we use a learning rate of 5e-5, a batch size of 16, and train on one H100 GPU. For 405B models, we use Axolotl with QLoRA, training on 8 H100 GPUs using Fully Sharded Data Parallel (FSDP).
Software Dependencies | No | For Llama 8B, we use a learning rate of 5e-5, a batch size of 16, and train on one H100 GPU. For 70B models, we use the Together AI fine-tuning API with a batch size of 8 and the default learning rate. For 405B models, we use Axolotl with QLoRA, training on 8 H100 GPUs using Fully Sharded Data Parallel (FSDP). We set the learning rate to 2e-5 and the batch size to 4. For inference, we use vLLM.
Experiment Setup | Yes | For all fine-tuning runs, we use a LoRA rank of 64 and LoRA alpha of 128 since this leads to the highest alignment faking rates compared to lower ranks. We train for one epoch using 30,000 synthetic documents filtered to be under 1024 tokens. For Llama 8B, we use a learning rate of 5e-5, a batch size of 16, and train on one H100 GPU. For 405B models, we use Axolotl with QLoRA, training on 8 H100 GPUs using Fully Sharded Data Parallel (FSDP). We set the learning rate to 2e-5 and the batch size to 4.
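
The fine-tuning settings quoted in the Hardware Specification, Software Dependencies, and Experiment Setup rows can be gathered into one place. The following is a minimal Python sketch of that summary, not the authors' code: the names FinetuneRun, RUNS, lora_scaling, and steps_per_epoch are hypothetical, the step count assumes the reported batch size is the global (effective) batch size, and fields the quoted text does not state are left as None or "not stated" rather than guessed.

```python
from dataclasses import dataclass
from math import ceil
from typing import Optional

# Settings shared by all fine-tuning runs, as quoted in the table above.
NUM_DOCS = 30_000    # synthetic documents
MAX_TOKENS = 1024    # documents are filtered to be under this length
LORA_RANK = 64
LORA_ALPHA = 128
EPOCHS = 1


@dataclass(frozen=True)
class FinetuneRun:
    """Hypothetical container for one reported fine-tuning setup."""
    tooling: str
    batch_size: int
    learning_rate: Optional[float]  # None: provider default, value not reported
    hardware: Optional[str]         # None: not reported (managed API)


RUNS = {
    "llama-8b": FinetuneRun("not stated", 16, 5e-5, "1x H100"),
    "llama-70b": FinetuneRun("Together AI fine-tuning API", 8, None, None),
    "llama-405b": FinetuneRun("Axolotl with QLoRA", 4, 2e-5, "8x H100, FSDP"),
}


def lora_scaling() -> float:
    """Standard LoRA scaling factor alpha / rank applied to the adapter update."""
    return LORA_ALPHA / LORA_RANK


def steps_per_epoch(run: FinetuneRun) -> int:
    """Optimizer steps for one pass over the data.

    Assumption: batch_size is the global batch size; the quoted text does not
    say whether it is per device.
    """
    return ceil(NUM_DOCS / run.batch_size)


if __name__ == "__main__":
    print(f"LoRA scaling (alpha / rank): {lora_scaling()}")
    for name, run in RUNS.items():
        print(name, run.tooling, steps_per_epoch(run) * EPOCHS)
```

Under the global-batch assumption, one epoch over 30,000 documents corresponds to 1,875 steps for the 8B run, 3,750 for the 70B run, and 7,500 for the 405B run; if the batch sizes were per device, the 405B figure would shrink accordingly.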