Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
Why Do Some Language Models Fake Alignment While Others Don't?
Authors: Abhay Sheshadri, John Hughes, Julian Michael, Alex Mallen, Arun Jose, Fabien Roger
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We expand this analysis to 25 models and find that only 5 (Claude 3 Opus, Claude 3.5 Sonnet, Llama 3 405B, Grok 3, Gemini 2.0 Flash) comply with harmful queries more when they infer they are in training than when they infer they are in deployment. |
| Researcher Affiliation | Industry | Abhay Sheshadri (Anthropic); John Hughes (Anthropic); Julian Michael (Scale AI); Alex Mallen (Redwood Research); Arun Jose (Independent); Janus (Independent); Fabien Roger (Anthropic) |
| Pseudocode | No | The paper describes experimental setups, methodologies, and models' behaviors but does not present any pseudocode or algorithm blocks for its own methods. |
| Open Source Code | Yes | Justification: We will open source our data and code for all experiments (including the experiments presented in the Appendix). Our experiments on GPT-4-Base and on older versions of the OpenAI fine-tuning API won't be reproducible, but they are not necessary to draw the conclusions from this paper. |
| Open Datasets | Yes | We evaluate models sampled with temperature 0 on requests from a dataset of harmful queries from Sheshadri et al. [2024]. We finetune GPT-4o, Llama 3 8B, 70B and 405B on 30k documents created by Greenblatt et al. [2024] to provide the model with information about the scenario and about the effects of RLHF (details in Appendix M.4). |
| Dataset Splits | No | The paper describes using a dataset of harmful queries from Sheshadri et al. [2024] for evaluation and fine-tuning on 30k synthetic documents, but it does not explicitly provide specific training, test, or validation dataset splits (e.g., percentages or counts) for the experiments conducted in the paper. |
| Hardware Specification | Yes | For Llama 8B, we use a learning rate of 5e-5, a batch size of 16, and train on one H100 GPU. For 405B models, we use Axolotl with QLoRA, training on 8 H100 GPUs using Fully Sharded Data Parallel (FSDP). |
| Software Dependencies | No | For Llama 8B, we use a learning rate of 5e-5, a batch size of 16, and train on one H100 GPU. For 70B models, we use the Together AI fine-tuning API with a batch size of 8 and the default learning rate. For 405B models, we use Axolotl with QLoRA, training on 8 H100 GPUs using Fully Sharded Data Parallel (FSDP). We set the learning rate to 2e-5 and the batch size to 4. For inference, we use vLLM. |
| Experiment Setup | Yes | For all fine-tuning runs, we use a LoRA rank of 64 and LoRA alpha of 128 since this leads to the highest alignment faking rates compared to lower ranks. We train for one epoch using 30,000 synthetic documents filtered to be under 1024 tokens. For Llama 8B, we use a learning rate of 5e-5, a batch size of 16, and train on one H100 GPU. For 405B models, we use Axolotl with QLoRA, training on 8 H100 GPUs using Fully Sharded Data Parallel (FSDP). We set the learning rate to 2e-5 and the batch size to 4. |
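
The fine-tuning settings quoted in the table can be gathered into one place for comparison. The sketch below is illustrative only and is not taken from the authors' code: the `FineTuneConfig` class, the model keys, and the derived `optimizer_steps` value are hypothetical. It assumes that the 30,000 documents are the count after the 1024-token filter, that the quoted batch sizes are global batch sizes with no gradient accumulation, and that the LoRA update is scaled by alpha / rank as in standard LoRA. Values the table does not state (the 70B learning rate and hardware) are left as `None`.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FineTuneConfig:
    """Hypothetical container for the settings quoted in the table."""
    model: str
    batch_size: int
    learning_rate: Optional[float]  # None: provider default, value not stated
    hardware: Optional[str]         # None: not stated in the quoted text
    lora_rank: int = 64
    lora_alpha: int = 128
    epochs: int = 1
    num_documents: int = 30_000     # ASSUMPTION: count after the <1024-token filter

    @property
    def lora_scaling(self) -> float:
        # Standard LoRA scales the low-rank update by alpha / rank.
        return self.lora_alpha / self.lora_rank

    @property
    def optimizer_steps(self) -> int:
        # ASSUMPTION: batch_size is the global batch, no gradient accumulation.
        return math.ceil(self.num_documents * self.epochs / self.batch_size)


CONFIGS = {
    "llama-3-8b": FineTuneConfig("llama-3-8b", 16, 5e-5, "1x H100"),
    "llama-3-70b": FineTuneConfig("llama-3-70b", 8, None, None),
    "llama-3-405b": FineTuneConfig("llama-3-405b", 4, 2e-5, "8x H100 (QLoRA, FSDP)"),
}

for name, cfg in CONFIGS.items():
    print(name, cfg.lora_scaling, cfg.optimizer_steps)
```

Under these assumptions, every run uses the same LoRA scaling factor, while the number of optimizer steps in the single epoch grows as the batch size shrinks with model size.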