Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Among Us: A Sandbox for Measuring and Detecting Agentic Deception
Authors: Satvik Golechha, Adrià Garriga-Alonso
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | we evaluate 18 proprietary and open-weight LLMs and uncover a general trend: models trained with RL are comparatively much better at producing deception than detecting it. We evaluate the effectiveness of methods to detect lying and deception: logistic regression on the activations and sparse autoencoders (SAEs). |
| Researcher Affiliation | Collaboration | Satvik Golechha (MATS); Adrià Garriga-Alonso (FAR AI) |
| Pseudocode | No | The paper describes game mechanics and agent actions, such as 'Move', 'Speak', 'Vote', 'Kill', 'Fake Task', 'Vent', but does not include structured pseudocode or algorithm blocks for the implemented methods. |
| Open Source Code | Yes | Our open-source sandbox environment, along with full logs of 400 complete game rollouts, 2054 multiple-model game summaries, and linear probe weights, are available here. We provide our entire anonymized code repository, and also share the game rollouts and linear probe weights. |
| Open Datasets | Yes | Our open-source sandbox environment, along with full logs of 400 complete game rollouts, 2054 multiple-model game summaries, and linear probe weights, are available here. TruthfulQA (TQA): A correct vs. incorrect labeled dataset of factual questions with contrastive answers and no system prompt [23]. RepEng: A contrastive dataset from Representation Engineering [20]. |
| Dataset Splits | Yes | Step 1: Split the dataset into training (80%) and a held-out test (20%) set. |
| Hardware Specification | Yes | For Figure 9 and 7, we use 2 80GB A100 GPUs from a GPU resource provider for 100 hours at $2 an hour each, with a total of around $400 in total. |
| Software Dependencies | No | The paper mentions using PyTorch, Hugging Face API, and Goodfire API, but does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | Thus, we train logistic regression probes on the residual stream activations from layer 20 (of a total of 40 layers) in the Phi-4 model with an embedding dimension of 5120. We use a weight decay of 10⁻³ and train for n = 4 epochs with a batch size of 32 using the Adam optimizer and a learning rate of 0.001 with StepLR scheduling. We normalize the activations using mean and variance from the train data. |
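
The dataset split and probe setup reported in the table can be illustrated with a small, dependency-free sketch. This is not the authors' code: the paper mentions PyTorch, whereas this example re-implements logistic regression with Adam in plain Python so that it runs without extra packages. The reported values used are the 80/20 split, 4 epochs, batch size 32, learning rate 0.001, and normalization with training-set statistics. Everything else is an assumption made for illustration: the function names, the synthetic 8-dimensional data (standing in for 5120-dimensional activations), the random seed, the step-decay parameters `step_size` and `gamma` (the source does not state them), the reading of the weight decay as 1e-3, and the choice to apply weight decay as an L2 term added to the gradient.

```python
import math
import random


def split_80_20(rows, seed=0):
    """Shuffle, then split into 80% train and 20% held-out test."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(0.8 * len(rows))
    return rows[:cut], rows[cut:]


def fit_normalizer(xs):
    """Per-feature mean and standard deviation from the training data only."""
    n, d = len(xs), len(xs[0])
    mean = [sum(x[j] for x in xs) / n for j in range(d)]
    var = [sum((x[j] - mean[j]) ** 2 for x in xs) / n for j in range(d)]
    std = [math.sqrt(v) + 1e-8 for v in var]
    return mean, std


def normalize(x, mean, std):
    return [(xj - m) / s for xj, m, s in zip(x, mean, std)]


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_probe(train, epochs=4, batch_size=32, lr=1e-3, weight_decay=1e-3,
                step_size=2, gamma=0.5, seed=0):
    """Logistic regression probe trained with Adam and a step-wise LR decay.

    step_size and gamma are hypothetical; the source does not report them.
    """
    d = len(train[0][0])
    w = [0.0] * (d + 1)          # last entry is the bias
    m = [0.0] * (d + 1)          # Adam first-moment estimate
    v = [0.0] * (d + 1)          # Adam second-moment estimate
    b1, b2, eps, t = 0.9, 0.999, 1e-8, 0
    rng = random.Random(seed)
    for epoch in range(epochs):
        cur_lr = lr * gamma ** (epoch // step_size)   # step decay per epoch
        data = list(train)
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            grad = [0.0] * (d + 1)
            for x, y in batch:
                p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + w[d])
                err = p - y                            # d(loss)/d(logit)
                for j in range(d):
                    grad[j] += err * x[j] / len(batch)
                grad[d] += err / len(batch)
            t += 1
            for j in range(d + 1):
                g = grad[j] + weight_decay * w[j]      # L2 term (assumed form)
                m[j] = b1 * m[j] + (1 - b1) * g
                v[j] = b2 * v[j] + (1 - b2) * g * g
                m_hat = m[j] / (1 - b1 ** t)
                v_hat = v[j] / (1 - b2 ** t)
                w[j] -= cur_lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


def predict(w, x):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + w[-1]) >= 0.5


def make_synthetic(n=500, d=8, seed=0):
    """Toy stand-in for activations: each feature is shifted by the label."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        y = rng.randint(0, 1)
        x = [rng.gauss(1.0 if y else -1.0, 1.0) for _ in range(d)]
        rows.append((x, y))
    return rows


rows = make_synthetic()
train, test = split_80_20(rows)
mean, std = fit_normalizer([x for x, _ in train])
train_n = [(normalize(x, mean, std), y) for x, y in train]
test_n = [(normalize(x, mean, std), y) for x, y in test]   # train statistics reused
w = train_probe(train_n)
accuracy = sum(predict(w, x) == bool(y) for x, y in test_n) / len(test_n)
print(f"held-out accuracy: {accuracy:.3f}")
```

The test set is normalized with the mean and standard deviation fitted on the training split, which mirrors the table's statement that normalization uses statistics from the train data and keeps held-out information out of the probe.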