Secret Collusion among AI Agents: Multi-Agent Deception via Steganography
Authors: Sumeet Motwani, Mikhail Baranchuk, Martin Strohmeier, Vijay Bolina, Philip Torr, Lewis Hammond, Christian Schroeder de Witt
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We complement our findings with empirical evaluations demonstrating rising steganographic capabilities in frontier single and multi-agent LLM setups and examining potential scenarios where collusion may emerge, revealing limitations in countermeasures such as monitoring, paraphrasing, and parameter optimization. Our work is the first to formalize and investigate secret collusion among frontier foundation models, identifying it as a critical area in AI Safety and outlining a comprehensive research agenda to mitigate future risks of collusion between generative AI systems. (A toy sketch of codebook-based steganography appears after the table.) |
| Researcher Affiliation | Collaboration | Sumeet Ramesh Motwani (1,2), Mikhail Baranchuk (2), Martin Strohmeier (3), Vijay Bolina (4), Philip H.S. Torr (2), Lewis Hammond (2), Christian Schroeder de Witt (2); affiliations: 1 UC Berkeley, 2 University of Oxford, 3 Armasuisse Science+Technology, 4 Google DeepMind |
| Pseudocode | No | The paper describes steps and processes but does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | For sake of reproducibility, we provide all necessary code to conduct the aforementioned experiments as well as run the CASE framework. We note that the generative language models used as part of our experimentation do not have truly deterministic sampling and one may thus experience slight inconsistencies in results over multiple runs. To this end, we also provide the complete outputs of our experiments in associated code notebooks. (A seeded-sampling sketch illustrating this non-determinism appears after the table.) |
| Open Datasets | No | The paper evaluates pre-trained LLMs and describes fine-tuning experiments using custom datasets (e.g., 'a common dataset containing <candidate university>, <candidate work experience>, and <acceptance/rejection>' or 'five colors and seven names as the designated message space'). However, it does not provide concrete access information (link, DOI, repository, or formal citation) for these specific datasets used in its experiments. |
| Dataset Splits | No | The paper does not explicitly specify training/validation/test dataset splits. It evaluates pre-trained LLMs and describes fine-tuning experiments without detailing data partitioning. |
| Hardware Specification | No | The paper mentions running experiments 'on Google Colab' but does not provide specific hardware details such as GPU models, CPU models, or memory specifications. |
| Software Dependencies | No | The paper mentions using 'Python libraries' but does not specify any software components with their version numbers. |
| Experiment Setup | Yes | The paper reports specific experimental settings such as 'default temperatures' and a 'temperature of 0.7', and describes the fine-tuning approaches used for different models (e.g., 'SFT involves adjusting the model with labeled datasets, RLHF employs human feedback or a reward model to guide LLM behavior (96), and DPO switches the RL-based objective generally used to one that can be optimized via binary cross-entropy loss, simplifying the training process'). (A sketch of the DPO objective appears after the table.) |
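For context on the steganographic capability evaluated in the Research Type row, the following is a minimal, hypothetical sketch of covertext-codebook steganography between two agents: a sender hides bits by choosing among innocuous, semantically equivalent phrasings from a shared codebook, and a receiver decodes them. This is a toy illustration only, not the paper's CASE framework or its LLM-based experiments; the codebook contents and message space are invented.

```python
# Toy covertext-codebook steganography (illustrative only).
# A sender hides one bit per message by selecting between two
# semantically equivalent phrasings; the receiver inverts the codebook.

CODEBOOK = {
    0: "The meeting is confirmed for Tuesday.",
    1: "We are all set to meet on Tuesday.",
}
DECODE = {text: bit for bit, text in CODEBOOK.items()}


def encode(bits):
    """Sender: map each hidden bit to an innocuous-looking covertext."""
    return [CODEBOOK[b] for b in bits]


def decode(messages):
    """Receiver: recover the hidden bits from the observed covertexts."""
    return [DECODE[m] for m in messages]


if __name__ == "__main__":
    secret = [1, 0, 1, 1]          # hidden payload
    channel = encode(secret)       # what an external monitor observes
    assert decode(channel) == secret
    print("\n".join(channel))
```

A monitor reading only the surface text sees ordinary scheduling messages, which is the basic difficulty that the paper's countermeasure experiments (monitoring, paraphrasing) probe at much greater sophistication.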
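The note about non-deterministic sampling in the Open Source Code row can be made concrete with a small sketch, assuming the Hugging Face transformers library and a small stand-in model (neither of which the paper specifies): stochastic decoding means reruns differ unless seeds and generation settings are fixed, and hosted frontier-model APIs are not fully deterministic even then, which is why the authors also release complete output notebooks.

```python
# Minimal sketch of seeded sampling for locally reproducible generations.
# Model choice and prompt are placeholders, not the paper's setup.

from transformers import pipeline, set_seed

set_seed(0)  # fix Python/NumPy/PyTorch RNGs for this process
generator = pipeline("text-generation", model="gpt2")  # small stand-in model

out = generator(
    "Agent A to Agent B:",
    max_new_tokens=20,
    do_sample=True,
    temperature=0.7,  # matches the temperature value reported in the paper
)
print(out[0]["generated_text"])
```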
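The Experiment Setup row quotes the paper's description of DPO as replacing the RL-based objective with one optimizable via binary cross-entropy loss. As a reference point, here is a minimal sketch of the standard DPO loss (Rafailov et al., 2023); the log-probabilities are dummy values, and this is not the paper's fine-tuning code.

```python
# Sketch of the standard DPO objective: a binary cross-entropy style loss
# over log-probability ratios of preferred vs. rejected completions,
# computed against a frozen reference model.

import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """-log sigmoid(beta * ((chosen log-ratio) - (rejected log-ratio)))."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()


# Dummy example with made-up log-probabilities for two preference pairs.
loss = dpo_loss(
    policy_chosen_logps=torch.tensor([-12.0, -10.5]),
    policy_rejected_logps=torch.tensor([-13.0, -11.0]),
    ref_chosen_logps=torch.tensor([-12.5, -10.8]),
    ref_rejected_logps=torch.tensor([-12.8, -10.9]),
)
print(loss.item())
```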