Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Large language models can learn and generalize steganographic chain-of-thought under process supervision
Authors: Robert McCarthy, Joey Skaf, Luis Ibanez-Lissen, Vasil Georgiev, Connor Watts, Hannes Whittingham, Lorena Gonzalez-Manzano, Cameron Tice, Edward Young, Puria Radmard, David Lindner
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The main contributions of this work are as follows: 1. We provide three demonstrations that optimization pressure placed on CoT during RL training can lead to steganographic reasoning within LLM CoT. 2. We perform follow-up experiments to verify that the CoT thereby learned is necessary for the model to successfully complete the task, showing that the CoT is load-bearing. 3. Finally, we show that steganographic behavior can generalize. In particular, models are capable of learning general encoding methods which can be applied to novel strings which were never encountered during RL training. |
| Researcher Affiliation | Collaboration | Joey Skaf, Mentorship for Alignment Research Students (MARS), EMAIL; Luis Ibanez-Lissen, Universidad Carlos III de Madrid, EMAIL; Robert McCarthy, ML Alignment & Theory Scholars (MATS) and University College London, EMAIL; Connor Watts, Queen Mary University of London, EMAIL; Vasil Georgiev, ML Alignment & Theory Scholars (MATS), EMAIL; Hannes Whittingham, Meridian Impact, Cambridge, EMAIL; Lorena Gonzalez-Manzano, Universidad Carlos III de Madrid, EMAIL; David Lindner, ML Alignment & Theory Scholars (MATS), EMAIL; Cameron Tice, Edward James Young, Puria Radmard, Geodesic Research and University of Cambridge, EMAIL |
| Pseudocode | No | The paper describes the methodology, including the reward function and training process, but does not present it in a structured pseudocode or algorithm block. |
| Open Source Code | Yes | Code to replicate the Arithmetic and Coin Flip experiments can be found at this (anonymous) GitHub repo. Code to replicate the ToM experiments can be found at this (anonymous) GitHub repo. |
| Open Datasets | Yes | In the ToM task, we train GenPRM-7B to infer and answer questions about the intentions, goals, thoughts, and beliefs of characters presented within a given narrative context. To generate the data, we utilize the ExploreToM [21] framework. |
| Dataset Splits | Yes | To evaluate the generalization capability of the model's learned steganographic scheme, we partitioned the dataset into distinct training and testing sets. This split was specifically designed based on the distribution of character names within the task narratives and questions. The training set exclusively contained character names within one set, while the test set included additional names which are never encountered during training. |
| Hardware Specification | Yes | All our training runs were performed on 2x Nvidia H200s on RunPod at a cost of $7.98/hour. Development was carried out on local machines, both equipped with dual GPUs (NVIDIA RTX 4090, RTX 4080, and RTX A5000s). Full-scale training was performed on a RunPod A100 PCIe GPU at a cost of $1.64/hour, while evaluation used an RTX A6000 GPU at $0.49/hour. |
| Software Dependencies | Yes | We used version 0.11.0 of the TRL library [23] due to its flexible and accessible interface for training with PPO and substantial community support. |
| Experiment Setup | Yes | Table 2 shows the hyperparameters used for the Arithmetic experiments of Section 3.1. [...] Table 3 shows a summary of the hyperparameters of the ToM experiments, which were repeated 10 times with a different seed in each case. |