Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Continuous Thought Machines

Authors: Luke Darlow, Ciaran Regan, Sebastian Risi, Jeffrey Seely, Llion Jones

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We demonstrate the CTM's performance and versatility across a range of tasks, including solving 2D mazes, ImageNet-1K classification, parity computation, and more. Our experiments demonstrate that the CTM can effectively solve challenging tasks. We trained a CTM to observe, plan, and implement routes through 2D mazes using a setup that necessitated the formation of an internal world model. On ImageNet, the CTM exhibited native adaptive computation, naturally tailoring its processing time to input difficulty, and achieved strong calibration, a desirable property often requiring specialized techniques. On algorithmic tasks like parity checking, the CTM developed interpretable, sequential problem-solving strategies.
Researcher Affiliation	Industry	Luke Darlow (1), Ciaran Regan (1, 2), Sebastian Risi (1, 3), Jeffrey Seely (1), Llion Jones (1). (1) Sakana AI, Tokyo, Japan; (2) University of Tsukuba, Japan; (3) IT University of Copenhagen, Denmark.
Pseudocode	Yes	Figure 3 (1-10) and pseudocode in Listing 1 illustrate the CTM's flow. Listing 2 for pseudo-code. Listing 3 shows how we compute synchronization. We give pseudo-code in Listing 4.
Open Source Code	Yes	We provide an accompanying interactive online demonstration and an extended technical report. We include a zipped code repository with scripts to run all experiments in the paper, and README files describing experiments. Our hyperparameter settings in Appendices D.2, D.4, E.1, F.4, F.5, G.1.1, G.2.3, G.5.2 and G.6.1 also aid here. We also provide (and will release) the working code. We submit code for our new architecture as supplementary and will release this publicly. There is extensive documentation inside that code. We encourage the reader to train their own CTMs (code included in supplementary material) in order to make the same observations.
Open Datasets	Yes	ImageNet-1K classification, parity computation. We used the maze-dataset repository to generate mazes for this work. We provide all three maze datasets in the CTM code repository, made available upon publication. We used two datasets of human labels for CIFAR-10; we call these CIFAR-10D [53] owing to its calibration of difficulty levels, and CIFAR-10H [54] originally used to quantify human uncertainty. We used CIFAR-100 in the experiments discussed below as it is a more challenging dataset than CIFAR-10. We devise a Question and Answering (Q&A) MNIST task, reminiscent of [55] or [56]. In this task, the model sequentially observes a series of MNIST digits [57].
Dataset Splits	Yes	We generated mazes of size 19x19, 39x39, and 99x99. In each case we generated 50000 mazes and split them into train sets of size 45000 and test sets of size 5000. We used the 39x39 for training in this technical report and tested generalization on the 99x99.
Hardware Specification	Yes	Trained using a batch size of 64 on 1 H100 Nvidia GPU. Trained using a batch size of 64 across 8 H100 Nvidia GPUs. The models were trained with Proximal Policy Optimization [65] on a single H100 Nvidia GPU. While we do mention in the hyperparameter appendices that we used NVIDIA H100s, we did not track or report all the details of computational requirements.
Software Dependencies	No	The paper does not provide specific software names with version numbers. It mentions optimizers like AdamW [47], frameworks like Gymnasium [58, 59, 60, 61], and a PPO implementation based on [64], but without version details.
Experiment Setup	Yes	D.2 Architecture details: We used the following hyperparameters: D = 2048 (the width of z^t and a^t); k = 16 (synapse depth, 8 layers down and 8 layers up); d_input = 512 (the width of attention output, o^t); n_heads = 16; dense pairing for neuron selection (see Appendix C.2); J_out = 32 (width of S^t_out synchronization representation); J_action = 32 (width of S^t_action synchronization representation); T = 75 (internal ticks); M = 25 (FIFO rolling memory input to NLMs); d_hidden = 32 (width of MLPs inside NLMs); p_dropout = 0.1 (dropout probability for synapse model); no positional embedding. We used the following settings for optimization: trained using a batch size of 64 on 1 H100 Nvidia GPU; 1000000 iterations for training using AdamW [47]; a learning rate of 1e-4 with a linear warmup of 10000 iterations and decaying to zero using a cosine annealing learning rate scheduler; no weight decay.
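
The learning-rate schedule quoted in the Experiment Setup row (base rate 1e-4, linear warmup over 10000 iterations, cosine annealing to zero by iteration 1000000) can be illustrated with a small sketch. This is not the authors' code. The function name `learning_rate` and the constant names are hypothetical, and two details are assumptions, since the quoted text does not state them: the warmup starts from zero, and the cosine phase spans the iterations remaining after warmup.

```python
import math

# Values quoted in the Experiment Setup row; the names are illustrative only.
BASE_LR = 1e-4
WARMUP_ITERS = 10_000
TOTAL_ITERS = 1_000_000


def learning_rate(step, base_lr=BASE_LR, warmup_iters=WARMUP_ITERS,
                  total_iters=TOTAL_ITERS):
    """Return the learning rate at a given training iteration.

    Assumption: warmup rises linearly from 0 to base_lr, then a cosine
    curve decays the rate to 0 over the remaining iterations.
    """
    if step < warmup_iters:
        # Linear warmup phase.
        return base_lr * step / warmup_iters
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    progress = (step - warmup_iters) / (total_iters - warmup_iters)
    progress = min(progress, 1.0)
    # Cosine annealing from base_lr down to zero.
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for step in (0, 5_000, 10_000, 505_000, 1_000_000):
        print(step, learning_rate(step))
```

Under these assumptions the rate peaks at 1e-4 exactly when warmup ends and falls to half of that midway through the cosine phase. A framework scheduler may differ in such boundary details, so the sketch is only a reading aid for the quoted settings.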