Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Searching Latent Program Spaces

Authors: Matthew Macfarlane, Clem Bonnet

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	5 Experiments. Table 1 (rows: training method; columns: inference method, in the order Grad 0, Grad 1, Grad 5, Grad 20, Grad 100, Sample 250). Grad 0: 3.2 (2.7), 3.6 (3.0), 18.8 (14.4), 52.5 (25.0), 67.5 (20.0), 3.2 (2.7). Grad 1: 8.6 (4.4), 44.6 (10.9), 85.4 (7.6), 98.4 (1.4), 99.5 (0.5), 10.2 (5.3). Grad 1: 0.6 (0.1), 13.7 (3.0), 60.2 (7.5), 88.9 (6.0), 94.1 (3.8), 0.7 (0.2). Grad 5: 0.0 (0.0), 0.4 (0.3), 31.9 (11.2), 88.5 (11.9), 98.1 (2.1), 0.5 (0.4). Sample 5: 6.1 (4.4), 8.2 (6.5), 27.7 (21.6), 56.3 (27.5), 72.2 (21.2), 6.1 (4.4). Table 1: Ablation of LPN training and inference methods on the Pattern task, reporting exact match accuracy (%). Rows/columns represent different training/inference methods, differing only in the latent optimization. Grad [N] stands for N gradient ascent steps, Sample [N] for N samples from the encoder distribution without leveraging any gradients, and Grad 1 means that the decoder parameter gradient flows through the latent optimization. Training was performed for 20k steps with 3 seeds, aggregating performance as mean (and standard deviation in brackets) over the 3 runs. Bold values indicate the best training method for each inference regime. See expanded table in Section B.1.
Researcher Affiliation	Academia	Matthew V. Macfarlane¹, Clément Bonnet. ¹University of Amsterdam. Equal contribution.
Pseudocode	Yes	C LPN Algorithm. Below we outline two algorithms: first, LPN test-time inference (Algorithm 1) and its mechanism for performing inductive inference. Second, we provide the full algorithm for LPN during training (Algorithm 2). Algorithm 1: LPN Test-Time Inference with Gradient Ascent Latent Optimization. Require: $n$ input-output pairs $(x_i, y_i)$, a test input $x_{n+1}$, number of gradient steps $K$. 1: for $i = 1, \dots, n$ do (can be done in parallel); 2: sample $z_i \sim q_\phi(z \mid x_i, y_i)$; 3: end for; 4: initialize latent $z' \leftarrow \frac{1}{n} \sum_{i=1}^{n} z_i$; 5: for $k = 1, \dots, K$ do (perform gradient ascent); 6: $z' \leftarrow z' + \alpha \nabla_z \sum_{i=1}^{n} \log p_\theta(y_i \mid x_i, z) \big\rvert_{z=z'}$; 7: end for; 8: return $y_{n+1} \sim p_\theta(y \mid x_{n+1}, z')$. Algorithm 2: LPN Training with Gradient Ascent Latent Optimization. Require: encoder parameters $\phi$, decoder parameters $\theta$. 1: for $t = 1, \dots,$ num_training_steps do; 2: sample $n$ input-output pairs $(x_i, y_i)$ from the same program; 3: for $i = 1, \dots, n$ do (can be done in parallel); 4: sample $z_i \sim q_\phi(z \mid x_i, y_i)$ (using the reparameterization trick); 5: end for; 6: for $i = 1, \dots, n$ do (can be done in parallel); 7: $z'_i \leftarrow \frac{1}{n-1} \sum_{j=1, j \neq i}^{n} z_j$; 8: for $k = 1, \dots, K$ do (perform gradient ascent in the latent space); 9: $z'_i \leftarrow z'_i + \alpha \nabla_z \sum_{j=1, j \neq i}^{n} \log p_\theta(y_j \mid x_j, z) \big\rvert_{z=z'_i}$ (optional stop-grad on the 2nd term); 10: end for; 11: $L_i \leftarrow -\log p_\theta(y_i \mid x_i, z'_i) + \beta \, D_{\mathrm{KL}}(q_\phi(z \mid x_i, y_i) \,\Vert\, \mathcal{N}(0, I))$; 12: end for; 13: $L \leftarrow \frac{1}{n} \sum_{i=1}^{n} L_i$ (total loss for all pairs); 14: update $\phi$ and $\theta$ via gradient descent on $L$; 15: end for.
Open Source Code	Yes	5. Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: Code is provided with scripts for all experiments.
Open Datasets	Yes	By controlling for prior knowledge and experience, the Abstraction and Reasoning Corpus (ARC-AGI) [Chollet, 2019] is a benchmark that measures skill acquisition efficiency rather than pure skills. We train on the re-arc dataset [Hodel, 2024], designed to be in distribution relative to the ARC training set (which we don't train on) Li et al. [2024b].
Dataset Splits	Yes	We train on the re-arc dataset [Hodel, 2024], designed to be in distribution relative to the ARC training set (which we don't train on) Li et al. [2024b]. The evaluation set is significantly OOD, representing a challenging generalization experiment. In Figure 3 we evaluate in-context learning, TTT, and LPN with and without gradient search in the OOD setting of the pattern task, with varying specification sizes.
Hardware Specification	Yes	We train a 178M-parameter LPN with a 256-dim latent space for 100k steps for 2 days on a TPU v4-32, see Section E for full architecture details.
Software Dependencies	No	The paper does not explicitly list specific software dependencies with version numbers in the main text or supplementary material. It mentions RoPE [Su et al., 2024] and provides a URL 'https://github.com/crowsonkb/rope-flax', but this does not constitute a list of key software components with version numbers required to reproduce the experiments.
Experiment Setup	Yes	E Hyperparameters. In this section, we outline our approach to hyperparameter search and provide full documentation of all the hyperparameters used in all reported experiments. Table columns: Component, Hyperparameter, Value. Encoder Transformer: Number of Layers 0; Number of Heads 6; Embedding Dimension per Head 16; Latent Dimension 32; RoPE False. Decoder Transformer: Number of Layers 3; Number of Heads 6; Embedding Dimension per Head 16; MLP Dimension Factor 1.0; RoPE False. Number of Parameters 829k; Training Steps 10k; Batch Size 128; Optimizer AdamW; Gradient Clipping Norm 1.0; Learning Rate 4e-4; Number of Rows & Columns 30, 30. Table 12: Hyperparameters for the experiments from Section B.3, validating the decoder.
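
The test-time procedure quoted in the Pseudocode row (Algorithm 1) can be illustrated with a small toy sketch. This is not the authors' implementation: it assumes a 1-D latent, a Gaussian decoder p(y | x, z) = Normal(x + z, 1) whose gradient is available in closed form, and a hypothetical, deliberately biased encoder (`encoder_mean`) whose mean is used in place of a sample so the run is deterministic. All function names and the step size are illustrative choices.

```python
# Toy sketch of LPN test-time inference (Algorithm 1); not the authors' code.
# Assumptions: 1-D latent, decoder p(y | x, z) = Normal(x + z, 1), and a
# biased stand-in encoder whose mean replaces the sampling step.

def encoder_mean(x, y):
    """Hypothetical encoder: underestimates the true offset y - x by half."""
    return 0.5 * (y - x)

def grad_log_likelihood(pairs, z):
    """d/dz of sum_i log Normal(y_i; x_i + z, 1), i.e. sum_i (y_i - x_i - z)."""
    return sum(y - x - z for x, y in pairs)

def lpn_infer(pairs, x_test, num_steps, step_size=0.1):
    # Steps 1-4: encode every pair and average the latents.
    z = sum(encoder_mean(x, y) for x, y in pairs) / len(pairs)
    # Steps 5-7: gradient ascent on the summed log-likelihood of the pairs.
    for _ in range(num_steps):
        z = z + step_size * grad_log_likelihood(pairs, z)
    # Step 8: decode the test input (decoder mean rather than a sample).
    return x_test + z, z

pairs = [(1.0, 4.0), (2.0, 5.0), (-1.0, 2.0)]  # hidden program: y = x + 3
pred_no_search, z0 = lpn_infer(pairs, x_test=10.0, num_steps=0)
pred_search, zk = lpn_infer(pairs, x_test=10.0, num_steps=20)
```

In this toy setting the encoder alone yields a latent that is off by a factor of two, and the gradient steps pull it toward the value that best explains the given pairs, which mirrors the role the latent search plays in the quoted algorithm.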