Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Measuring and Controlling Solution Degeneracy across Task-Trained Recurrent Neural Networks

Authors: Ann Huang, Satpreet Harcharan Singh, Flavio Martinelli, Kanaka Rajan

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Here, we develop a unified framework to systematically quantify and control solution degeneracy across three levels: behavior, neural dynamics, and weight space. We apply this framework to 3,400 RNNs trained on four neuroscience-relevant tasks: flip-flop memory, sine wave generation, delayed discrimination, and path integration, while systematically varying task complexity, learning regime, network size, and regularization. We find that higher task complexity and stronger feature learning reduce degeneracy in neural dynamics but increase it in weight space, with mixed effects on behavior. In contrast, larger networks and structural regularization reduce degeneracy at all three levels. These findings empirically validate the Contravariance Principle and provide practical guidance for researchers seeking to tune the variability of RNN solutions, either to uncover shared neural mechanisms or to model the individual variability observed in biological systems.
Researcher Affiliation	Academia	Ann Huang1,2,3, Satpreet H. Singh2,3, Flavio Martinelli2,3,4, Kanaka Rajan2,3 1Harvard University 2Harvard Medical School 3Kempner Institute 4EPFL EMAIL
Pseudocode	No	The paper describes mathematical equations for the RNN update rule and definitions for metrics (e.g., d_DSA, d_PIF, KA), and detailed task descriptions, but it does not contain any clearly labeled "Pseudocode" or "Algorithm" blocks with structured steps.
Open Source Code	Yes	The code is attached as part of the supplemental materials, and we have provided documentation on how to run it.
Open Datasets	No	The paper describes four neuroscience-relevant tasks (N-Bit Flip-Flop, Delayed Discrimination, Sine Wave Generation, Path Integration) and the Lorenz 96 dynamical system. For these tasks, the paper details the parameters and generation methods for the inputs and target outputs, implying that the data for the experiments is generated or simulated by the authors according to these specifications, rather than using external, pre-existing publicly available datasets. For example, for Lorenz 96, it states "We simulated trajectories from the Lorenz 96 dynamical system [39]".
Dataset Splits	Yes	In all experiments, we train networks until they reach a near-asymptotic, task-specific mean-squared error (MSE) threshold on the training set (see Appendix B), after which we allow a patience period of 3 epochs and stop training to measure degeneracy. This early-stopping criterion ensures that networks trained on the same task achieve comparable final losses before any degeneracy analysis. We define a novel metric for behavioral degeneracy as the variability in network responses to out-of-distribution (OOD) inputs. We quantify OOD performance as the mean squared error of all converged networks that achieved near-asymptotic training loss under a temporal generalization condition. For the Delayed Discrimination task, we doubled the delay period. For all other tasks, we doubled the length of the entire trial to assess generalization under extended temporal contexts.
Hardware Specification	Yes	Each experiment was allocated 5 NVIDIA V100/A100 GPUs, 32 CPU cores, 256 GB of RAM, and a 4-hour wall-clock limit, for a total compute cost of approximately 68 000 GPU-hours.
Software Dependencies	No	The paper mentions software components like "Adam optimizer", "Backpropagation Through Time (BPTT) [29]", "Dynamical Similarity Analysis (DSA) [40]", "Singular Vector Canonical Correlation Analysis (SVCCA) [41]", and the "Procrustes Python package [99]" but does not provide specific version numbers for any of these, nor for any programming languages or deep learning frameworks.
Experiment Setup	Yes	All networks are trained using supervised learning with the Adam optimizer without weight decay. Learning rates are tuned per task (Appendix B). For each task, we train 50 RNNs with 128 hidden units. Weights are initialized from the uniform distribution U(-1/√n, 1/√n) and hidden states are initialized to zero. In all experiments, we train networks until they reach a near-asymptotic, task-specific mean-squared error (MSE) threshold on the training set (see Appendix B), after which we allow a patience period of 3 epochs and stop training to measure degeneracy. This early-stopping criterion ensures that networks trained on the same task achieve comparable final losses before any degeneracy analysis. Appendix B: Training details. B.1 N-Bit Flip-Flop training hyperparameters: Optimizer Adam; Learning rate 0.001; Learning rate scheduler None; Max epochs 300; Steps per epoch 128; Batch size 256; Early stopping threshold 0.001; Patience 3; Time constant (µP) 1. Similar detailed tables are provided for the Delayed Discrimination, Sine Wave Generation, and Path Integration tasks.
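
The early-stopping criterion quoted in the Experiment Setup row can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `train_with_threshold_patience` and the `run_epoch` callback are hypothetical, and it assumes that "patience" means training continues for a fixed number of extra epochs after the training loss first reaches the threshold. The default values are taken from the N-Bit Flip-Flop hyperparameters listed above.

```python
from typing import Callable, Optional, Tuple


def train_with_threshold_patience(
    run_epoch: Callable[[int], float],  # hypothetical: trains one epoch, returns training MSE
    threshold: float = 0.001,           # "Early stopping threshold" (N-Bit Flip-Flop)
    patience: int = 3,                  # "Patience" (N-Bit Flip-Flop)
    max_epochs: int = 300,              # "Max epochs" (N-Bit Flip-Flop)
) -> Tuple[int, bool]:
    """Run epochs until the loss threshold is reached, then `patience` more.

    Returns (epochs_run, converged). Assumption: the patience period starts
    at the first epoch whose training loss is at or below the threshold.
    """
    first_hit: Optional[int] = None
    for epoch in range(1, max_epochs + 1):
        loss = run_epoch(epoch)
        # Record the first epoch at which the threshold is reached.
        if first_hit is None and loss <= threshold:
            first_hit = epoch
        # Stop once `patience` further epochs have been trained.
        if first_hit is not None and epoch - first_hit >= patience:
            return epoch, True
    # Threshold never reached (or patience not completed) within max_epochs.
    return max_epochs, False


if __name__ == "__main__":
    # Toy stand-in for a real training epoch: a geometrically decaying loss.
    epochs_run, converged = train_with_threshold_patience(lambda e: 0.1 * 0.5 ** e)
    print(epochs_run, converged)
```

Under this reading, a network that never reaches the threshold within the epoch budget is reported as not converged, which would correspond to excluding it from the degeneracy analysis of "converged networks".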