Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

A Generalist Intracortical Motor Decoder

Authors: Joel Ye, Fabio Rizzoglio, Xuan Ma, Adam Smoulder, Hongwei Mao, Gary Blumenthal, William Hockeimer, Nicolas Kunigk, Dalton Moore, Patrick Marino, Raeed Chowdhury, J. Patrick Mayo, Aaron Batista, Steven Chase, Michael Boninger, Charles Greenspon, Andrew B Schwartz, Nicholas Hatsopoulos, Lee E. Miller, Kristofer Bouchard, Jennifer Collinger, Leila Wehbe, Robert A. Gaunt

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We quantify this premise for motor decoding from intracortical microelectrode data, pretraining an autoregressive Transformer on 2000 hours of neural population spiking activity paired with diverse motor covariates from over 30 monkeys and humans. The resulting model is broadly useful, benefiting decoding on 8 downstream decoding tasks and generalizing to a variety of neural distribution shifts. However, we also highlight that scaling autoregressive Transformers seems unlikely to resolve limitations stemming from sensor variability and output stereotypy in neural datasets.
Researcher Affiliation	Academia	1 Carnegie Mellon University; 2 University of Pittsburgh; 3 Northwestern University; 4 University of Chicago; 5 University of Kansas Medical Center; 6 Lawrence Berkeley National Laboratory
Pseudocode	No	The paper describes methods and model architecture using text and figures (e.g., Fig. 2 for NDT3 Data and Model Design), but does not contain a clearly labeled pseudocode or algorithm block.
Open Source Code	Yes	Code: https://github.com/joel99/ndt3
Open Datasets	Yes	Pretraining datasets were comprised of historical data from several labs, the rough composition of which is shown in Fig. 2B. The evaluation behavior used during pretraining was reaching in 2 monkeys. The first monkey dataset came from a public release [74]... Self-paced reach (RTT) [28] Monkeys reach for random targets one at a time in a small planar workspace. We decode 2D arm velocity in monkey Indy. Has neural data from M1 and S1, we use M1 in Section 3.1 and Section 3.2 and S1 in Section 3.3.
Dataset Splits	Yes	Our main evaluation (Section 3.1) uses four human and four monkey datasets sampling varied upper limb movements... Each session has unique variability, so one fine-tuning strategy... We instead fine-tune NDT3 jointly over data combined from multiple evaluation sessions... The best learning rate is chosen based on the average validation score over three random seeds, and we report the average score on a separate test split. ... For each evaluation dataset, we specify a tuning split and an evaluation split. Only tuning split data is changed when varying data scale. Tuning and evaluation splits are block-contiguous, i.e. trials are not interleaved, for better downstream applicability.
Hardware Specification	Yes	The 200 hour, 45M model trains for 480 A100-hours while the 2000 hour (2 khr) 350M model takes 20K A100-hours. ... With mixed-precision training, the 350M parameter NDT3 can fit the 4-8K tokens in each input context in the memory of 40G NVIDIA A100 GPUs. ... On the NERSC Perlmutter cluster, with FA2, 45M NDT3 trained at about 270M neural tokens per 40G A100 hour, 350M NDT3 trained at about 70M neural tokens per A100 hour. ... On an NVIDIA 4090, we see mean inference times of 4ms for the 45M parameter NDT3 and 9ms for the 350M parameter NDT3.
Software Dependencies	No	Flash Attention 2 [85] is used to increase training and inference speeds. ... The metric we specifically use is the coefficient of determination, R², as computed by scikit-learn's r2_score function. While specific software is mentioned, version numbers are not provided for scikit-learn or Flash Attention 2 itself.
Experiment Setup	Yes	Pretraining is stopped early according to validation loss or terminated at a maximum of 400 epochs. The 200 hour, 45M model trains for 480 A100-hours while the 2000 hour (2 khr) 350M model takes 20K A100-hours. ... We tune all deep networks (NDT2 and NDT3) over 3 learning rates. This hyperparameter sweep is limited for computational tractability. Importantly, the same sweep is used for all tasks... The explored LRs are 1e-4, 3e-4, 5e-4 for training from scratch and 3e-5, 1e-4, 4e-4 for fine-tuning. While this is far from an exhaustive search, we show in Fig. 17 that other regularization hyperparameters are set to reasonable defaults such that this sweep finds near optimal results for both a from scratch model and fine-tuning the 45M model. Fine-tuning, like pretraining, is early stopped with a patience of 100 epochs. Batch size is uniformly set to 16K in pretraining, and scaled to be roughly 10-20% of dataset size in fine-tuning.
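
The Dataset Splits and Experiment Setup rows describe an evaluation protocol: block-contiguous tuning and evaluation splits, a three-value learning-rate sweep, and selection of the best learning rate by average validation score over three random seeds, scored with R². The sketch below illustrates that protocol in plain Python and is not taken from the NDT3 repository. The function names, the 20% evaluation fraction, the choice to place the tuning block first, and the example validation scores are all assumptions made for illustration. The paper computes R² with scikit-learn's r2_score; the helper here is a single-output stand-in using the standard definition.

```python
from statistics import mean

# Fine-tuning learning rates quoted in the Experiment Setup row.
FINE_TUNE_LRS = (3e-5, 1e-4, 4e-4)


def block_contiguous_split(n_trials, eval_fraction=0.2):
    """Split trial indices into two contiguous, non-interleaved blocks.

    eval_fraction and the tuning-first ordering are assumptions; the
    source only states that the splits are block-contiguous.
    """
    n_eval = int(round(n_trials * eval_fraction))
    cut = n_trials - n_eval
    return list(range(cut)), list(range(cut, n_trials))


def r2(y_true, y_pred):
    """Coefficient of determination for one output dimension."""
    y_mean = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def select_learning_rate(val_scores_by_lr):
    """Pick the learning rate with the highest mean validation score.

    val_scores_by_lr maps each learning rate to its per-seed scores.
    """
    return max(val_scores_by_lr, key=lambda lr: mean(val_scores_by_lr[lr]))


if __name__ == "__main__":
    tuning_idx, eval_idx = block_contiguous_split(10)
    print("tuning trials:", tuning_idx)
    print("evaluation trials:", eval_idx)

    # Hypothetical validation R² values, three seeds per learning rate.
    example_scores = {
        3e-5: [0.50, 0.60, 0.55],
        1e-4: [0.70, 0.65, 0.72],
        4e-4: [0.60, 0.80, 0.50],
    }
    print("selected LR:", select_learning_rate(example_scores))
```

Selecting on the mean across seeds, rather than on the single best run, keeps one lucky seed (such as the 0.80 in the hypothetical scores above) from deciding the learning rate. The test score is then reported for the selected learning rate only.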