Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
ZAPBench: A Benchmark for Whole-Brain Activity Prediction in Zebrafish
Authors: Jan-Matthis Lueckmann, Alexander Immer, Alex Chen, Peter Li, Mariela Petkova, Nirmala Iyer, Luuk Hesselink, Aparna Dev, Gudrun Ihrke, Woohyun Park, Alyson Petruncio, Aubrey Weigel, Wyatt Korff, Florian Engert, Jeff Lichtman, Misha Ahrens, Michal Januszewski, Viren Jain
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Here, we introduce the Zebrafish Activity Prediction Benchmark (ZAPBench) to measure progress on the problem of predicting cellular-resolution neural activity throughout an entire vertebrate brain. The benchmark is based on a novel dataset containing 4d light-sheet microscopy recordings of over 70,000 neurons in a larval zebrafish brain, along with motion stabilized and voxel-level cell segmentations of these data that facilitate development of a variety of forecasting methods. Initial results from a selection of time series and volumetric video modeling approaches achieve better performance than naive baseline methods, but also show room for further improvement. We established a training/validation/test split within the data, and defined and implemented a reference evaluation scheme. |
| Researcher Affiliation | Collaboration | 1 Google Research, 2 Harvard University, 3 HHMI Janelia, 4 Radboud University. Correspondence to EMAIL, EMAIL. Google Research is an industry affiliation. Harvard University and Radboud University are academic affiliations. HHMI Janelia is a research institute that often collaborates with both. The presence of both 'Google Research' (industry) and 'Harvard University'/'Radboud University' (academia) indicates a collaborative affiliation. |
| Pseudocode | No | The paper describes various methods like linear models, TiDE, TSMixer, Time-Mix, and U-Net using mathematical equations and textual descriptions, for example, 'f_Linear(a_{1:C,n}; φ) = â_{C+1:C+H,n} (5)'. However, it does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps in a code-like format outside of mathematical formulations. |
| Open Source Code | Yes | Public release of all relevant code, including a web-based viewer for interactively visualizing whole-brain activity at single cell resolution, and training code for all discussed models, available at: google-research.github.io/zapbench. Datasets, all relevant code for the benchmark, and interactive visualizations are available through a dedicated project website: google-research.github.io/zapbench. The code repository for zapbench is at: github.com/google-research/zapbench. |
| Open Datasets | Yes | Here, we introduce the Zebrafish Activity Prediction Benchmark (ZAPBench) to measure progress on the problem of predicting cellular-resolution neural activity throughout an entire vertebrate brain. The benchmark is based on a novel dataset containing 4d light-sheet microscopy recordings... ZAPBench provides the full activity data and associated code to make computational modeling of this data highly accessible. Public release of all relevant code... available at: google-research.github.io/zapbench. Raw datasets as well as postprocessed versions are each terabyte-sized. We host them on cloud storage in a format that allows streaming access. Views can be accessed at: google-research.github.io/zapbench. |
| Dataset Splits | Yes | We established a training/validation/test split within the data... We divide the dataset by stimulus condition, splitting each condition into 70% training data, 10% validation data for model selection, and 20% test data for evaluation. We completely hold-out one condition, TAXIS, and only use it for testing (see Fig. 3B). |
| Hardware Specification | No | The paper describes the experimental setup for data acquisition, mentioning a 'light-sheet microscope' and 'camera (Orca Flash 4.0 v2, Hamamatsu)'. However, it does not specify any hardware details (like GPU/CPU models, memory) used for running the computational experiments or training the models described. |
| Software Dependencies | No | All forecasting models used in this paper were implemented in JAX (Bradbury et al., 2018). We implemented custom data loaders on top of Grain (Grain developers, 2024), which are easily usable with frameworks such as JAX and PyTorch (Ansel et al., 2024). While the paper mentions software frameworks like JAX, Grain, and PyTorch, it only provides publication years for their respective foundational papers (e.g., '(Bradbury et al., 2018)') rather than specific version numbers of the software used (e.g., 'PyTorch 1.9'). Therefore, it does not meet the requirement of providing specific version numbers for key software components. |
| Experiment Setup | Yes | For the TiDE model (Das et al., 2023), we use a hidden layer size of 128, 2 encoder and decoder layers, a decoder output dimensionality of 32, and no layer or reversible instance norm. We use the AdamW (Loshchilov & Hutter, 2017) optimizer with a learning rate of 10^-3, weight decay of 10^-4, and early stopping on the validation set loss. For TSMixer (Chen et al., 2023), we selected an architecture with 2 blocks, MLP dimension of 256, and no instance norm for C = 4, and 2 blocks, MLP dimension 128, and reversible instance norm for C = 256. We use the AdamW (Loshchilov & Hutter, 2017) optimizer with a learning rate of 10^-3, weight decay of 10^-4, and early stopping on the validation set loss. The U-Net downsamples the input video by a factor of 4 in XY... We use three-dimensional convolutions throughout the network... three residual blocks at each resolution, except at the lowest resolution where we use four, and fix 128 features throughout the U-Net... We use the AdamW (Loshchilov & Hutter, 2017) optimizer with a learning rate of 10^-4 decayed to 10^-7 over 500,000 steps for C = 4 and 250,000 steps for C = 256. |
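The per-condition split reported in the Dataset Splits row (70% training, 10% validation, 20% test along the time axis of each stimulus condition) can be sketched as follows. This is a minimal illustration, not the benchmark's actual code; the function name and exact boundary handling are assumptions.

```python
def split_condition(n_timesteps: int) -> dict:
    """Split one stimulus condition's timesteps into 70/10/20
    train/val/test ranges, as described in the paper.

    Hypothetical helper; rounding at the boundaries is an assumption.
    """
    train_end = int(0.7 * n_timesteps)
    val_end = int(0.8 * n_timesteps)  # next 10% for model selection
    return {
        "train": range(0, train_end),
        "val": range(train_end, val_end),
        "test": range(val_end, n_timesteps),  # final 20% for evaluation
    }

splits = split_condition(1000)
```

Note that the held-out TAXIS condition described in the paper bypasses this split entirely and is used only for testing.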
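The U-Net learning-rate schedule quoted above (10^-4 decayed to 10^-7 over 500,000 steps for C = 4) can be sketched as below. The paper does not state the shape of the decay; a linear schedule is assumed here purely for illustration, and the function name is hypothetical.

```python
def decayed_lr(step: int, init: float = 1e-4, end: float = 1e-7,
               total_steps: int = 500_000) -> float:
    """Learning rate at a given step, decayed from `init` to `end`.

    Linear decay is an assumption; the paper only specifies the start
    value, end value, and total number of steps (500k for C = 4,
    250k for C = 256).
    """
    frac = min(step / total_steps, 1.0)  # clamp after the decay window
    return init + frac * (end - init)
```

For the time-series models (TiDE, TSMixer), the quoted setup instead uses a constant learning rate of 10^-3 with weight decay 10^-4 and early stopping, so no schedule is needed there.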