Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Hybrid Latent Representations for PDE Emulation

Authors: Ali Can Bekar, Siddhant Agarwal, Christian Hüttig, Nicola Tosi, David Greenberg

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	7 Experiments. Cost vs. Accuracy of Neural Emulators. We first evaluated HREs and baselines on emulation tasks at target resolution 32² (r = 64). Where possible, we trained emulators at multiple rollout and encoding resolutions to obtain multiple points on their cost-accuracy curves (Fig. 1d). For all tasks, at 64² rollout resolution HRE rollouts were equally or more correlated to reference simulations than all baselines. Similarly, HREs with 32² encoding resolution were more accurate than all baselines except mUnet at 64×64 rollout resolution, which was slower. Only mUnet matched or approached the accuracy of HREs on any task, though DilResNet was sometimes comparable for shorter rollouts. When comparing methods at a common rollout or encoding resolution of 32², the gap between HRE and baselines widened: HREs had 30-50% lower RMSE after 64 time steps (Table 1).
Researcher Affiliation	Academia	Ali Can Bekar (Helmholtz Centre Hereon, Geesthacht, Germany); Siddhant Agarwal (Helmholtz Centre Hereon, Geesthacht, Germany); Christian Hüttig (Institute of Space Research, German Aerospace Center (DLR), Berlin, Germany); Nicola Tosi (Institute of Space Research, German Aerospace Center (DLR), Berlin, Germany); David S. Greenberg (Helmholtz Centre Hereon, Geesthacht, Germany)
Pseudocode	No	The paper describes methods and architectures in text and mathematical equations (e.g., Section 3, Section 4), but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	Code is available at https://github.com/alicanbekar/hres.
Open Datasets	No	We generated 5 datasets with numerical solvers chosen to respect physical laws of the integrated PDEs. Each dataset employs a fixed time step δ longer than the solver's internal time step; we choose δ based on previous studies where possible (details in appendix C). Table 10 lists all model-dataset combinations we employed. We generate 128 experiments using 2048×2048 cells for the solver. There is no explicit statement of open access for these generated datasets.
Dataset Splits	Yes	Training, validation and testing data consisted of nonoverlapping sets of samples drawn from the same distributions (we use non-overlapping sets of random initialization keys for the training, validation, and testing datasets when generating data from JAX solvers), except in cases where we specifically measure generalization of trained emulators to out-of-distribution data (Table 9). (From Table 4): VALIDATION FRACTION 0.1, # TRAJECTORIES TEST 20, # TRAJECTORIES TRAIN 200.
Hardware Specification	Yes	For the ID case at rollout resolution 64² and rollout length 16, mUnet requires 7 hours to train versus 9.5 hours for HRE on 8 A100s with batch size 16 per GPU. DINo trains in 7 hours on a single V100 at the same input resolution with batch size 32. Finetuning for DPOT-S takes 1 hour on an H100.
Software Dependencies	Yes	We use the official FNO implementation from [Kossaifi et al., 2024] (version 0.3.0), with 4 Fourier layers including lifting, filtering and projection, and hidden layers with 128 channels.
Experiment Setup	Yes	We use the AdamW optimizer [Loshchilov and Hutter, 2019] except on DINo which uses Adam [Kingma and Ba, 2015]. A cosine scheduler decays learning rate from 10⁻⁴ to 10⁻⁶ [Loshchilov and Hutter, 2017] for our HREs, mUnet, FNO and DilResNet, while DINo and FactFormer used their original schedulers. Further training details appear in appendix A.1. (From Table 2): BATCH SIZE (PER GPU) 8/16, INITIAL LEARNING RATE 1e-4, OPTIMIZER ADAMW, WEIGHT DECAY 0.01, LR SCHEDULER COSINE 1e-6, CURRICULUM [1, 2, 4, 8, 16], TRAINING EPOCHS 100, EARLY STOPPING (75 EP.) YES.
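
The cosine learning-rate decay quoted in the Experiment Setup row (from 1e-4 down to 1e-6) can be illustrated with a minimal sketch. This is not the authors' code: the function name `cosine_lr`, the per-epoch stepping, and the absence of warmup or restarts are assumptions made for illustration. Only the two endpoint values and the 100 training epochs come from the quoted text.

```python
import math


def cosine_lr(step: int, total_steps: int,
              lr_init: float = 1e-4, lr_final: float = 1e-6) -> float:
    """Cosine-annealed learning rate at a given step (hypothetical helper).

    lr_init and lr_final follow the values quoted above; stepping once per
    epoch with no warmup is an assumption, not something the source states.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp so that steps outside [0, total_steps] hold the endpoint values.
    progress = min(max(step, 0), total_steps) / total_steps
    # Half-cosine interpolation between lr_init (start) and lr_final (end).
    return lr_final + 0.5 * (lr_init - lr_final) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Print the schedule at a few epochs of an assumed 100-epoch run.
    for epoch in (0, 25, 50, 75, 100):
        print(epoch, cosine_lr(epoch, 100))
```

The schedule starts at the initial rate, passes through the midpoint of the two rates halfway through training, and ends at the final rate. In practice a library scheduler would normally be used instead of a hand-written function.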