Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
CoastalBench: A Decade-Long High-Resolution Dataset to Emulate Complex Coastal Processes
Authors: Zelin Xu, Yupu Zhang, Tingsong Xiao, Maitane Olabarrieta Lizaso, Jose M. Gonzalez-Ondina, Zibo Liu, Shigang Chen, Zhe Jiang
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluated a customized Vision Transformer model that takes initial and boundary conditions and external forcings and predicts ocean variables at varying lead times. The dataset provides an opportunity to benchmark novel deep learning models for high-resolution coastal simulations (e.g., physics-informed machine learning, neural operator learning). Through experimental results, we demonstrate the promising performance of the deep learning model by comparing it with ROMS simulations. An ablation study validates the importance of components, while a scaling test analyzes the impact of model size on predictive accuracy. |
| Researcher Affiliation | Academia | 1Department of Computer & Information Science & Engineering, University of Florida, Gainesville, FL, USA 2Department of Civil & Coastal Engineering, University of Florida, Gainesville, FL, USA. Correspondence to: Zhe Jiang <EMAIL>. |
| Pseudocode | No | The paper describes the model architecture and training strategy in prose and diagrams (Figure 2), but no explicit pseudocode or algorithm blocks are provided. |
| Open Source Code | Yes | The code and dataset can be accessed at https://github.com/spatialdatasciencegroup/CoastalBench. |
| Open Datasets | Yes | To fill this gap, we introduce a decade-long, high-resolution (<100m) coastal circulation modeling dataset on a real-world 3D mesh in southwest Florida with around 6 million cells. The dataset contains key oceanography variables (e.g., current velocities, free surface level, temperature, salinity) alongside external atmospheric and river forcings. We release CoastalBench, a decade-long, high-resolution dataset designed for modeling complex coastal processes. The code and dataset can be accessed at https://github.com/spatialdatasciencegroup/CoastalBench. |
| Dataset Splits | Yes | We split the dataset into training, validation, and test sets chronologically. Specifically, the first eight years of data are used for training, while the 9th and 10th years are used for validation and testing, respectively. |
| Hardware Specification | Yes | Training is conducted on 8 NVIDIA A100 80GB GPUs using PyTorch's Distributed Data Parallel framework, with a per-GPU batch size of 1. Results show that our ViT model reduces the runtime of ROMS for a 72-hour forecast from 2,477 seconds (with 512 CPU cores on AMD EPYC 7742 64-Core Processors) to 34.14 seconds (on a single A100 GPU), achieving over a 70× speedup. |
| Software Dependencies | No | Training is conducted on 8 NVIDIA A100 80GB GPUs using PyTorch's Distributed Data Parallel framework, with a per-GPU batch size of 1. While PyTorch is mentioned, no specific version number or other software dependencies with version numbers are provided. |
| Experiment Setup | Yes | The model follows the ViT-Base (Dosovitskiy et al., 2020) configuration with a hidden dimension of 768, depth of 12, and 12 attention heads, using a patch size of 4. Evaluation metrics include Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and Pearson Correlation Coefficient (r) to assess prediction accuracy and correlation with ground truth. Training is conducted on 8 NVIDIA A100 80GB GPUs using PyTorch's Distributed Data Parallel framework, with a per-GPU batch size of 1. The training process consists of two stages: initial training with one-step prediction, followed by fine-tuning with K = 4 autoregressive steps for improved long-term stability. Lead times are sampled from {0.5, 3, 12} hours to enhance generalization across different temporal scales. |
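The chronological split and ViT-Base configuration reported above can be sketched as follows. This is a minimal illustration, not code from the CoastalBench repository: `chronological_split`, `VIT_BASE_CONFIG`, and `LEAD_TIME_HOURS` are hypothetical names, and the assumption of an equal number of snapshots per year is ours.

```python
# Hedged sketch of the dataset split and model configuration described in the
# report. Assumes a time-ordered sequence of snapshots spanning 10 years with
# an equal number of snapshots per year (an illustrative simplification).

def chronological_split(snapshots, years=10, train_years=8):
    """Split time-ordered snapshots by year: first 8 years for training,
    the 9th year for validation, and the 10th year for testing."""
    per_year = len(snapshots) // years
    train = snapshots[: train_years * per_year]
    val = snapshots[train_years * per_year : (train_years + 1) * per_year]
    test = snapshots[(train_years + 1) * per_year :]
    return train, val, test

# ViT-Base hyperparameters as stated in the experiment setup.
VIT_BASE_CONFIG = {
    "hidden_dim": 768,
    "depth": 12,
    "num_heads": 12,
    "patch_size": 4,
}

# Lead times (in hours) sampled during training per the setup description.
LEAD_TIME_HOURS = [0.5, 3, 12]

if __name__ == "__main__":
    # Toy example: 10 "years" of 100 snapshots each.
    data = list(range(1000))
    train, val, test = chronological_split(data)
    print(len(train), len(val), len(test))  # 800 100 100
```

The key property the sketch preserves is that the split is strictly chronological: every validation snapshot follows all training snapshots, and every test snapshot follows all validation snapshots, so no future information leaks into training.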