Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Learning with Calibration: Exploring Test-Time Computing of Spatio-Temporal Forecasting

Authors: Wei Chen, Yuxuan Liang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on real-world datasets demonstrate the effectiveness, universality, flexibility and efficiency of our proposed method.
Researcher Affiliation | Academia | Wei Chen, Yuxuan Liang INTR & DSA Thrust, The Hong Kong University of Science and Technology (Guangzhou) EMAIL, EMAIL
Pseudocode | Yes | For clarity, we provide an Algorithm workflow 1 and Pytorch-Style Pseudocode 2 in Appendix C.1. For clarity, we provide an Algorithm workflow 3 and Pytorch-Style Pseudocode 4 in Appendix C.2.
Open Source Code | Yes | Our code repository is available at https://github.com/Onedean/ST-TTC.
Open Datasets | Yes | We employ publicly available benchmark datasets widely used in the literature to cover typical spatio-temporal forecasting scenarios in the traffic domain (PEMS-03, PEMS-04, PEMS-07, PEMS-08 [65]), the meteorological domain (KnowAir [79]), and the energy domain (UrbanEV [39]). In addition, we also leverage the traffic-speed benchmark METR-LA [40], the large-scale spatiotemporal benchmark LargeST [48], and dynamic-stream benchmarks (Energy-Stream, Air-Stream, PEMS-Stream [10]) to assess our methods across varied settings and learning paradigms.
Dataset Splits | Yes | Unless otherwise specified, all datasets are chronologically split into training, validation and test sets in a 6:2:2 ratio. For a more detailed description of each dataset, see Appendix D.1.
Hardware Specification | Yes | All experiments are conducted on a Linux server equipped with 1 AMD EPYC 7763 128-Core Processor CPU (256GB memory) and 4 NVIDIA RTX A6000 (48GB memory) GPUs.
Software Dependencies | No | The paper mentions "Pytorch-Style Pseudocode 2" and "Pytorch-Style Pseudocode 4", implying the use of PyTorch, but does not provide specific version numbers for PyTorch, Python, or CUDA, which are necessary for reproducible software dependency details.
Experiment Setup | Yes | For our paper, except for the robustness study section, all other experimental hyper-parameters are set uniformly: the learning rate lr is set to 1e-4, the memory-queue sample count n used for updating is 1, and the number of groups m is set to 4. To ensure fairness, each experiment is repeated five times, with results reported as mean ± standard deviation (denoted in gray). For more protocol details, see Appendix D.3. ... All experiments are conducted on a Linux server equipped with 1 AMD EPYC 7763 128-Core Processor CPU (256GB memory) and 4 NVIDIA RTX A6000 (48GB memory) GPUs. To carry out benchmark testing experiments, all baselines are set to run for a duration of 100 to 150 epochs by default (depending on the corresponding paper), with specific timings contingent upon the method with early stop mechanism. The number of early stopping steps is set to 10.
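
The chronological 6:2:2 split and the five-run mean and standard deviation reporting quoted in the table can be illustrated with a minimal Python sketch. This is not taken from the ST-TTC repository: the function names `chronological_split` and `summarize_runs` are hypothetical, the integer rounding at the split boundaries is an assumption, and the use of the sample (rather than population) standard deviation is also an assumption, since the quoted text does not specify either.

```python
from statistics import mean, stdev


def chronological_split(series, ratios=(6, 2, 2)):
    """Split a time-ordered sequence into train/val/test without shuffling.

    Hypothetical helper: boundaries are floored to whole indices, which is
    an assumption and may differ from the authors' implementation.
    """
    total = sum(ratios)
    n = len(series)
    train_end = n * ratios[0] // total
    val_end = n * (ratios[0] + ratios[1]) // total
    # Earlier time steps go to training, later ones to validation and test.
    return series[:train_end], series[train_end:val_end], series[val_end:]


def summarize_runs(scores):
    """Return (mean, sample standard deviation) over repeated runs.

    Assumes the sample standard deviation; the source does not say which
    estimator is used.
    """
    return mean(scores), stdev(scores)


# Placeholder time index, not real data.
timesteps = list(range(100))
train, val, test = chronological_split(timesteps)
print(len(train), len(val), len(test))
```

Because the split is purely positional, every test time step comes strictly after every training time step, which is the property a chronological split is meant to guarantee for forecasting benchmarks.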