Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Improving Time Series Forecasting via Instance-aware Post-hoc Revision

Authors: Zhiding Liu, Mingyue Cheng, Guanhao Zhao, Jiqian Yang, Qi Liu, Enhong Chen

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments on real-world datasets with mainstream forecasting models demonstrate that PIR effectively mitigates instance-level errors and significantly improves forecasting reliability. Our code is available.
Researcher Affiliation	Academia	State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China
Pseudocode	No	The paper describes the methodology using mathematical equations and textual descriptions, but it does not include any explicitly labeled pseudocode blocks or algorithms.
Open Source Code	Yes	Our code is available. Footnote 2: https://github.com/icantnamemyself/PIR
Open Datasets	Yes	For the long-term forecasting task, we conduct experiments on a widely recognized benchmark dataset that includes eight real-world datasets spanning diverse domains [54, 21]. Additionally, we incorporate the PEMS dataset, which contains four subsets, for the short-term forecasting task [29]. The exogenous information used in these datasets are the available timestamps. We also conduct experiments on datasets with additional textual descriptions in the Appendix. Following standard experimental protocols, we split each dataset into training, validation, and testing sets in chronological order. The split ratios are set to 6:2:2 for the ETT dataset and 7:1:2 for the other datasets. Detailed information about the datasets is available in the Appendix. [...] ETT: The dataset records oil temperature and load metrics from electricity transformers, tracked between July 2016 and July 2018. It is subdivided into four mini-datasets, with data sampled either hourly or every 15 minutes. Electricity: The dataset captures the hourly electricity consumption in kWh of 321 clients, monitored from July 2016 to July 2019. Solar: The dataset records the solar power production in the year of 2006, which is sampled every 10 minutes from 137 PV plants in Alabama State. Weather: The dataset records the 21 weather indicators, including air temperature and humidity every 10 minutes from the Weather Station of the Max Planck Biogeochemistry Institute in 2020. Traffic: The dataset provides the hourly traffic volume data describing the road occupancy rates of San Francisco freeways, recorded by 862 sensors. PEMS: The dataset is a series of traffic flow dataset with four subsets, including PEMS03, PEMS04, PEMS07, and PEMS08. The traffic information is recorded at a rate of every 5 minutes by multiple sensors. Energy and Health: These two datasets are subsets of Time-MMD [27], a multimodal time series dataset that ensures fine-grained alignment between textual and numerical modalities. The datasets are collected weekly, spanning from 1996 and 1997 up to May 2024, respectively.
Dataset Splits	Yes	Following standard experimental protocols, we split each dataset into training, validation, and testing sets in chronological order. The split ratios are set to 6:2:2 for the ETT dataset and 7:1:2 for the other datasets.
Hardware Specification	Yes	All experiments are conducted on a single NVIDIA RTX 4090 GPU.
Software Dependencies	No	We employ the ADAM optimizer as the default optimization algorithm across all experiments and evaluate performance using two metrics: mean squared error (MSE) and mean absolute error (MAE). For the PIR framework, the retrieval number K is tuned from the set {10, 20, 50}, and the weight hyperparameter λ is fixed at 1. All experiments are conducted on a single NVIDIA RTX 4090 GPU. [...] The results indicate that the retrieval stage introduces negligible additional latency on both datasets, thanks to the GPU-parallelizable nature of cosine similarity. For even larger datasets, the total computational cost can be further reduced by applying sampling strategies (e.g., stride sampling) or dimensionality reduction techniques to limit the search space. In contrast, the LSH-based retrieval implemented with the faiss library yields significantly higher inference time without performance gains, indicating that brute-force cosine similarity is both more efficient and effective in our current implementation.
Experiment Setup	Yes	We employ the ADAM optimizer as the default optimization algorithm across all experiments and evaluate performance using two metrics: mean squared error (MSE) and mean absolute error (MAE). For the PIR framework, the retrieval number K is tuned from the set {10, 20, 50}, and the weight hyperparameter λ is fixed at 1. All experiments are conducted on a single NVIDIA RTX 4090 GPU. [...] Following the standard evaluation protocol [62, 56], we set the input series length L_in = 96 across all datasets. For a unified comparison, the target series length L_out is set to {12, 24, 36, 48} for the PEMS dataset and {96, 192, 336, 720} for the remaining datasets.
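
Illustrative sketch (not taken from the paper or its repository): the chronological split ratios and the two evaluation metrics quoted in the table can be written out as below. The function names chronological_split, mse, and mae are hypothetical, and the code assumes a plain ratio-based cut over a time-ordered NumPy array; the actual benchmark loaders may place the boundaries differently, for example to account for the input window length.

```python
import numpy as np


def chronological_split(series, ratios):
    """Split a time-ordered array into train/val/test parts without shuffling.

    `ratios` is a 3-tuple of positive integers such as (6, 2, 2) or (7, 1, 2).
    This is a simplified assumption, not the paper's data loader.
    """
    if len(ratios) != 3 or any(r <= 0 for r in ratios):
        raise ValueError("ratios must be three positive integers")
    n = len(series)
    total = sum(ratios)
    # Integer arithmetic keeps the cut points exact for divisible lengths.
    train_end = n * ratios[0] // total
    val_end = n * (ratios[0] + ratios[1]) // total
    return series[:train_end], series[train_end:val_end], series[val_end:]


def mse(pred, target):
    """Mean squared error over all elements."""
    return float(np.mean((np.asarray(pred) - np.asarray(target)) ** 2))


def mae(pred, target):
    """Mean absolute error over all elements."""
    return float(np.mean(np.abs(np.asarray(pred) - np.asarray(target))))


if __name__ == "__main__":
    toy = np.arange(100)  # stand-in for a time-ordered series
    for name, ratios in [("ETT-style 6:2:2", (6, 2, 2)), ("other 7:1:2", (7, 1, 2))]:
        train, val, test = chronological_split(toy, ratios)
        print(name, len(train), len(val), len(test))
```

Because the cut is made by position rather than by random sampling, every validation and test point is later in time than every training point, which is the property the "chronological order" wording in the table refers to.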