Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Investigating Hallucinations of Time Series Foundation Models through Signal Subspace Analysis

Authors: Yufeng Zou, Zijian Wang, Diego Klabjan, Han Liu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments in 5.2 demonstrate that while the forecasting performance of TSFMs suffers from hallucinations, our test-time intervention effectively mitigates hallucinations and improves the quality of forecasts, yielding up to 6.62% reduction in the hallucination rate, 93.83% gain in R², and 13.52% gain in correlation. Moreover, the signal strength measure we propose has strong predictive power of both hallucinations and forecasting performance of the model. Our work contributes to deeper understanding of TSFM trustworthiness that could foster future research in this direction.
Researcher Affiliation	Academia	Department of Computer Science, Department of Industrial Engineering and Management Sciences, Department of Statistics and Data Science, Northwestern University; School of Computer Science, The University of Sydney
Pseudocode	Yes	C SSIM Algorithm. Algorithm 1 details the full procedure of SSIM. Algorithm 1: SSIM (Signal Subspace Intervention through Magnification)
Open Source Code	No	Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [NA] Justification: Not applicable.
Open Datasets	Yes	We adopt real-world datasets from the GIFT-Eval [3] benchmark covering various domains. We take a fixed number of final observations from each time series, dividing them into context and ground truth of fixed lengths.
Dataset Splits	Yes	Each dataset is randomly split into validation (20%) and test (80%) sets.
Hardware Specification	Yes	All experiments are conducted on the Ubuntu 22.04.4 LTS operating system, 16 Intel(R) Core(TM) i7-7820X CPUs, and 4 NVIDIA GeForce RTX 2080 Ti GPUs, with the framework of Python 3.11.9 and PyTorch 1.12.1.
Software Dependencies	Yes	All experiments are conducted on the Ubuntu 22.04.4 LTS operating system, 16 Intel(R) Core(TM) i7-7820X CPUs, and 4 NVIDIA GeForce RTX 2080 Ti GPUs, with the framework of Python 3.11.9 and PyTorch 1.12.1. We adopt the OLS and ARMA [6] implementations in the statsmodels package. We adopt the STFT [18] implementation in the SciPy package, with unsymmetrical Parzen windows and hop=1.
Experiment Setup	Yes	We set the context length to 500 and the forecasting horizon to 64 for zero-shot time series forecasting in our main experiments, using the base versions of Chronos and Chronos-Bolt together with TimesFM-2.0. As Chronos produces probabilistic forecasts, we set the number of decoding samples to 1 and fix the random seed to ensure reproducibility. We set the frequency configuration of TimesFM to 0. For the hallucination check, we set the tolerance thresholds δ of the trend, frequency, pattern, and ARMA rules to 0.25, 0.5, 0.5, and 0.25, respectively, based on validation. For SSIM, we perform grid search for the proportion of selected top neurons ϵ ∈ {0.1, 0.2, 0.3, 0.4, 0.5} and set it to 0.1 for Chronos and TimesFM and 0.2 for Chronos-Bolt based on validation. For baseline methods, we denoise input time series using the mean of sliding windows of size 5. We perturb input time series with Gaussian noise with a standard deviation of 0.05 times that of the input for 10 runs with different random seeds.
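
The data preparation quoted in the Open Datasets, Dataset Splits, and Experiment Setup rows can be sketched roughly as follows. This is an illustrative NumPy sketch, not the authors' code (the Open Source Code row reports that none is released). The function names, the seeds, the choice to split at the level of whole series, and the edge padding used in the sliding-window mean are all assumptions introduced here; only the numeric settings (context 500, horizon 64, 20%/80% split, window 5, noise scale 0.05, 10 runs) come from the quoted text.

```python
import numpy as np

CONTEXT_LEN = 500  # context length quoted in the setup
HORIZON = 64       # forecasting horizon quoted in the setup


def split_tail(series, context_len=CONTEXT_LEN, horizon=HORIZON):
    """Take the final context_len + horizon points and split them in two."""
    series = np.asarray(series, dtype=float)
    total = context_len + horizon
    if series.size < total:
        raise ValueError("series is shorter than context_len + horizon")
    tail = series[-total:]
    return tail[:context_len], tail[context_len:]


def split_validation_test(n_items, val_fraction=0.2, seed=0):
    """Randomly assign item indices to validation and test sets.

    Assumption: the split is over whole series; the seed is hypothetical.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_items)
    n_val = int(round(val_fraction * n_items))
    return order[:n_val], order[n_val:]


def moving_average_denoise(context, window=5):
    """Sliding-window mean that keeps the input length.

    Assumption: a centered window with edge padding; the quoted text
    only states the window size.
    """
    context = np.asarray(context, dtype=float)
    left = window // 2
    right = window - 1 - left
    padded = np.pad(context, (left, right), mode="edge")
    kernel = np.ones(window) / window
    return np.convolve(padded, kernel, mode="valid")


def perturb(context, scale=0.05, n_runs=10, base_seed=0):
    """Yield n_runs noisy copies with noise std = scale * std(context).

    Assumption: one seed per run, derived from a hypothetical base_seed.
    """
    context = np.asarray(context, dtype=float)
    sigma = scale * context.std()
    for run in range(n_runs):
        rng = np.random.default_rng(base_seed + run)
        yield context + rng.normal(0.0, sigma, size=context.shape)
```

For a series of at least 564 observations, split_tail returns a 500-point context and a 64-point ground truth; the denoised or perturbed context would then be passed to the forecasting model in place of the raw context.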