Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

CALF: Aligning LLMs for Time Series Forecasting via Cross-modal Fine-Tuning

Authors: Peiyuan Liu, Hang Guo, Tao Dai, Naiqi Li, Jigang Bao, Xudong Ren, Yong Jiang, Shu-Tao Xia

AAAI 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on eight real-world datasets demonstrate that CALF achieves state-of-the-art performance on both long-term and short-term time series forecasting tasks, with favorable generalization ability and low computational complexity.
Researcher Affiliation | Academia | Peiyuan Liu1,*, Hang Guo1,*, Tao Dai2, Naiqi Li1, Jigang Bao1, Xudong Ren1, Yong Jiang1, Shu-Tao Xia1,3. 1Tsinghua Shenzhen International Graduate School, 2Shenzhen University, 3Pengcheng Laboratory. EMAIL, EMAIL
Pseudocode | No | The paper describes the methodology using text, mathematical equations (e.g., equations 1-6), and a system diagram (Figure 2), but does not include any explicit pseudocode or algorithm blocks.
Open Source Code | Yes | Code: https://github.com/Hank0626/CALF
Open Datasets | Yes | We conduct experiments on seven widely-used real-world datasets, including the Electricity Transformer Temperature (ETT) dataset with its four subsets (ETTh1, ETTh2, ETTm1, ETTm2), Weather, ECL, and Traffic (Wu et al. 2021). ... We adopt the M4 datasets (Makridakis, Spiliotis, and Assimakopoulos 2018), which comprise univariate marketing data collected yearly, quarterly, and monthly.
Dataset Splits | Yes | The input time series length T is fixed as 96 for a fair comparison, and we adopt four distinct prediction horizons H ∈ {96, 192, 336, 720}. ... In few-shot learning, only a small ratio of the training data is utilized. Specifically, for each dataset, we utilize only the first 10% of the training data.
Hardware Specification | No | The paper mentions 'low computational complexity' and conducts an 'Efficiency Analysis' (Table 5, reporting 'Time (s)'), but it does not specify any particular hardware used for running the experiments, such as GPU or CPU models.
Software Dependencies | No | The paper mentions using a 'pre-trained GPT2 based model' and the 'Adam optimizer', but it does not provide specific version numbers for any software, libraries, or programming languages used in the implementation.
Experiment Setup | Yes | Implementation Details. Following (Zhou et al. 2023), we use pre-trained GPT2 based model (Radford et al. 2019) with the first 6 Transformer layers as our backbone. Optimization is conducted using the Adam optimizer (Kingma and Ba 2014), with a learning rate of 0.0005. For the total loss function, we set the hyper-parameters γ = 0.8, λ1 = 1 and λ2 = 0.01. In terms of loss functions for long-term forecasting, we apply L1 loss across all three loss types for ETT datasets, while for the other three datasets, smooth L1 loss is utilized. For short-term forecasting, we compute supervised loss with SMAPE, modal consistency loss with MASE, and feature regularization loss with smooth L1 loss, respectively.
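The dataset-split row above (fixed input length T = 96, prediction horizons H ∈ {96, 192, 336, 720}, and few-shot training on the first 10% of the training split) can be sketched as follows. This is a minimal illustration of the described protocol, not the CALF repository's actual data loader; the function names and the plain chronological slicing are assumptions.

```python
import numpy as np

def few_shot_train_slice(train: np.ndarray, ratio: float = 0.10) -> np.ndarray:
    """Keep only the first `ratio` fraction of the training split (few-shot setting)."""
    n_keep = int(len(train) * ratio)
    return train[:n_keep]

def sliding_windows(series: np.ndarray, T: int = 96, H: int = 96):
    """Yield (input, target) pairs: T past steps predict the next H steps."""
    for start in range(len(series) - T - H + 1):
        x = series[start : start + T]
        y = series[start + T : start + T + H]
        yield x, y

# Example: a 2000-step univariate series, 10% few-shot training data, horizon 96.
series = np.arange(2000, dtype=np.float32)
train = few_shot_train_slice(series, 0.10)        # first 200 steps
pairs = list(sliding_windows(train, T=96, H=96))  # 200 - 96 - 96 + 1 = 9 windows
```

Slicing the *first* 10% (rather than sampling) keeps the few-shot split chronologically contiguous, which matches the quoted "first 10% of the training data".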
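The short-term objective quoted in the Experiment Setup row combines a SMAPE supervised loss, a MASE modal-consistency loss, and a smooth-L1 feature-regularization loss. A minimal NumPy sketch of that combination is below, using the paper's reported weights λ1 = 1 and λ2 = 0.01. The pairing of each loss with a particular pair of branch outputs is an assumption based on the quoted description, and the excerpt does not explain the role of γ = 0.8, so it is not modeled here.

```python
import numpy as np

def smape(pred, target, eps=1e-8):
    """Symmetric mean absolute percentage error (as a fraction, not a %)."""
    return np.mean(2.0 * np.abs(pred - target) / (np.abs(pred) + np.abs(target) + eps))

def mase(pred, target, insample, m=1, eps=1e-8):
    """Mean absolute scaled error; scale = in-sample seasonal-naive MAE with period m."""
    scale = np.mean(np.abs(insample[m:] - insample[:-m]))
    return np.mean(np.abs(pred - target)) / (scale + eps)

def smooth_l1(a, b, beta=1.0):
    """Huber-style smooth L1 loss, averaged over all elements."""
    d = np.abs(a - b)
    return np.mean(np.where(d < beta, 0.5 * d ** 2 / beta, d - 0.5 * beta))

def short_term_total_loss(pred_time, target, pred_text, feat_text, feat_time,
                          insample, lam1=1.0, lam2=0.01):
    """Weighted sum of the three losses named in the paper.

    Assumption: the consistency term compares the temporal-branch and
    textual-branch predictions, and the regularization term compares
    their intermediate features; the excerpt does not spell this out.
    """
    l_sup = smape(pred_time, target)                  # supervised loss (SMAPE)
    l_cons = mase(pred_time, pred_text, insample)     # modal consistency loss (MASE)
    l_feat = smooth_l1(feat_text, feat_time)          # feature regularization (smooth L1)
    return l_sup + lam1 * l_cons + lam2 * l_feat
```

For long-term forecasting the same weighted sum applies, but with L1 (ETT datasets) or smooth L1 (Weather, ECL, Traffic) substituted for all three terms, per the quoted implementation details.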