Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

TimeCMA: Towards LLM-Empowered Multivariate Time Series Forecasting via Cross-Modality Alignment

Authors: Chenxi Liu, Qianxiong Xu, Hao Miao, Sun Yang, Lingzheng Zhang, Cheng Long, Ziyue Li, Rui Zhao

AAAI 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on eight real datasets demonstrate that TimeCMA outperforms state-of-the-arts. Experiments Datasets. We conduct experiments on eight datasets: ETTm1, ETTm2, ETTh1, ETTh2 (Zeng et al. 2023), ECL (Asuncion and Newman 2007), FRED-MD (McCracken and Ng 2016), ILI and Weather (Wu et al. 2021). Baselines and Evaluation. We evaluate seven baseline models across five categories:... Main Results. Table 1 illustrates the average performance of TimeCMA, which outperforms all baselines in all cases. Ablation Studies of Model Design. Fig. 3 indicates the ablation studies of model design...
Researcher Affiliation | Collaboration | Chenxi Liu1, Qianxiong Xu1*, Hao Miao2, Sun Yang3, Lingzheng Zhang4, Cheng Long1, Ziyue Li5, Rui Zhao6. 1S-Lab, Nanyang Technological University, Singapore; 2Aalborg University, Denmark; 3Peking University, China; 4HKUST (Guangzhou), China; 5University of Cologne, Germany; 6SenseTime Research, China. EMAIL, EMAIL, EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode | No | The paper describes the methodology with detailed explanations of 'Dual-Modality Encoding', 'Cross-Modality Alignment', and 'Time Series Forecasting' sections, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code: https://github.com/ChenxiLiu-HNU/TimeCMA
Open Datasets | Yes | Datasets. We conduct experiments on eight datasets: ETTm1, ETTm2, ETTh1, ETTh2 (Zeng et al. 2023), ECL (Asuncion and Newman 2007), FRED-MD (McCracken and Ng 2016), ILI and Weather (Wu et al. 2021).
Dataset Splits | No | The paper mentions using eight datasets (ETTm1, ETTm2, ETTh1, ETTh2, ECL, FRED-MD, ILI, Weather) and specifies 'The input sequence length is 36 for the Illness and FRED datasets and 96 for others.' and 'The test batch size is set to 1 for all methods to guarantee fairness during testing.' However, it does not provide specific training, validation, and test dataset splits (e.g., percentages, sample counts, or explicit predefined splits).
Hardware Specification | Yes | Each experiment is repeated at least three times with different seeds on NVIDIA A100 GPUs.
Software Dependencies | No | The paper mentions using 'GPT-2 as the LLM' and refers to Transformer-based models, but does not specify version numbers for any key software components or libraries (e.g., Python, PyTorch, TensorFlow, scikit-learn, etc.).
Experiment Setup | Yes | The input sequence length is 36 for the Illness and FRED datasets and 96 for others. The test batch size is set to 1 for all methods to guarantee fairness during testing. Each experiment is repeated at least three times with different seeds on NVIDIA A100 GPUs. The evaluation metrics are mean square error (MSE) and mean absolute error (MAE). ... To ensure fairness of memory, we set the training batch size to 8, thus each iteration has 8 samples. The loss function of TimeCMA contains two parts: a prediction loss L_pre and a regularization loss L_reg. We combine them and the overall loss is L_task = L_pre + λ·L_reg, where λ is a weight to trade off the prediction and regularization losses. We use Mean Squared Error as the prediction loss, i.e., L_pre = (1/M) Σ_{m=1}^{M} (X̂_m − X_m)², and L_reg is L2 regularization.
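The quoted loss formulation can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `task_loss`, the parameter list `params`, and the default weight `lam` are all assumptions for demonstration; the paper does not report the λ value it used.

```python
import numpy as np

def task_loss(pred, target, params, lam=1.0):
    """Sketch of L_task = L_pre + lambda * L_reg.

    pred, target: arrays of forecasts and ground-truth values.
    params: iterable of model parameter arrays (for the L2 penalty).
    lam: trade-off weight (illustrative default; not from the paper).
    """
    # L_pre: Mean Squared Error between forecasts and ground truth
    l_pre = np.mean((pred - target) ** 2)
    # L_reg: L2 regularization, the sum of squared parameter values
    l_reg = sum(np.sum(p ** 2) for p in params)
    return l_pre + lam * l_reg
```

For example, with forecasts [1, 2] against targets [1, 4] and a single parameter 0.5, L_pre is 2.0, L_reg is 0.25, and the combined loss at λ = 1 is 2.25.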