Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

TimeXL: Explainable Multi-modal Time Series Prediction with LLM-in-the-Loop

Authors: Yushan Jiang, Wenchao Yu, Geon Lee, Dongjin Song, Kijung Shin, Wei Cheng, Yanchi Liu, Haifeng Chen

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical evaluations on four real-world datasets demonstrate that TimeXL achieves up to 8.9% improvement in AUC and produces human-centric, multi-modal explanations, highlighting the power of LLM-driven reasoning for time series prediction. [...] Experiments on four real-world benchmarks show that TimeXL consistently outperforms baselines, achieving up to a 8.9% improvement in AUC while providing faithful, human-centric multi-modal explanations.
Researcher Affiliation | Collaboration | (1) School of Computing, University of Connecticut; (2) Data Science & System Security Department, NEC Labs America; (3) Kim Jaechul Graduate School of AI, KAIST
Pseudocode | Yes | Algorithm 1: Iterative Optimization Loop of TimeXL
Open Source Code | Yes | This paper provides code and dataset in the supplementary materials with instructions.
Open Datasets | Yes | We use the Weather and Healthcare datasets in TimeCAP [65], and a Finance dataset extended from [12]. Footnotes: (3) https://www.kaggle.com/datasets/selfishgene/historical-hourly-weather-data (4) https://www.indexmundi.com/commodities (5) https://www.cdc.gov/fluview/overview/index.html
Dataset Splits | Yes | All datasets are split 6:2:2 for training, validation, and testing.
Hardware Specification | Yes | We conducted all the experiments on a TensorEX server with 2 Intel Xeon Gold 5218R Processor (each with 20 Core), 512GB memory, and 4 RTX A6000 GPUs (each with 48 GB memory).
Software Dependencies | Yes | We employed the gpt-4o-2024-08-06 version for GPT-4o in OpenAI API by default. We use the parameters max_tokens=2048, top_p=1, and temperature=0.7 for content generation (self-reflection and text refinement), and 0.3 for prediction. We keep the same setting for Gemini-2.0-Flash and GPT-4o-mini due to the best empirical performance.
Experiment Setup | Yes | The numbers of time series prototypes and text prototypes are k ∈ {5, 10, 15, 20} and k ∈ [5, 10], respectively. The hyperparameters controlling regularization strengths are λ1, λ2, λ3 ∈ [0.1, 0.3] with interval 0.05 for individual modality, d_min ∈ {1.0, 1.5, 2.0} for time series, d_min ∈ {3.0, 3.5, 4.0} for text. Learning rate for multi-modal encoder ∈ {0.0001, 0.0003, 0.001}, using Adam [84] as the optimizer. The number of case-based explanations fed to prediction LLM ω ∈ {3, 5, 8, 10}.
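
The Dataset Splits entry reports a 6:2:2 split for training, validation, and testing. A minimal sketch of such a split follows. It assumes the samples are already in order and are cut into contiguous blocks, which the entry above does not state; the function name `split_622` is hypothetical and not taken from the paper's code.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_622(samples: Sequence[T]) -> Tuple[List[T], List[T], List[T]]:
    """Cut samples into contiguous 60% / 20% / 20% blocks.

    Assumption: order is preserved (no shuffling). The source only
    states the 6:2:2 ratio, not how the boundaries are chosen.
    """
    n = len(samples)
    train_end = n * 6 // 10  # integer arithmetic avoids float rounding
    val_end = n * 8 // 10
    items = list(samples)
    return items[:train_end], items[train_end:val_end], items[val_end:]


train, val, test = split_622(range(10))
print(len(train), len(val), len(test))
```

When the sample count is not a multiple of ten, the integer boundaries push the remainder into the later blocks, so the ratio is only approximate.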
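
The Experiment Setup entry lists candidate values for the time series encoder. The sketch below enumerates them as a full Cartesian grid. Whether the authors searched the full grid or only a subset is not stated, so exhaustive enumeration is an assumption here, and the dictionary keys are hypothetical names chosen for illustration.

```python
from itertools import product

# lambda values in [0.1, 0.3] with interval 0.05, rounded to avoid float drift
lambdas = [round(0.1 + 0.05 * i, 2) for i in range(5)]

# Candidate values reported for the time series modality.
# Key names are illustrative, not taken from the paper's code.
grid = {
    "num_prototypes": [5, 10, 15, 20],
    "lambda1": lambdas,
    "lambda2": lambdas,
    "lambda3": lambdas,
    "d_min": [1.0, 1.5, 2.0],
    "learning_rate": [0.0001, 0.0003, 0.001],
}

# One dict per combination; assumes an exhaustive grid search.
configs = [dict(zip(grid, values)) for values in product(*grid.values())]

print(len(configs))
print(configs[0])
```

The text modality would use its own prototype counts and d_min ∈ {3.0, 3.5, 4.0} in place of the time series values.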
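
The Software Dependencies entry gives the sampling parameters used for the LLM calls. The sketch below collects them into request keyword arguments without making any network call. The helper name `llm_request_kwargs` and the task labels are hypothetical; only the model version and parameter values come from the entry above.

```python
from typing import Any, Dict

# Shared settings reported in the Software Dependencies entry.
BASE_PARAMS: Dict[str, Any] = {
    "model": "gpt-4o-2024-08-06",
    "max_tokens": 2048,
    "top_p": 1,
}

# Temperature depends on the task: 0.7 for content generation
# (self-reflection and text refinement), 0.3 for prediction.
# The task labels are illustrative names, not from the paper.
TEMPERATURES: Dict[str, float] = {
    "generation": 0.7,
    "prediction": 0.3,
}


def llm_request_kwargs(task: str) -> Dict[str, Any]:
    """Return keyword arguments for one LLM request of the given task type."""
    if task not in TEMPERATURES:
        raise ValueError(f"unknown task: {task!r}")
    return {**BASE_PARAMS, "temperature": TEMPERATURES[task]}


print(llm_request_kwargs("prediction"))
```

These keyword arguments could then be passed, together with the messages, to a chat completion call in whichever client library is used; the exact call depends on the client version and is not shown in the source.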