Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

SEMPO: Lightweight Foundation Models for Time Series Forecasting

Authors: Hui He, Kun Yi, Yuanchi Ma, Qi Zhang, Zhendong Niu, Guansong Pang

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments on two large-scale benchmarks covering 16 datasets demonstrate the superior performance of SEMPO in both zero-shot and few-shot forecasting scenarios compared with state-of-the-art methods.
Researcher Affiliation	Academia	(1) Beijing Institute of Technology, (2) Singapore Management University, (3) State Information Center, (4) Tongji University
Pseudocode	No	The paper describes the methodology in detail using mathematical formulations and block diagrams (Figure 2), but does not present any structured pseudocode or algorithm blocks.
Open Source Code	Yes	Code and data are available at https://github.com/mala-lab/SEMPO.
Open Datasets	Yes	Extensive experiments on two large-scale benchmarks covering 16 datasets demonstrate the superior performance of SEMPO... Datasets. Leveraging large-scale publicly available time series collection UTSD [19]... For zero-/few-shot forecasting, we evaluate SEMPO on the Time-Series-Library (TSLib) benchmark [25]... We also include GIFT-Eval [49]... (Tables 4 and 5 list specific datasets with sources such as Monash [57] and UCR [58].)
Dataset Splits	Yes	We then set the pre-training training-validation split to 9:1, following [19]. For few-shot forecasting, we consider two training scenarios using 5% and 10% of the training data, respectively.
Hardware Specification	Yes	Using 83M pre-training datasets, the entire two-stage pre-training process takes 10 hours on 4 A6000-48G GPUs with BF32 precision and a batch size of 2,048. SEMPO and all other baselines are conducted on 4 NVIDIA A6000-48G GPUs.
Software Dependencies	Yes	Our model is implemented in Pytorch 2.1.2 with Python 3.10 and all the experiments are run on 4 A6000-48G GPUs.
Experiment Setup	Yes	By default, we set layer_number S = 6, head_number=16, latent_dimension Dp = 256, patch_size Lp = 64, prompt_number I = 128, and mask_number NM = 4... Regarding optimization, we use AdamW optimizer with hyperparameters: learning_rate=1e-3, weight_decay=0.1, β1 = 0.9, β2 = 0.95... During energy-aware pre-training, the model is trained for 10 epochs, with a batch size of 2,048. MoP tuning is performed for 20 epochs, with the same batch size of 2,048. For few-shot and zero-shot settings, the batch size is reduced to 32. Early stopping is applied with a patience of 6.
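
The values quoted in the Dataset Splits and Experiment Setup rows can be collected into a single configuration object, which may help when comparing them against the released code. The following is a minimal sketch, not the authors' implementation: the class name `SempoSetup`, all field names, and the helpers `split_sizes` and `few_shot_size` are hypothetical and chosen only for illustration. The helpers compute sample counts only; how the paper selects the few-shot subset (for example, first portion versus random sample) is not stated in the rows above, so no selection logic is assumed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SempoSetup:
    """Hyperparameters quoted in the Experiment Setup row.

    Field names are illustrative, not taken from the SEMPO code base.
    """
    num_layers: int = 6                # layer_number S
    num_heads: int = 16                # head_number
    latent_dim: int = 256              # latent_dimension D_p
    patch_size: int = 64               # patch_size L_p
    num_prompts: int = 128             # prompt_number I
    num_masks: int = 4                 # mask_number N_M
    learning_rate: float = 1e-3        # AdamW learning rate
    weight_decay: float = 0.1          # AdamW weight decay
    betas: tuple = (0.9, 0.95)         # AdamW (beta1, beta2)
    pretrain_epochs: int = 10          # energy-aware pre-training
    mop_tuning_epochs: int = 20        # MoP tuning
    pretrain_batch_size: int = 2048    # both pre-training stages
    downstream_batch_size: int = 32    # few-shot and zero-shot settings
    early_stopping_patience: int = 6


def split_sizes(n_samples: int, train_parts: int = 9, val_parts: int = 1) -> tuple:
    """Return (n_train, n_val) for a train:validation ratio such as 9:1."""
    n_train = n_samples * train_parts // (train_parts + val_parts)
    return n_train, n_samples - n_train


def few_shot_size(n_train: int, percent: int) -> int:
    """Number of training samples kept when using `percent` % of the data."""
    return n_train * percent // 100


if __name__ == "__main__":
    cfg = SempoSetup()
    n_train, n_val = split_sizes(1000)
    print(cfg)
    print("train/val:", n_train, n_val)
    print("few-shot 5% / 10%:", few_shot_size(n_train, 5), few_shot_size(n_train, 10))
```

Integer arithmetic is used for the ratios so that the counts do not depend on floating-point rounding.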