Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Multi-Scale Finetuning for Encoder-based Time Series Foundation Models

Authors: Zhongzheng Qiao, Chenghao Liu, Yiming Zhang, Ming Jin, Quang Pham, Qingsong Wen, Ponnuthurai Suganthan, Xudong Jiang, Savitha Ramasamy

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experimental results on three different backbones (MOIRAI, MOMENT and UNITS) demonstrate that TSFMs finetuned with MSFT not only outperform naive and typical parameter efficient finetuning methods but also surpass state-of-the-art deep learning methods.
Researcher Affiliation	Collaboration	1Nanyang Technological University. 2Institute for Infocomm Research, A*STAR. 3CNRS@CREATE. 4Salesforce AI Research. 5Griffith University. 6Squirrel Ai Learning. 7Qatar University.
Pseudocode	Yes	For clarity, we provide the PyTorch-like pseudocode of MSFT in Algorithm 1 and Algorithm 2, illustrating the overall training pipeline and the MSFT attention block described in Section 4.
Open Source Code	Yes	Codes are available at https://github.com/zqiao11/MSFT.
Open Datasets	Yes	For long sequence forecasting (LSF), we conduct experiments on six well-established datasets, including the ETT datasets (ETTh1, ETTh2, ETTm1, ETTm2) [51], Weather [45], and Electricity [45]. We note that these datasets are not included in the pretraining datasets of the TSFMs we evaluated. The key properties of these LSF datasets are detailed in Table 4. Following Moirai [43], we use 5 out-of-distribution datasets for probabilistic forecasting: Electricity [38], Solar-Power [16], Jena Weather, Istanbul Traffic, and Turkey Power. Detailed descriptions of these datasets are provided in Table 5.
Dataset Splits	Yes	We create the training, validation, and test datasets by cropping time series windows with fixed sequence lengths. Given the context and prediction lengths, samples are segmented using a sliding window, where the window size is C + H. The train-val-test split follows the default LSF setup. Data are normalized for LSF but not for PF. ... The test set comprises the final time steps, segmented into multiple non-overlapping evaluation windows. The length of the prediction window and the number of rolling evaluations are tailored for each dataset based on its frequency (see Table 5 for details).
Hardware Specification	Yes	Our experiments are conducted on a server equipped with an AMD EPYC 7763 CPU (64 cores, 128 threads) and four NVIDIA A40 GPUs, each with 40 GB of memory. ... The experiments in this section are exclusively conducted on another server equipped with a 12 vCPU Intel(R) Xeon(R) Platinum 8352V CPU @ 2.10GHz and a single RTX 3080 GPU with 20 GB of memory.
Software Dependencies	No	We use the AdamW optimizer with weight decay=0.1, β1 = 0.9, and β2 = 0.98 for optimization. ... For Moirai and Moment, we directly adopt the PEFT library [22] for both LoRA and AdaLoRA. ... For Moirai, the evaluation is based on the GluonTS Library [1].
Experiment Setup	Yes	We use the AdamW optimizer with weight decay=0.1, β1 = 0.9, and β2 = 0.98 for optimization. Specifically, unlike pretraining, which uses a learning rate of 1e-3, we find that finetuning requires a much smaller learning rate. Based on validation performance, we select a learning rate of either 5e-6 or 5e-7 for finetuning our models. The batch size is set to 512 by default for experiments using MOIRAI-Small, and reduced to 256 on MOIRAI-Base if GPU memory reaches its limit. We adopt a constant learning rate schedule, and early stopping is employed to monitor training. The context lengths are used directly from the values in the original Moirai models, which are tuned from a range of [1000, 2000, 3000, 4000, 5000]. The patch sizes are also taken from their provided values, which are selected based on data frequency. For Moment and UNITS, we directly follow their original finetuning configurations for experiments, with the learning rate selected from 5e-5, 5e-6, or 5e-7.
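
The Dataset Splits row describes cropping samples with a sliding window of size C + H, where C is the context length and H is the prediction length. The following is a minimal sketch of that idea in plain Python, not the authors' code. The function name `sliding_windows` and the `stride` parameter are hypothetical and introduced only for illustration. Setting `stride` equal to the prediction length is one assumed way to obtain evaluation windows whose prediction segments do not overlap.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[List[float], List[float]]


def sliding_windows(
    series: Sequence[float],
    context_len: int,
    pred_len: int,
    stride: int = 1,
) -> List[Sample]:
    """Split a series into (context, target) pairs of total length C + H.

    Hypothetical helper for illustration; `stride` is an assumption and is
    not a parameter named in the source.
    """
    if context_len <= 0 or pred_len <= 0 or stride <= 0:
        raise ValueError("context_len, pred_len and stride must be positive")

    window = context_len + pred_len  # window size C + H
    samples: List[Sample] = []
    # The last valid start index leaves exactly one full window.
    for start in range(0, len(series) - window + 1, stride):
        context = list(series[start : start + context_len])
        target = list(series[start + context_len : start + window])
        samples.append((context, target))
    return samples


# Toy series of 10 steps with C = 4 and H = 2.
series = list(range(10))

# Overlapping windows (stride 1), as one might use for training samples.
train_samples = sliding_windows(series, context_len=4, pred_len=2)

# Stride equal to H, so the prediction segments do not overlap.
eval_samples = sliding_windows(series, context_len=4, pred_len=2, stride=2)
```

With a stride of 1, consecutive samples share all but one time step, which yields many training windows from a short series. A stride equal to the prediction length makes each target segment cover a distinct span of time steps.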