Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

Unified Transferability Metrics for Time Series Foundation Models

Authors: Weiyang Zhang, Xinyang Chen, Xiucheng Li, Kehai Chen, Weili Guan, Liqiang Nie

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through comprehensive benchmarking across 5 distinct downstream tasks, our method demonstrates superior capability in identifying optimal pre-trained models from heterogeneous model pools for transfer learning. Compared to the state-of-the-art method ETran, our approach improves the weighted Kendall's τ_w across 5 downstream tasks by 35%.
Researcher Affiliation | Academia | (1) School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen); (2) School of Information Science and Technology, Harbin Institute of Technology (Shenzhen)
Pseudocode | Yes | Algorithm 1 Power Iteration. Input: matrix H, initial vector v_0, number of iterations k (here set to 10). Output: dominant eigenvalue λ_max, corresponding eigenvector v_k. Initialize v_0 randomly; for i = 1 to k do: v_{i+1} = H v_i; normalize v_{i+1} to unit length; end for. Estimate the dominant eigenvalue λ_max = (v_k^T H v_k) / (v_k^T v_k).
Open Source Code | Yes | The code is available at https://github.com/TEMPLATE.
Open Datasets | Yes | Specifically, We verify all methods on 9 multivariate datasets from the UEA classification archive [29]... For long-term forecasting, we use seven widely recognized long-term time series forecasting datasets [31]... For short-term forecasting, we adopt the M4 dataset [49]... We compare 5 widely used anomaly detection benchmarks: SMD [50], MSL [51], SMAP [51], SWaT [52], and PSM [53].
Dataset Splits | Yes | We provide detailed descriptions of the datasets in Table 8. For all 5 downstream tasks, we follow the experimental setup of [34]. ... The dataset size is organized in (Train, Validation, Test). Example table row: ETTm1, ETTm2; 7; {96, 192, 336, 720}; (34465, 11521, 11521); Electricity (15 mins).
Hardware Specification | Yes | The fine-tuning experiments of the pre-trained models were conducted on an NVIDIA H20 GPU with 96GB of memory. ... All the results of pre-trained model transferability evaluation metrics were obtained on an AMD EPYC 7513 32-Core CPU.
Software Dependencies | No | The paper does not explicitly state version numbers for the software dependencies or libraries used in its experiments. It mentions that the authors 'fine-tuned the pre-trained models through hyperparameter grid search' and 'follow the experimental setup of [34]', but it does not detail the software stack with versions.
Experiment Setup | Yes | To compute the transfer performance values, we carefully fine-tuned the pre-trained models through hyperparameter grid search. As [54] highlighted, learning rate and weight decay are the two most critical parameters. Therefore, we performed grid search over learning rates and weight decay values (6 learning rates ranging from 10^-3 to 10^-5, and 3 weight decay values from 10^-3 to 10^-5) to select the optimal hyperparameters.
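
For readers who want to see the Power Iteration routine quoted in the Pseudocode entry in executable form, the following is a minimal sketch in plain Python. It is not the authors' implementation: the function name `power_iteration`, the helper functions, the `seed` parameter, and the list-of-lists matrix representation are assumptions made for illustration. Only the overall procedure (random start, repeated multiply-and-normalize, Rayleigh quotient estimate, k = 10 iterations) comes from the quoted text.

```python
import math
import random


def matvec(H, v):
    """Multiply a square matrix H (list of rows) by a vector v."""
    return [sum(h * x for h, x in zip(row, v)) for row in H]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def power_iteration(H, k=10, seed=0):
    """Estimate the dominant eigenvalue and eigenvector of H.

    k=10 follows the quoted algorithm; `seed` is a hypothetical
    parameter added here only to make the sketch repeatable.
    """
    rng = random.Random(seed)
    # Random initial vector, as in the quoted "Initialize v0 randomly" step.
    v = [rng.random() for _ in range(len(H))]
    for _ in range(k):
        w = matvec(H, v)
        norm = math.sqrt(dot(w, w))
        if norm == 0.0:
            break  # H maps v to zero; no direction left to normalize
        v = [x / norm for x in w]  # normalize to unit length
    # Rayleigh quotient: (v^T H v) / (v^T v)
    lam = dot(v, matvec(H, v)) / dot(v, v)
    return lam, v


if __name__ == "__main__":
    # Toy diagonal matrix whose dominant eigenvalue is 2.
    lam, v = power_iteration([[2.0, 0.0], [0.0, 1.0]])
    print(lam, v)
```

With a fixed, small k the result is only an approximation. Its accuracy depends on the gap between the two largest eigenvalues in magnitude and on the random start not being orthogonal to the dominant eigenvector.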
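
The grid search described in the Experiment Setup entry covers 6 learning rates and 3 weight decay values, which gives 6 × 3 = 18 configurations per model and task. The sketch below enumerates such a grid in plain Python. The quoted text gives only the endpoints (10^-3 and 10^-5) and the number of values, so the logarithmic spacing is an assumption, as are the names `log_grid`, `grid_search`, and the `evaluate` callback (a hypothetical function returning a validation score where higher is better).

```python
import itertools


def log_grid(high_exp, low_exp, n):
    """Return n values log-spaced from 10**high_exp to 10**low_exp.

    Log spacing is an assumption; the source states only the range
    and the number of values.
    """
    step = (low_exp - high_exp) / (n - 1)
    return [10 ** (high_exp + i * step) for i in range(n)]


# 6 learning rates and 3 weight decay values, both from 1e-3 down to 1e-5.
learning_rates = log_grid(-3, -5, 6)
weight_decays = log_grid(-3, -5, 3)

# Every (learning rate, weight decay) pair: 6 * 3 = 18 configurations.
grid = list(itertools.product(learning_rates, weight_decays))


def grid_search(evaluate, configs):
    """Return the configuration with the highest score.

    `evaluate(lr, wd)` is a hypothetical callback standing in for
    "fine-tune with these hyperparameters and report validation score".
    """
    return max(configs, key=lambda cfg: evaluate(*cfg))
```

In practice `evaluate` would wrap a full fine-tuning run, so the 18 calls dominate the cost; the enumeration itself is trivial.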