Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Learning to Factorize Spatio-Temporal Foundation Models

Authors: Siru Zhong, Junjie Qiu, Yangyu Wu, Xingchen Zou, Zhongwen Rao, Bin Yang, Chenjuan Guo, Hao Xu, Yuxuan Liang

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive forecasting experiments show that in few-shot settings, FactoST reduces MAE by up to 46.4% versus UniST, uses 46.2% fewer parameters, achieves 68% faster inference than OpenCity, and remains competitive with expert models. In our experiments, we aim to address the following research questions (RQ): RQ1: Can FactoST outperform prior approaches (including STGNNs, STFMs and other existing models) under few-shot and zero-shot scenarios? (Sec. 4.1 & Sec. 4.2.) RQ2: Which model component is critical to the final performance? (Sec. 4.3.1.) RQ3: How data- and computation-efficient is FactoST? (Sec. 4.3.2 & Sec. 4.3.3.) RQ4: Can we provide interpretability of the domain adaptation process in FactoST? (Sec. 4.3.4.)
Researcher Affiliation | Collaboration | ¹The Hong Kong University of Science and Technology (Guangzhou), ²Huawei 2012 Laboratories, ³East China Normal University. EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode | Yes | B.2 Pseudocode of FactoST. For reproducibility, we present the detailed pseudocode of FactoST in Algorithm 1, which concisely summarizes the two-stage learning paradigm: universal temporal pretraining (UTP) followed by lightweight spatio-temporal adaptation (STA). Algorithm 1. FactoST: Factorized Spatio-Temporal Foundation Model
Open Source Code | Yes | We will release our code to ensure faithful reproduction of the main results upon the paper's acceptance. This paper follows CC 4.0, and the code is in an anonymized URL.
Open Datasets | Yes | We pretrain the temporal backbone on diverse ST datasets using Monash [16], covering six domains (energy, nature, health, transport, web, economics) with 130M observations across multiple spatial nodes and sampling frequencies from 4 seconds to daily. During pretraining, we extract and process univariate time series per node independently to prevent data leakage. For evaluation, we use eight established ST benchmarks, covering traffic flow (PEMS03/04/07/08), speed (PEMS-BAY, METR-LA), energy (Electricity), temperature (ETTh2), and climate (Weather), which vary widely in spatial scale (21–883 nodes), temporal resolution (5 min–1 h), and sequence length (17k–52k steps), enabling a comprehensive assessment of cross-domain and multi-scale generalization (see A.1.1 for details).
Dataset Splits | Yes | Setting. We evaluate few-shot adaptation using only 10% of labeled training data under two forecasting horizons: short-term (12 → 12) and long-term (96 → 96), following standard protocols [49, 48]. In the short-term setting, MAE drops from 25.96 (1% data) to 17.54 (10% data), already approaching full-shot performance (16.59).
Hardware Specification | No | We implement FactoST using PyTorch, and all experiments are conducted on high-performance GPU servers.
Software Dependencies | No | We implement FactoST using PyTorch, and all experiments are conducted on high-performance GPU servers.
Experiment Setup | Yes | A.1.2 Model Architecture and Hyperparameters. We implement FactoST using PyTorch, and all experiments are conducted on high-performance GPU servers. The architecture consists of three encoder layers and three decoder layers, with 16 attention heads and a latent dimension d = 128. Input sequences are processed using a patching mechanism with a patch size of 12, and the dropout rate is set to 0.2 to prevent overfitting. The feed-forward network within each Transformer layer has a hidden dimension of 512. Pretraining. During pretraining, we use the Adam optimizer with an initial learning rate of 5 × 10^-4 and apply StepLR to decay the learning rate by a fixed factor every few epochs, improving convergence. The model is equipped with N_p = 8 domain prompt learning vectors, each of dimension 128, and in supervised prediction tasks, both the input length and target forecasting horizon are fixed at 96 (this length can be set to any value and acts as the maximum supported step length; we set it to 96 for downstream comparison). For spectral consistency modeling, the number of augmented patches is set to K_f = 4. Pretraining is performed with a large batch size of 16,384 to ensure stable optimization. Fine-tuning. During fine-tuning, we adopt a learning rate of 1 × 10^-3. The lookback window is set to 12 (short-term) or 96 (long-term), with matching prediction horizons. The number of domain prompt tokens (N_p = 3) and patching configuration remain unchanged from pretraining. A top-k selection (k = 3) is applied during domain prompt matching to enhance generalization. Additional configuration includes a memory replacement ratio of 0.3; a memory size of 0.2 relative to total capacity; a spatio-temporal identifier embedding dimension of 32; and a maximum delay step of 3.
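
The hyperparameters quoted in the Experiment Setup row can be gathered into one place for anyone attempting a reproduction. The following is a minimal sketch, not the authors' code: the class and field names (ModelConfig, PretrainConfig, FinetuneConfig) are hypothetical, and only the numeric values come from the quoted text. The head_dim and patches_per_window helpers assume a standard even split of the latent dimension across heads and non-overlapping patches, neither of which the excerpt confirms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    """Architecture values quoted from A.1.2 (field names are hypothetical)."""
    encoder_layers: int = 3
    decoder_layers: int = 3
    attention_heads: int = 16
    latent_dim: int = 128
    patch_size: int = 12
    dropout: float = 0.2
    ffn_hidden_dim: int = 512

    @property
    def head_dim(self) -> int:
        # Assumption: the latent dimension is split evenly across heads.
        return self.latent_dim // self.attention_heads

    def patches_per_window(self, window_length: int) -> int:
        # Assumption: patches do not overlap, so a window yields length // patch_size.
        return window_length // self.patch_size


@dataclass(frozen=True)
class PretrainConfig:
    """Pretraining values quoted from A.1.2 (field names are hypothetical)."""
    learning_rate: float = 5e-4
    num_domain_prompts: int = 8      # N_p as quoted for pretraining
    prompt_dim: int = 128
    input_length: int = 96
    horizon: int = 96
    num_augmented_patches: int = 4   # K_f
    batch_size: int = 16_384


@dataclass(frozen=True)
class FinetuneConfig:
    """Fine-tuning values quoted from A.1.2 (field names are hypothetical)."""
    learning_rate: float = 1e-3
    lookback: int = 12               # 12 (short-term) or 96 (long-term)
    horizon: int = 12                # matches the lookback in both settings
    num_domain_prompts: int = 3      # N_p as quoted for fine-tuning
    top_k: int = 3
    memory_replacement_ratio: float = 0.3
    memory_size_ratio: float = 0.2
    st_identifier_dim: int = 32
    max_delay_step: int = 3


model = ModelConfig()
short_term = FinetuneConfig()
long_term = FinetuneConfig(lookback=96, horizon=96)
```

Frozen dataclasses keep the quoted values immutable, so the long-term setting is expressed as a separate instance rather than a mutation of the short-term one. The StepLR decay factor and step interval are not given in the excerpt, so they are deliberately left out.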
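
The few-shot protocol quoted in the Dataset Splits row (10% of the labeled training data, with 12 → 12 or 96 → 96 windows) can be illustrated with a small sketch. Several details here are assumptions, since the excerpt does not state them: the subset is taken as the chronologically first 10% of the training series, windows slide with a stride of 1, and the function names few_shot_subset and make_windows are hypothetical.

```python
from typing import List, Sequence, Tuple


def few_shot_subset(train: Sequence[float], ratio: float = 0.1) -> Sequence[float]:
    """Keep the first `ratio` share of the training series (assumed chronological)."""
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in (0, 1]")
    keep = max(1, round(len(train) * ratio))
    return train[:keep]


def make_windows(
    series: Sequence[float], lookback: int, horizon: int
) -> List[Tuple[List[float], List[float]]]:
    """Slice a series into (input, target) pairs with stride 1 (an assumption)."""
    windows = []
    last_start = len(series) - lookback - horizon
    for start in range(last_start + 1):
        split = start + lookback
        inputs = list(series[start:split])
        targets = list(series[split:split + horizon])
        windows.append((inputs, targets))
    return windows


# Toy series standing in for one node's training data.
train_series = list(range(1000))
subset = few_shot_subset(train_series, ratio=0.1)              # 100 steps
short_term_windows = make_windows(subset, lookback=12, horizon=12)
long_term_windows = make_windows(subset, lookback=96, horizon=96)
```

With this toy series the 100-step subset yields 77 short-term windows and none for the 96 → 96 setting, which shows why the long-term horizon needs a training series much longer than the toy one.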