Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Diffusion Transformers as Open-World Spatiotemporal Foundation Models

Authors: Yuan Yuan, Chonghua Han, Jingtao Ding, Guozhen Zhang, Depeng Jin, Yong Li

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that UrbanDiT effectively captures complex urban spatiotemporal dynamics, achieving state-of-the-art performance across multiple datasets and tasks. It also exhibits powerful zero-shot capabilities, proving its applicability in open-world settings. UrbanDiT marks a significant step forward in the advancement of urban foundation models.
Researcher Affiliation | Collaboration | Yuan Yuan1, Chonghua Han1, Jingtao Ding1, Guozhen Zhang2, Depeng Jin1, Yong Li1,* 1 Department of Electronic Engineering, BNRist, Tsinghua University 2 TsingRoc.ai, Beijing, China *Corresponding author: EMAIL
Pseudocode | No | The paper describes the training process using a formula in Section 3.4, but it does not include a clearly labeled algorithm block or pseudocode.
Open Source Code | Yes | Code and datasets are publicly available at https://github.com/tsinghua-fib-lab/UrbanDiT.
Open Datasets | Yes | Code and datasets are publicly available at https://github.com/tsinghua-fib-lab/UrbanDiT. We utilize a diverse set of datasets from multiple domains and cities to evaluate urban spatio-temporal applications, which include taxi demand, cellular network traffic, crowd flows, transportation traffic, and dynamic population, reflecting a broad spectrum of urban activities. The datasets are sourced from different cities such as New York City, Beijing, Shanghai, and Nanjing, each representing unique urban characteristics. ... For a detailed summary of the datasets, please refer to Table 4 and Table 5 in Appendix A.
Dataset Splits | Yes | We split the datasets into training, validation, and testing sets along the temporal dimension, using a 6:2:2 ratio. To ensure no overlap between them, we carefully remove any overlapping points, ensuring clear separation across the temporal splits for evaluation.
Hardware Specification | No | The paper provides a "Computational Analysis" in Appendix D.4 which discusses training and inference times, but it does not specify the type of compute workers (CPU/GPU models, memory, etc.) used for the experiments.
Software Dependencies | No | The paper does not explicitly mention specific software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x).
Experiment Setup | Yes | For UrbanDiT-S (small), the model consists of 4 transformer layers with a hidden size of 256. Both the spatial and temporal patch sizes are set to 2, and the number of attention heads is 4. UrbanDiT-M (medium) is composed of 6 transformer layers with a hidden size of 384, maintaining the same spatial and temporal patch sizes of 2, and 6 attention heads. UrbanDiT-L (large) includes 12 transformer layers, a hidden size of 384, spatial and temporal patch sizes of 2, and 12 attention heads. Each memory pool contains 512 embeddings, with the embedding dimension matching the model's hidden size. The learning rate is set to 1e-4, and the maximum number of training epochs is 500, with early stopping applied to prevent overfitting. The batch size is tailored for each dataset to maintain a similar number of training iterations across them.
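
The Dataset Splits and Experiment Setup entries above can be made concrete with a short Python sketch. It is an illustration under stated assumptions, not the authors' code. The variant sizes, memory pool size, learning rate, and epoch limit are taken from the quoted text. The names `VariantConfig`, `VARIANTS`, and `temporal_split` are hypothetical, and so is the overlap rule: the sketch assumes samples are sliding windows of `window` consecutive time steps and drops any window that would cross a split boundary, which is one plausible reading of "remove any overlapping points".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantConfig:
    """Hyperparameters quoted in the Experiment Setup entry (names are hypothetical)."""
    layers: int
    hidden_size: int
    heads: int
    patch_size: int = 2          # spatial and temporal patch size
    memory_pool_size: int = 512  # embeddings per memory pool
    learning_rate: float = 1e-4
    max_epochs: int = 500        # upper bound; early stopping may end training sooner


VARIANTS = {
    "S": VariantConfig(layers=4, hidden_size=256, heads=4),
    "M": VariantConfig(layers=6, hidden_size=384, heads=6),
    "L": VariantConfig(layers=12, hidden_size=384, heads=12),
}


def temporal_split(num_steps, window, ratios=(6, 2, 2)):
    """Split window start indices into train/val/test along the time axis.

    Assumption (not from the paper): a sample is a window of `window`
    consecutive steps, and windows crossing a split boundary are discarded
    so that no time step is shared between splits.
    """
    total = sum(ratios)
    train_end = num_steps * ratios[0] // total
    val_end = num_steps * (ratios[0] + ratios[1]) // total
    bounds = [(0, train_end), (train_end, val_end), (val_end, num_steps)]
    # Keep only the starts whose full window fits inside a single segment.
    return tuple(
        list(range(lo, max(lo, hi - window + 1))) for lo, hi in bounds
    )


if __name__ == "__main__":
    train, val, test = temporal_split(num_steps=100, window=10)
    print(len(train), len(val), len(test))
    print(VARIANTS["L"])
```

With 100 time steps and a window of 10, the segment boundaries fall at steps 60 and 80, and the last training window ends at step 59, before the first validation window begins. The repository linked in the Open Source Code entry is the place to check how the authors actually construct samples and remove overlap.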