Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

RHYTHM: Reasoning with Hierarchical Temporal Tokenization for Human Mobility

Authors: Haoyu He, Haozheng Luo, Yan Chen, Qi (Cheems) Wang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate our model against state-of-the-art methods using three real-world datasets. Notably, RHYTHM achieves a 2.4% improvement in overall accuracy, a 5.0% increase on weekends, and a 24.6% reduction in training time.
Researcher Affiliation	Academia	Northeastern University Northwestern University EMAIL EMAIL, EMAIL
Pseudocode	Yes	Algorithm 1 RHYTHM Overall Pipeline
Open Source Code	Yes	Code is publicly available at https://github.com/he-h/rhythm.
Open Datasets	Yes	We evaluate our approach on three real-world datasets collected from the cities of Kumamoto, Sapporo, and Hiroshima sourced from YJMob100K [74].
Dataset Splits	Yes	Each dataset is divided into training, validation, and test sets based on days, with 70%, 20%, and 10% of the data allocated to each set, respectively.
Hardware Specification	Yes	We perform all experiments using a single NVIDIA A100 GPU with 40GB of memory and a 24-core Intel(R) Xeon(R) Gold 6338 CPU operating at 2.00GHz.
Software Dependencies	No	Our code is developed in PyTorch [52] and utilizes the Hugging Face Transformer Library for experimental execution.
Experiment Setup	Yes	Embeddings for time-of-day and day-of-week, the categorical location embedding, and the coordinate projection all use hidden dimensions of 128, 128, 256, and 128, respectively. We use AdamW [41] as the optimizer. For model training, we conduct a systematic hyperparameter search, exploring learning rates from the set {1e-4, 3e-4, 5e-4} and weight decay values from {0, 0.001, 0.01}. Through extensive validation experiments, we determine the optimal configuration for each dataset. All models are trained with a consistent batch size of 64 across all datasets for fair comparison. The final hyperparameter settings are selected based on performance on the validation set.
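
The Dataset Splits and Experiment Setup rows describe a day-based 70/20/10 split and a small grid search whose winner is picked by validation performance. The following is a minimal sketch of that procedure, not the authors' code. The function names (`split_days`, `hyperparameter_grid`, `select_best`) and the `validate` callback are hypothetical, and the split is assumed to be chronological, which the quoted text does not state.

```python
from itertools import product

# Values quoted in the Experiment Setup row.
LEARNING_RATES = [1e-4, 3e-4, 5e-4]
WEIGHT_DECAYS = [0.0, 0.001, 0.01]
BATCH_SIZE = 64


def split_days(num_days, train_pct=70, val_pct=20):
    """Split day indices 0..num_days-1 into train/validation/test lists.

    Assumption: days are taken in chronological order; the source only
    says the split is "based on days". Integer arithmetic avoids
    floating-point rounding surprises; the test set gets the remainder.
    """
    days = list(range(num_days))
    n_train = num_days * train_pct // 100
    n_val = num_days * val_pct // 100
    train = days[:n_train]
    val = days[n_train:n_train + n_val]
    test = days[n_train + n_val:]
    return train, val, test


def hyperparameter_grid():
    """Enumerate every (learning rate, weight decay) pair at batch size 64."""
    return [
        {"lr": lr, "weight_decay": wd, "batch_size": BATCH_SIZE}
        for lr, wd in product(LEARNING_RATES, WEIGHT_DECAYS)
    ]


def select_best(configs, validate):
    """Return the config with the highest validation score.

    `validate` is a hypothetical callback that trains a model with the
    given config and returns its validation-set metric (higher is better).
    """
    return max(configs, key=validate)
```

In practice `validate` would train the model on the training days with AdamW and score it on the validation days; the test days are touched only once, after the configuration is fixed.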