Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

RHYTHM: Reasoning with Hierarchical Temporal Tokenization for Human Mobility

Authors: Haoyu He, Haozheng Luo, Yan Chen, Qi (Cheems) Wang

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our model against state-of-the-art methods using three real-world datasets. Notably, RHYTHM achieves a 2.4% improvement in overall accuracy, a 5.0% increase on weekends, and a 24.6% reduction in training time.
Researcher Affiliation | Academia | Northeastern University; Northwestern University
Pseudocode | Yes | Algorithm 1 RHYTHM Overall Pipeline
Open Source Code | Yes | Code is publicly available at https://github.com/he-h/rhythm.
Open Datasets | Yes | We evaluate our approach on three real-world datasets collected from the cities of Kumamoto, Sapporo, and Hiroshima sourced from YJMob100K [74].
Dataset Splits | Yes | Each dataset is divided into training, validation, and test sets based on days, with 70%, 20%, and 10% of the data allocated to each set, respectively.
Hardware Specification | Yes | We perform all experiments using a single NVIDIA A100 GPU with 40GB of memory and a 24-core Intel(R) Xeon(R) Gold 6338 CPU operating at 2.00GHz.
Software Dependencies | No | Our code is developed in PyTorch [52] and utilizes the Hugging Face Transformer Library for experimental execution.
Experiment Setup | Yes | Embeddings for time-of-day and day-of-week, the categorical location embedding, and the coordinate projection all use hidden dimensions of 128, 128, 256, and 128, respectively. We use AdamW [41] as the optimizer. For model training, we conduct a systematic hyperparameter search, exploring learning rates from the set {1e-4, 3e-4, 5e-4} and weight decay values from {0, 0.001, 0.01}. Through extensive validation experiments, we determine the optimal configuration for each dataset. All models are trained with a consistent batch size of 64 across all datasets for fair comparison. The final hyperparameter settings are selected based on performance on the validation set.
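
The reported dataset split and hyperparameter search can be illustrated with a minimal sketch. It is not taken from the authors' repository. The function names (split_days, hyperparameter_grid, select_best) and the validate callback are hypothetical, and the chronological ordering of days is an assumption: the quoted text says only that splits are "based on days". The percentages, learning rates, weight decay values, and batch size are the ones quoted above.

```python
from itertools import product

# Values quoted in the Experiment Setup row.
LEARNING_RATES = [1e-4, 3e-4, 5e-4]
WEIGHT_DECAYS = [0, 0.001, 0.01]
BATCH_SIZE = 64


def split_days(days, train_pct=70, val_pct=20):
    """Split day indices into train/validation/test sets (70/20/10 by default).

    Assumption: days are split in chronological order; the source only
    states that the split is "based on days".
    """
    days = sorted(days)
    n_train = len(days) * train_pct // 100  # integer arithmetic avoids float rounding
    n_val = len(days) * val_pct // 100
    return days[:n_train], days[n_train:n_train + n_val], days[n_train + n_val:]


def hyperparameter_grid():
    """Enumerate every (learning rate, weight decay) pair at the fixed batch size."""
    return [
        {"lr": lr, "weight_decay": wd, "batch_size": BATCH_SIZE}
        for lr, wd in product(LEARNING_RATES, WEIGHT_DECAYS)
    ]


def select_best(configs, validate):
    """Return the config with the highest validation score.

    `validate` is a hypothetical callback that trains a model with the
    given config and returns its validation-set metric.
    """
    return max(configs, key=validate)
```

In use, split_days would be called once per city dataset, and select_best would be run separately for each dataset, matching the statement that the optimal configuration is determined per dataset on the validation set.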