Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Deep Continuous-Time State-Space Models for Marked Event Sequences

Authors: Yuxin Chang, Alex Boyd, Cao (Danica) Xiao, Taha Kass-Hout, Parminder Bhatia, Padhraic Smyth, Andrew Warrington

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirically, S2P2 achieves state-of-the-art predictive likelihoods across eight real-world datasets, delivering an average improvement of 33% over the best existing approaches. We empirically evaluate our model at scale on a range of metrics across eight real-world datasets, finding that S2P2 matches or exceeds the average predictive performance of baselines, achieving either best- or second-best average performance on all six metrics.
Researcher Affiliation	Collaboration	Yuxin Chang (University of California, Irvine); Alex Boyd (GE HealthCare); Cao Xiao (GE HealthCare); Taha Kass-Hout (GE HealthCare); Parminder Bhatia (GE HealthCare); Padhraic Smyth (University of California, Irvine); Andrew Warrington (GE HealthCare)
Pseudocode	Yes	In Algorithms 1 to 3, we explicitly detail how to use a parallel scan to compute the sequence of right limits at events, how to then evolve those to compute left limits, and then how to subsequently compute the log-likelihood of the sequence.
Open Source Code	Yes	Our model is fully integrated into the EasyTPP [Xue et al., 2023] library [link]. Other code changes to reproduce our results can be found in our forked repository [link]. Model checkpoints are available on request.
Open Datasets	Yes	Datasets: We compare models on eight different datasets, including five datasets available from EasyTPP [Xue et al., 2023] (Amazon, Retweet, Taxi, Taobao and Stack Overflow). We also add two commonly used datasets in the literature (Last.fm and MIMIC-II), as well as a new medical events dataset derived from the publicly available EHRSHOT dataset [Wornow et al., 2023].
Dataset Splits	Yes	We use the default train/validation/test splits for EasyTPP benchmark datasets. For MIMIC-II, we follow Du et al. [2016] and keep the 325 test sequences in the test split, and further split the 2,935 training sequences into 2,600 for training and 325 for validation. In our pre-processed datasets, Last.fm and EHRSHOT, we randomly partition into subsets containing 70%, 15%, 15% of all sequences for training/validation/test respectively.
Hardware Specification	Yes	All models were trained on a single 24GB NVIDIA A5000 GPU.
Software Dependencies	No	All baseline models used up-to-date PyTorch implementations, provided by the EasyTPP library [Xue et al., 2023] as of May 2025. Our EasyTPP PyTorch S2P2 is written in pure PyTorch... We therefore include the runtimes of a standalone JAX S2P2 implementation...
Experiment Setup	Yes	We apply a grid search for all models on all datasets for hyperparameter tuning. We use a default batch size of 256 for training. For models/datasets that require more memory (e.g., large mark space or long sequences), we reduce the batch size and keep them as consistent as possible among all the models on each dataset. We use the Adam stochastic gradient optimizer [Kingma and Ba, 2015], with a learning rate of 0.01 and a linear warm-up schedule over the first 1% of iterations, followed by a cosine decay. Initial experiments showed this setting generally worked well across different models and datasets and led to convergence within 300 epochs. We also clip the gradient norm to have a max norm of 1 for training stability. We use Monte-Carlo samples to estimate the integral in log-likelihood, where we use 10 Monte-Carlo points per event during training.
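
The optimization schedule quoted in the Experiment Setup row (peak learning rate 0.01, linear warm-up over the first 1% of iterations, cosine decay afterwards, gradient norm clipped to 1) can be illustrated with a minimal pure-Python sketch. This is not the authors' code. The function names `learning_rate` and `clip_by_global_norm` are hypothetical, and several details are assumptions the excerpt does not state: the warm-up starts near zero, the cosine decay ends at zero, and the schedule is applied per iteration.

```python
import math


def learning_rate(step, total_steps, base_lr=0.01, warmup_frac=0.01):
    """Learning rate at a given iteration.

    Assumptions (not stated in the excerpt): warm-up ramps up from
    base_lr / warmup_steps, and the cosine phase decays to exactly zero.
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear warm-up: reaches base_lr on the last warm-up step.
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay over the remaining iterations.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Rescale a flat list of gradient values so their L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    # Only ever shrink the gradients; small gradients pass through unchanged.
    scale = min(1.0, max_norm / (norm + eps))
    return [g * scale for g in grads]
```

With `total_steps=1000`, the warm-up covers the first 10 iterations and the rate then follows a half cosine from 0.01 down to zero. In a PyTorch training loop the same behavior would typically come from the framework's own scheduler and gradient-clipping utilities rather than hand-written helpers like these.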