Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

A Semantic Parsing Framework for End-to-End Time Normalization

Authors: Xin Su, Sungduk Yu, Phillip Howard, Steven Bethard

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our experiments show that small, locally deployable models trained on this augmented data can achieve strong performance, outperforming even their LLM parents and enabling practical, accurate, and interpretable time normalization.
Researcher Affiliation	Collaboration	Xin Su (Intel), Sungduk Yu (Oracle), Phillip Howard (Thoughtworks), Steven Bethard (University of Arizona)
Pseudocode	No	The paper describes the methodology in Section 3 and provides API references and usage examples in Appendix B ("SCATE Prompt") but does not include any explicitly labeled pseudocode or algorithm blocks for its proposed methods like data augmentation or model training.
Open Source Code	Yes	Work done while at Intel. Code and models are available at https://github.com/clulab/normit
Open Datasets	Yes	TempEval-2013 [UzZaman et al., 2013] data has been annotated with publicly available SCATE annotations, including training, development, and test sets. [...] We randomly sample 10k sentences from the CC-News [Mackenzie et al., 2020] dataset widely used in large language model pretraining as our source for data augmentation.
Dataset Splits	Yes	The resulting dataset includes 557 annotated SCATE code blocks in the training set and 313 in the test set.
Hardware Specification	Yes	For local model training, we train both Qwen/Qwen2.5-0.5B-Instruct [Team, 2024] and T5-Large [Raffel et al., 2020] on a single NVIDIA 80GB A100 GPU with 5 epochs and batch size 64. [...] We conduct runtime measurements on the test set using an NVIDIA RTX 3090 GPU with vLLM [Kwon et al., 2023] for efficient inference.
Software Dependencies	No	The paper mentions implementing a SCATE Python library, utilizing the dateutil library, and accessing LLMs through cloud-based APIs (Azure OpenAI, Amazon Bedrock, Google Cloud Platform), but it does not specify version numbers for Python or any specific Python libraries or frameworks used for development or training.
Experiment Setup	Yes	For local model training, we train both Qwen/Qwen2.5-0.5B-Instruct [Team, 2024] and T5-Large [Raffel et al., 2020] on a single NVIDIA 80GB A100 GPU with 5 epochs and batch size 64. The learning rates are 2 × 10⁻⁵ for Qwen and 5 × 10⁻⁵ for T5.
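
For readers attempting a reproduction, the training settings quoted in the Hardware Specification and Experiment Setup rows can be collected into one small configuration object. The following is only a sketch under stated assumptions: the class name `TrainingSetup`, the dictionary `REPORTED_SETUPS`, and the field names are hypothetical and do not come from the paper or its repository. Settings the table does not report (optimizer, scheduler, sequence length, random seeds) are deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingSetup:
    """Hypothetical container for the hyperparameters quoted in the table."""
    model_name: str          # model name as written in the paper's text
    learning_rate: float     # reported per model
    epochs: int = 5          # reported for both models
    batch_size: int = 64     # reported for both models
    train_gpu: str = "NVIDIA A100 80GB (single GPU)"


# Values below are the ones quoted above; the dictionary keys are illustrative.
REPORTED_SETUPS = {
    "qwen": TrainingSetup("Qwen/Qwen2.5-0.5B-Instruct", learning_rate=2e-5),
    "t5": TrainingSetup("T5-Large", learning_rate=5e-5),
}

if __name__ == "__main__":
    for key, setup in REPORTED_SETUPS.items():
        print(f"{key}: {setup}")
```

Because the Software Dependencies row is marked "No", library versions would still have to be taken from the linked repository rather than from a configuration like this one.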