Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

MIRA: Medical Time Series Foundation Model for Real-World Health Data

Authors: Hao Li, Bowen Deng, Chang Xu, ZhiYuan Feng, Viktor Schlegel, Yu-Hao Huang, Yizheng Sun, Jingyuan Sun, Kailai Yang, Yiyao Yu, Jiang Bian

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Pretrained on a large-scale and diverse medical corpus comprising over 454 billion time points collected from publicly available datasets, MIRA achieves reductions in forecasting errors by an average of 8% and 6% in out-of-distribution and in-distribution scenarios, respectively, when compared to other zero-shot and fine-tuned baselines. We also introduce a comprehensive benchmark spanning multiple downstream clinical tasks, establishing a foundation for future research in medical time series modeling.
Researcher Affiliation	Collaboration	1 Microsoft Research 2 University of Manchester 3 Peking University 4 Tsinghua University 5 Nanjing University 6 Imperial Global Singapore, Imperial College London
Pseudocode	Yes	Algorithm 1 Neural ODE State Transition
Open Source Code	Yes	Our code is available at Microsoft/MIRA.
Open Datasets	Yes	All data are drawn from publicly available clinical datasets, including MIMIC-III [73], MIMIC-IV [74], PTB-XL [75], Sleep-EDF [76], and the WAVES Pediatric Waveform Database [77].
Dataset Splits	Yes	For comparison, we fine-tuned the full-shot forecasting models on the training split of each benchmark, while the zero-shot foundation models were evaluated directly without any task-specific training or fine-tuning. Objective. We evaluated in-distribution performance by holding out a portion of the pre-training datasets as test sets, ensuring no data leakage. All models were tested in zero-shot settings. (2) originally regular datasets, i.e., MIT-BIH [79], Johns Hopkins COVID-19 Dataset [80], CDC Influenza Hospitalizations Admissions (CDC-IHA), Heart Rate [81] and illness [82], for which we simulate irregularity by randomly masking 30% of time points.
Hardware Specification	Yes	All models are trained on up to eight NVIDIA 80GB A100 GPUs with a micro-batch size of 128 and a maximum sequence length of 512.
Software Dependencies	No	The paper mentions "Neural ODEs [40]" and "ODE solver (i.e. the Dormand-Prince (RK45) method)" but does not specify any software libraries or their versions (e.g., PyTorch, TensorFlow, scikit-learn, etc. with version numbers).
Experiment Setup	Yes	All models are trained on up to eight NVIDIA 80GB A100 GPUs with a micro-batch size of 128 and a maximum sequence length of 512. We pre-trained for one epoch, with each training step processing approximately 65,000 time points. We consider forecast horizons of 24, 32, 48, 64 for short-term and long-term evaluation. Following standard practice, we apply an auxiliary load balancing loss with weight α = 0.02 to encourage expert utilization.
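
Illustrative note: the Dataset Splits entry above says that irregularity is simulated on originally regular datasets "by randomly masking 30% of time points." The following is a minimal sketch of what such a masking step could look like. It is not taken from the MIRA code base. The function name `mask_time_points`, the fixed seed, and the choice to drop an exact count of points uniformly at random are all assumptions made for illustration; the excerpts do not specify these details.

```python
import random


def mask_time_points(timestamps, values, mask_ratio=0.3, seed=0):
    """Drop a random subset of observations to mimic irregular sampling.

    Hypothetical helper: the name, the seed handling, and the uniform
    sampling without replacement are assumptions, not the paper's code.
    """
    if len(timestamps) != len(values):
        raise ValueError("timestamps and values must have the same length")
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must lie in [0, 1]")

    n = len(timestamps)
    n_drop = round(n * mask_ratio)  # number of observations to remove
    rng = random.Random(seed)       # local RNG so the mask is repeatable
    dropped = set(rng.sample(range(n), n_drop))

    # Keep the surviving observations in their original temporal order.
    kept = [i for i in range(n) if i not in dropped]
    return [timestamps[i] for i in kept], [values[i] for i in kept]


# Usage: a regularly sampled series of 100 points, one per time unit.
ts = list(range(100))
vals = [0.5 * t for t in ts]
irregular_ts, irregular_vals = mask_time_points(ts, vals, mask_ratio=0.3)
print(len(irregular_ts))  # 70 observations remain, now unevenly spaced
```

Retaining the original timestamps of the surviving points, rather than re-indexing them, is what makes the resulting series irregular: the gaps between consecutive observations are no longer constant.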