Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026].

Generating Multi-Table Time Series EHR from Latent Space with Minimal Preprocessing

Authors: Eunbyeol Cho, Jiyoun Kim, Minjae Lee, Sungjin Park, Edward Choi

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Validated on two open-source EHR datasets, RawMed outperforms baseline models in fidelity and utility. The code is available at https://github.com/eunbyeol-cho/RawMed.
Researcher Affiliation	Collaboration	KAIST, FuriosaAI
Pseudocode	Yes	Algorithm 1: Postprocessing for Relational Table Construction
Open Source Code	Yes	The code is available at https://github.com/eunbyeol-cho/RawMed.
Open Datasets	Yes	In this study, we used two publicly available EHR datasets, MIMIC-IV [28] and eICU [29].
Dataset Splits	Yes	The dataset was split into a 9:1 ratio, allocating 90% for training and 10% for testing.
Hardware Specification	Yes	Training is performed on a single NVIDIA A6000 GPU, completing in under 24 hours. Training was conducted on three NVIDIA A6000 GPUs for MIMIC and on a single NVIDIA A6000 GPU for eICU, both completing in under 48 hours. Training ran for 2 epochs on a single RTX 3090 GPU, taking approximately 10 days for MIMIC-IV and 20 days for eICU. Training took less than 10 hours on a single NVIDIA RTX 3090 GPU without early stopping. Training was completed in under 3 hours on a single NVIDIA A6000 GPU.
Software Dependencies	No	The paper mentions several models, tools, and optimizers, such as AdamW, the Bio+Clinical BERT tokenizer, FlashAttention-2, and Meta-LLaMA-3.1-8B, but does not provide specific version numbers for the underlying software libraries or programming languages (e.g., Python 3.x, PyTorch 1.x).
Experiment Setup	Yes	The event compression module uses the AdamW optimizer (learning rate: 5e-4, weight decay: 0.01), processing batches of 4096 events for up to 200 epochs, with early stopping after 10 epochs of stagnant validation accuracy. The loss function combines reconstruction losses for text, type, and digit-place embeddings with a commitment loss (commitment cost: 1.0). An EMA decay factor of 0.8 is applied for codebook updates in VQ-VAE training. A dropout rate of 0.2 is used for regularization. The input embedding has sequence length L = 128 and embedding dimension F = 256, compressed to a latent representation with L_z = 4 and F_z = 256. The codebook contains K = 1024 entries, with residual quantization (RQ) using a depth of D = 2, yielding a 4 × 2 code representation. The temporal modeling module uses the AdamW optimizer (learning rate: 3e-4, weight decay: 0.01), with batches of 32 sequences for up to 200 epochs, with early stopping after 10 epochs without improvement. The Tempo Transformer consists of 12 layers, each with 8 attention heads, a hidden dimension of 512, and a feed-forward dimension of 2048.
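
For readers who want to re-create the reported setup, the values in the Dataset Splits and Experiment Setup rows can be gathered into a small configuration object. The following is a minimal sketch, not the authors' code: the class name, field names, the `split_9_to_1` helper, and the use of a seeded shuffle are all assumptions made for illustration. Only the numeric values come from the table above.

```python
from dataclasses import dataclass
import random


@dataclass(frozen=True)
class CompressionConfig:
    """Event compression hyperparameters quoted in the table.

    Field names are hypothetical; only the values are from the source.
    """
    learning_rate: float = 5e-4
    weight_decay: float = 0.01
    batch_size: int = 4096            # events per batch
    max_epochs: int = 200
    early_stopping_patience: int = 10
    commitment_cost: float = 1.0
    ema_decay: float = 0.8            # codebook EMA updates
    dropout: float = 0.2
    seq_len: int = 128                # L
    embed_dim: int = 256              # F
    latent_len: int = 4               # L_z
    latent_dim: int = 256             # F_z
    codebook_size: int = 1024         # K
    rq_depth: int = 2                 # D

    @property
    def codes_per_event(self) -> int:
        # A 4 x 2 code representation: L_z positions times D quantization levels.
        return self.latent_len * self.rq_depth


def split_9_to_1(items, seed=0):
    """Split items 90% / 10%. Shuffling with a fixed seed is an assumption."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 9 // 10     # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    cfg = CompressionConfig()
    train, test = split_9_to_1(range(1000))
    print(cfg.codes_per_event, len(train), len(test))
```

With the reported L_z = 4 and D = 2, each event is represented by 4 × 2 = 8 discrete codes, each drawn from a codebook of K = 1024 entries.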