Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Fine-grained List-wise Alignment for Generative Medication Recommendation

Authors: Chenxiao Fan, Chongming Gao, Wentao Shi, Yaxin Gong, Zhao Zihao, Fuli Feng

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments on benchmark datasets demonstrate that FLAME achieves state-of-the-art performance, delivering superior accuracy, controllable safety-accuracy trade-offs, and strong generalization across diverse clinical scenarios. Our code is available at https://github.com/cxfann/Flame.
Researcher Affiliation	Academia	University of Science and Technology of China EMAIL EMAIL, EMAIL
Pseudocode	No	The paper describes methods and formulas (e.g., J_GRPO(θ), Â_{i,t}, φ(step(t))) and process illustrations (Figure 2), but does not contain a clearly labeled 'Pseudocode' or 'Algorithm' block.
Open Source Code	Yes	Our code is available at https://github.com/cxfann/Flame.
Open Datasets	Yes	Datasets. We use real-world EHR datasets: MIMIC-III [14] for training and evaluation, and MIMIC-IV [15] and eICU [16] for generalization testing. DDI relations are obtained from TWOSIDES [11].
Dataset Splits	Yes	Following prior works, MIMIC-III is split into training/validation/test sets (4:1:1).
Hardware Specification	Yes	We conduct all experiments on NVIDIA A100-SXM4-80GB GPUs, with Python 3.10 and PyTorch 2.5.1.
Software Dependencies	Yes	We conduct all experiments on NVIDIA A100-SXM4-80GB GPUs, with Python 3.10 and PyTorch 2.5.1. During all training stages, the model is quantized using bf16 and LoRA, and optimized using the adamw_torch optimizer. ... We use the Unsloth framework and vLLM 0.7.3 for acceleration.
Experiment Setup	Yes	For the SFT of drug-level classifier π_cls, we use Llama3-Aloe-8B-Alpha as the base model. Four projectors are randomly initialized: pat_projector, diag_projector, pro_projector, and med_projector, each consisting of a two-layer MLP with GELU activation. The learning rate is set to 5e-4, with a batch size of 128 and one epoch. For the SFT of list-wise policy π_list, the model and projector weights from the previous step (π_cls) are used as initialization. The learning rate remains 5e-4, with a batch size of 64 and one epoch. When performing step-wise GRPO on π_list, we initialize the model and projector weights from the previous SFT step. The projector parameter r_grad is set to False. ... The learning rate is set to 1e-5, with a batch size of 16, num_generations set to 8, and one epoch. The hyperparameter α is chosen from the set [0, 2, 5, 10, 20, 30, 40, 50] to adapt to different DDI requirements, while β is set to 0.5, and λ is set to 5.
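
Illustrative note on the Dataset Splits row: the 4:1:1 training/validation/test ratio can be sketched as a seeded shuffle followed by slicing. This is only a minimal sketch under assumptions, not the authors' procedure. The function name split_4_1_1, the seed argument, and the choice to split a flat list of records are all hypothetical; the excerpt above does not say whether the split is done per patient or per visit, or which random seed was used.

```python
import random


def split_4_1_1(records, seed=0):
    """Shuffle records and split them 4:1:1 into train/val/test.

    Hypothetical helper: the splitting unit and the seed are assumptions,
    not details taken from the paper.
    """
    items = list(records)
    # A dedicated Random instance keeps the shuffle reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(items)

    n = len(items)
    n_train = n * 4 // 6  # four parts out of six
    n_val = n // 6        # one part out of six

    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


if __name__ == "__main__":
    train, val, test = split_4_1_1(range(600))
    print(len(train), len(val), len(test))  # 400 100 100
```

Fixing the seed makes the partition repeatable across runs, which is the property a reproducibility check on the split would look for. When the record count is not divisible by six, this sketch assigns the leftover records to the test set.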