Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Fine-grained List-wise Alignment for Generative Medication Recommendation
Authors: Chenxiao Fan, Chongming Gao, Wentao Shi, Yaxin Gong, Zhao Zihao, Fuli Feng
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on benchmark datasets demonstrate that FLAME achieves state-of-the-art performance, delivering superior accuracy, controllable safety accuracy trade-offs, and strong generalization across diverse clinical scenarios. Our code is available at https://github.com/cxfann/Flame. |
| Researcher Affiliation | Academia | University of Science and Technology of China |
| Pseudocode | No | The paper describes methods and formulas (e.g., $J_{\mathrm{GRPO}}(\theta)$, $\hat{A}_{i,t}$, $\varphi(\mathrm{step}(t))$) and process illustrations (Figure 2), but does not contain a clearly labeled 'Pseudocode' or 'Algorithm' block. |
| Open Source Code | Yes | Our code is available at https://github.com/cxfann/Flame. |
| Open Datasets | Yes | Datasets. We use real-world EHR datasets: MIMIC-III [14] for training and evaluation, and MIMIC-IV [15] and eICU [16] for generalization testing. DDI relations are obtained from TWOSIDES [11]. |
| Dataset Splits | Yes | Following prior works, MIMIC-III is split into training/validation/test sets (4:1:1). |
| Hardware Specification | Yes | We conduct all experiments on NVIDIA A100-SXM4-80GB GPUs, with Python 3.10 and PyTorch 2.5.1. |
| Software Dependencies | Yes | We conduct all experiments on NVIDIA A100-SXM4-80GB GPUs, with Python 3.10 and PyTorch 2.5.1. During all training stages, the model is quantized using bf16 and LoRA, and optimized using the adamw_torch optimizer. ... We use the Unsloth framework and vLLM 0.7.3 for acceleration. |
| Experiment Setup | Yes | For the SFT of drug-level classifier $\pi_{\mathrm{cls}}$, we use Llama3-Aloe-8B-Alpha as the base model. Four projectors are randomly initialized: pat_projector, diag_projector, pro_projector, and med_projector, each consisting of a two-layer MLP with GELU activation. The learning rate is set to 5e-4, with a batch size of 128 and one epoch. For the SFT of list-wise policy $\pi_{\mathrm{list}}$, the model and projector weights from the previous step ($\pi_{\mathrm{cls}}$) are used as initialization. The learning rate remains 5e-4, with a batch size of 64 and one epoch. When performing step-wise GRPO on $\pi_{\mathrm{list}}$, we initialize the model and projector weights from the previous SFT step. The projector parameter r_grad is set to False. ... The learning rate is set to 1e-5, with a batch size of 16, num_generations set to 8, and one epoch. The hyperparameter α is chosen from the set [0, 2, 5, 10, 20, 30, 40, 50] to adapt to different DDI requirements, while β is set to 0.5, and λ is set to 5. |
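
The setup reported in the Dataset Splits and Experiment Setup rows can be written down as plain data, which may help when checking a rerun against the reported values. The following is a minimal sketch, not the authors' code: the numbers are copied from the table, while the names `StageConfig`, `STAGES`, and `split_4_1_1` are hypothetical, and the shuffling, the seed, and the rounding rule in the split are assumptions that the table does not specify.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    """Hyperparameters reported for one training stage (field names are hypothetical)."""
    name: str
    learning_rate: float
    batch_size: int
    epochs: int = 1  # every stage is reported as one epoch


# Values copied from the Experiment Setup row, in training order.
STAGES = (
    StageConfig("sft_drug_level_classifier", 5e-4, 128),
    StageConfig("sft_list_wise_policy", 5e-4, 64),
    StageConfig("step_wise_grpo", 1e-5, 16),
)
NUM_GENERATIONS = 8                        # GRPO stage only
ALPHA_GRID = (0, 2, 5, 10, 20, 30, 40, 50)  # alpha is chosen from this set
BETA = 0.5
LAMBDA = 5


def split_4_1_1(items, seed=0):
    """Shuffle items and split them into train/validation/test at a 4:1:1 ratio.

    Assumptions: items are shuffled with a fixed seed, and any remainder
    from integer division ends up in the test set.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = n * 4 // 6
    n_val = n // 6
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test
```

For example, `split_4_1_1(range(600))` yields three disjoint lists of 400, 100, and 100 items. Whether the original split is made over patients or visits is not stated in the table, so the unit passed in as `items` is left to the reader.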