Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Beyond Expectations: Quantile-Guided Alignment for Risk-Calibrated Language Models

Authors: Xinran Wang, Jin Du, Azal Khan, Qi Le, Enmao Diao, Jiawei Zhou, Jie Ding, Ali Anwar

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on conversation and code-generation tasks show that quantile alignment significantly enhances quality at targeted tails while maintaining overall performance. We evaluate quantile alignment on conversational and code-generation tasks, where model outputs range from benign to risky behaviors. Experiments were conducted on a single Nvidia A100 GPU.
Researcher Affiliation | Collaboration | (1) Department of Computer Science and Engineering, University of Minnesota; (2) Morph Mind AI; (3) Department of Applied Mathematics & Statistics, Stony Brook University; (4) School of Statistics, University of Minnesota
Pseudocode | Yes | Algorithmic Steps. Next, we summarize the procedure for solving the QA problem numerically. 1. Sampling. Draw $n$ samples $\{(x_\ell, y_\ell)\}_{\ell=1}^{n}$ from $p_0$. 2. Compute Indicator-Based Rewards. For each $\ell$, evaluate $\rho_{\tau_j,\kappa_j}(r(x_\ell, y_\ell))$ for each $\tau_j, \kappa_j$. 3. Dual Update. Initialize an $m$-dimensional $\lambda^{(0)} \geq 0$ and perform gradient ascent: $\lambda^{(t+1)} \leftarrow (\lambda^{(t)} + \eta \nabla_\lambda \hat{g}(\lambda^{(t)}))_+$ until convergence, where $(\cdot)_+$ denotes projection onto the nonnegative orthant, and $\eta > 0$ is the step size. If it diverges, we decide the constraints are infeasible. 4. Construct QA Reward. Once we obtain the dual solution $\lambda^*$, compute the effective reward: $R(x, y) = \sum_{j=1}^{m} \lambda^*_j \rho_{\tau_j,\kappa_j}(r(x, y))$. 5. Optimize $p$ based on the QA reward. Treat $R(x, y)$ as the reward function in the standard RLHF setting with $\beta = 1$ and apply a PPO solver to update from $p_0$ to $p$.
Open Source Code | Yes | Code is uploaded in the supplementary material.
Open Datasets | Yes | For the conversational task, we use prompts from the Anthropic Harmless dataset [35], which contains human requests formatted between "Human:" and "Assistant:". ... For the code-generation task, we employ the HUMANEVAL dataset [5], a standard benchmark that consists of Python programming tasks.
Dataset Splits | No | The paper mentions using prompts from the Anthropic Harmless dataset and the HUMANEVAL dataset. It states, "Each result is computed over the full evaluation set, so variability due to sampling is negligible; we therefore omit error bars." However, it does not specify explicit training, validation, or test splits for these datasets, so neither the data partitioning nor the way the models were aligned on these datasets can be reproduced.
Hardware Specification | Yes | Experiments were conducted on a single Nvidia A100 GPU.
Software Dependencies | No | This enables us to leverage existing RLHF solvers, such as the Proximal Policy Optimization (PPO) algorithm implemented in the TRL package [32]. The paper mentions the TRL package but does not provide a specific version number for TRL or any other key software dependencies, which is required for reproducibility.
Experiment Setup | No | The paper mentions that the effective reward can be used in standard RLHF with "inverse temperature $\beta = 1$" and that for MORL, "the dual weight vector $\lambda$ is generated as $\lambda = s \cdot u$, where $s$ is uniformly sampled from (0, 6) and $u$ is sampled from the probability simplex". However, it does not provide specific hyperparameters for the PPO solver itself, such as learning rate, batch size, number of epochs, or the specific optimizer used, which are crucial for replicating the experimental setup.
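
Steps 1 to 4 quoted in the Pseudocode row can be illustrated with a minimal sketch. This page gives neither the exact form of the indicator-based reward nor the dual objective, so both are assumptions here: the reward is taken to be kappa - 1{r <= tau}, and the empirical dual is taken to be the KL-regularized form -log mean(exp(sum_j lambda_j * rho_j)) with beta = 1. The function names, step size, tolerance, and divergence bound are hypothetical and are not taken from the authors' supplementary code. Step 5 (PPO) is omitted.

```python
import math


def indicator_reward(r, tau, kappa):
    # ASSUMED form of rho_{tau,kappa}: kappa - 1{r <= tau}.
    # The paper's exact definition may differ.
    return kappa - (1.0 if r <= tau else 0.0)


def build_rho(rewards, constraints):
    # Step 2: one row per sample, one column per (tau_j, kappa_j) pair.
    return [[indicator_reward(r, tau, kappa) for tau, kappa in constraints]
            for r in rewards]


def dual_gradient(lam, rho):
    # Gradient of the ASSUMED empirical dual
    #   g(lam) = -log mean_l exp(sum_j lam_j * rho[l][j]),
    # which equals minus the exponentially tilted mean of rho.
    scores = [sum(l * x for l, x in zip(lam, row)) for row in rho]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [-sum(w * row[j] for w, row in zip(weights, rho)) / total
            for j in range(len(lam))]


def solve_dual(rho, eta=0.5, max_iter=5000, tol=1e-6, bound=100.0):
    # Step 3: projected gradient ascent on lam >= 0.
    # Returns (lam, converged); converged=False signals likely infeasibility.
    lam = [0.0] * len(rho[0])
    for _ in range(max_iter):
        grad = dual_gradient(lam, rho)
        new = [max(0.0, l + eta * g) for l, g in zip(lam, grad)]
        if max(new) > bound:  # iterates keep growing: treat as divergence
            return new, False
        if max(abs(a - b) for a, b in zip(new, lam)) < tol:
            return new, True
        lam = new
    return lam, False


def qa_reward(r, lam, constraints):
    # Step 4: effective reward R = sum_j lam_j * rho_{tau_j,kappa_j}(r).
    return sum(l * indicator_reward(r, tau, kappa)
               for l, (tau, kappa) in zip(lam, constraints))
```

Under these assumptions, a constraint that the reference samples already satisfy keeps its multiplier at zero, while a violated constraint receives a positive multiplier that penalizes outputs whose reward falls at or below the threshold. A call such as `solve_dual(build_rho(rewards, [(0.2, 0.1)]))` would then return the multiplier used by `qa_reward`.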