Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Leveraging Importance Sampling to Detach Alignment Modules from Large Language Models

Authors: Yi Liu, Dianqing Liu, Mingye Zhu, Junbo Guo, Yongdong Zhang, Zhendong Mao

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experimental evaluations on two leading open-source LLMs across diverse tasks, including instruction following, domain adaptation, and preference optimization, demonstrate that our approach consistently outperforms baseline models.
Researcher Affiliation	Collaboration	(1) State Key Laboratory of Communication Content Cognition, People's Daily Online; (2) University of Science and Technology of China
Pseudocode	Yes	Proposing-Aligning-Reducing Sampling. Given the input $x$, the proposed strategy involves the following steps: 1. Propose: At step $l$, propose $n$ candidate tokens $y_l^1, \ldots, y_l^n$ independently from $P_M(y_l \mid y_{<l}, x)$ by nucleus sampling. 2. Align: Each candidate is assigned an importance weight, serving as an aligning indicator: $w(y_l^i) = \frac{Q_\theta(y_l^i \mid y_{<l}, x)}{Z_\theta(y_{<l}, x)}$ (10), where $i \in \{1, \ldots, n\}$. 3. Reduce: By introducing a normalizing factor $C = \sum_{i=1}^{n} w(y_l^i)$, these importance weights are normalized into a categorical distribution $\mathrm{Categorical}\left(\frac{w(y_l^1)}{C}, \ldots, \frac{w(y_l^n)}{C}\right)$. The candidates are then reduced to a single token through categorical sampling.
Open Source Code	Yes	We provide them in the supplemental materials. We provide an anonymous link/zip to our code which can be used for generating data and training models.
Open Datasets	Yes	For the instruction following task, we randomly selected approximately 120,000 conversations from UltraChat [8], using the first round of chats for training. We utilized the entire TL;DR Summarization dataset [36] for domain adaptation and the complete Anthropic-HH dataset [4] for preference optimization.
Dataset Splits	Yes	For the instruction following task, we randomly selected approximately 120,000 conversations from UltraChat [8], using the first round of chats for training. We evaluate our models using the widely recognized open-ended benchmark, AlpacaEval 2 [9], which assesses conversational capabilities across 805 questions sourced from five datasets. We utilized the entire TL;DR Summarization dataset [36] for domain adaptation and the complete Anthropic-HH dataset [4] for preference optimization. We evaluated model performance by randomly sampling 300 examples from both the helpful-base and harmless-base testing sets.
Hardware Specification	Yes	We used a node with 8 80GB A800 Nvidia GPUs.
Software Dependencies	No	The paper mentions "RMSprop optimizer" and "cosine learning rate schedule" but does not specify versions of any programming languages, libraries, or frameworks like Python, PyTorch, or TensorFlow.
Experiment Setup	Yes	We conduct preliminary experiments on each method to explore batch sizes of [32, 64, 128], learning rates of [1e-7, 2e-7, 5e-7, 1e-6], and training epochs of [1, 2, 3] using the UltraChat dataset. We find that a batch size of 64 and a single training epoch generally yield the best results across all methods, although the optimal learning rate varies. The SFT (including Aligner) and DPO training methods favor a larger learning rate of 1e-6, while our method, which introduces a gradient ascent term, prefers a smaller learning rate of 2e-7. Consequently, we fix these parameters for all subsequent experiments. Additionally, we set the maximum sequence length to 2048 and apply a cosine learning rate schedule with 10% warmup steps for the preference optimization dataset. For the Aligner, due to its reliance on reference answers, the maximum sequence length is extended to 3072, and we warm up the Aligner using around 10K examples. All models are trained using the RMSprop optimizer.
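
The Proposing-Aligning-Reducing steps quoted in the Pseudocode row can be illustrated with a small sketch. This is not the authors' implementation: the function names `nucleus_sample` and `par_step` are hypothetical, and plain Python lists of per-token values stand in for the outputs of the proposal model and the alignment module. Because the normalizing term is shared by all candidates at a given step, it cancels once the weights are divided by C, so the sketch takes unnormalized alignment scores directly.

```python
import random
from typing import List, Sequence


def nucleus_sample(probs: Sequence[float], top_p: float, rng: random.Random) -> int:
    """Sample a token id from the smallest high-probability set whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda t: probs[t], reverse=True)
    kept: List[int] = []
    mass = 0.0
    for t in order:
        kept.append(t)
        mass += probs[t]
        if mass >= top_p:
            break
    return rng.choices(kept, weights=[probs[t] for t in kept], k=1)[0]


def par_step(
    proposal_probs: Sequence[float],   # assumed: proposal model's next-token distribution
    aligner_scores: Sequence[float],   # assumed: alignment module's score per token (unnormalized)
    n: int,
    top_p: float,
    rng: random.Random,
) -> int:
    """One decoding step: propose n candidates, weight them, reduce to one token."""
    # Propose: draw n candidates independently from the proposal distribution.
    candidates = [nucleus_sample(proposal_probs, top_p, rng) for _ in range(n)]

    # Align: each candidate gets an importance weight from the alignment module.
    weights = [aligner_scores[c] for c in candidates]

    # Reduce: normalize by C and sample one candidate from the categorical distribution.
    c_norm = sum(weights)
    if c_norm <= 0.0:
        # Degenerate case (all weights zero): fall back to a uniform choice.
        return rng.choice(candidates)
    normalized = [w / c_norm for w in weights]
    return rng.choices(candidates, weights=normalized, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    token = par_step([0.5, 0.3, 0.2], [0.1, 0.8, 0.1], n=8, top_p=0.95, rng=rng)
    print("selected token id:", token)
```

Note that the alignment scores only re-rank tokens the proposal model has already put forward: a token that is never proposed cannot be selected, however high its score.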