Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Semantic-guided Diverse Decoding for Large Language Model

Authors: Weijie Shi, Yue Cui, Yaguang Wu, Jingzhi Fang, Shibo Zhang, Mengze Li, Sirui Han, Jia Zhu, Jiajie Xu, Xiaofang Zhou

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show SemDiD consistently outperforms existing methods, improving Best-of-N coverage by 1.4-5.2% across diverse tasks and accelerating RLHF training convergence by 15% while increasing accuracy by up to 2.1%.
Researcher Affiliation | Collaboration | ¹The Hong Kong University of Science and Technology, ²MetaX, ³Alibaba Group, ⁴Zhejiang Normal University, ⁵Soochow University
Pseudocode | Yes | C.1 SemDiD Algorithm. Algorithm 1 provides a detailed overview of the Semantic-guided Diverse Decoding (SemDiD) procedure.
Open Source Code | Yes | The code is available at https://github.com/shiweijiezero/SemDiD.
Open Datasets | Yes | Datasets. We evaluate the effectiveness of SemDiD in Best-of-N settings across diverse tasks: Reasoning tasks include ARC-Challenge [27], Big Bench Hard (BBH) [28], GSM8K [29], and Minerva Math [30]. Question answering tasks include CoQA [31], PubMedQA [32], and MMLUPro+ [33]. Machine translation tasks include WMT16 [34] (English-German, German-English). ... For our experiments, we employ Qwen-2.5-7B and Pythia-1B as base models, training them on the mathematical dataset GSM8K⁴ and the summarization dataset TLDR⁵. ... ⁴ https://huggingface.co/datasets/openai/gsm8k ⁵ https://huggingface.co/datasets/trl-lib/tldr
Dataset Splits | No | No explicit train/test/validation dataset splits are provided for the main datasets used in the experiments. Section A.1 mentions 'selecting 500 problems from each' for a specific analysis, and Section 4.1.2 refers to 'test examples' without detailing the split methodology or sizes.
Hardware Specification | Yes | All experiments were conducted on a cluster of 8 NVIDIA H800 GPUs.
Software Dependencies | No | The paper mentions specific LLM and embedding models used (e.g., Qwen-2.5-3B, Pythia-1B, NovaSearch/stella_en_1.5B_v5) but does not provide specific version numbers for general software dependencies like Python, PyTorch, or CUDA.
Experiment Setup | Yes | Table 2 presents the default hyperparameters used in our experiments. These values were determined through extensive grid search optimization on a held-out validation set. All experiments were conducted on a cluster of 8 NVIDIA H800 GPUs.
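
The Software Dependencies entry above is marked "No" because the paper gives no version numbers for Python, PyTorch, or CUDA. As an illustration only, the following minimal sketch shows one way such versions could be recorded alongside experiment results. It is not taken from the paper or its repository: the function names `collect_versions`, `cuda_version`, and `format_report` are hypothetical, and the package list in the usage line is an assumption about what a typical LLM decoding project might depend on.

```python
import platform
from importlib import metadata


def collect_versions(packages):
    """Map 'python', 'platform', and each requested package to a version string."""
    versions = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            # Reads the version from the installed distribution's metadata.
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


def cuda_version():
    """Return the CUDA version PyTorch was built against, or None if unavailable."""
    try:
        import torch  # optional dependency; only needed for this lookup
    except ImportError:
        return None
    return torch.version.cuda  # None for CPU-only builds


def format_report(versions):
    """Render the mapping as sorted 'name==version' lines."""
    return "\n".join(f"{name}=={ver}" for name, ver in sorted(versions.items()))


if __name__ == "__main__":
    # Hypothetical package list; adjust to the project's actual dependencies.
    print(format_report(collect_versions(["torch", "transformers", "numpy"])))
```

Saving this report next to each run's outputs would let a reader reconstruct the software environment even when the paper itself omits it.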