Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Semantic-guided Diverse Decoding for Large Language Model
Authors: Weijie Shi, Yue Cui, Yaguang Wu, Jingzhi Fang, Shibo Zhang, Mengze Li, Sirui Han, Jia Zhu, Jiajie Xu, Xiaofang Zhou
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show SemDiD consistently outperforms existing methods, improving Best-of-N coverage by 1.4-5.2% across diverse tasks and accelerating RLHF training convergence by 15% while increasing accuracy by up to 2.1%. |
| Researcher Affiliation | Collaboration | ¹The Hong Kong University of Science and Technology, ²MetaX, ³Alibaba Group, ⁴Zhejiang Normal University, ⁵Soochow University |
| Pseudocode | Yes | C.1 SemDiD Algorithm: Algorithm 1 provides a detailed overview of the Semantic-guided Diverse Decoding (SemDiD) procedure. |
| Open Source Code | Yes | The code is available at https://github.com/shiweijiezero/SemDiD. |
| Open Datasets | Yes | Datasets. We evaluate the effectiveness of SemDiD in Best-of-N settings across diverse tasks: Reasoning tasks include ARC-Challenge [27], Big Bench Hard (BBH) [28], GSM8K [29], and Minerva Math [30]. Question answering tasks include CoQA [31], PubMedQA [32], and MMLUPro+ [33]. Machine translation tasks include WMT16 [34] (English-German, German-English). ... For our experiments, we employ Qwen-2.5-7B and Pythia-1B as base models, training them on the mathematical dataset GSM8K⁴ and the summarization dataset TLDR⁵. ... ⁴https://huggingface.co/datasets/openai/gsm8k ⁵https://huggingface.co/datasets/trl-lib/tldr |
| Dataset Splits | No | No explicit train/test/validation dataset splits are provided for the main datasets used in the experiments. Section A.1 mentions 'selecting 500 problems from each' for a specific analysis, and Section 4.1.2 refers to 'test examples' without detailing the split methodology or sizes. |
| Hardware Specification | Yes | All experiments were conducted on a cluster of 8 NVIDIA H800 GPUs. |
| Software Dependencies | No | The paper mentions specific LLM and embedding models used (e.g., Qwen-2.5-3B, Pythia-1B, NovaSearch/stella_en_1.5B_v5) but does not provide specific version numbers for general software dependencies like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | Table 2 presents the default hyperparameters used in our experiments. These values were determined through extensive grid search optimization on a held-out validation set. All experiments were conducted on a cluster of 8 NVIDIA H800 GPUs. |
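
The Research Type row reports gains in Best-of-N coverage without defining the metric. The following is a minimal sketch, assuming coverage means the fraction of problems for which at least one of the N sampled candidates is judged correct; the paper's exact definition is not given on this page. The function name `best_of_n_coverage` and the sample flags are hypothetical and are not taken from the paper or its repository.

```python
from typing import Sequence


def best_of_n_coverage(correct: Sequence[Sequence[bool]]) -> float:
    """Return the fraction of problems with at least one correct candidate.

    correct[i][j] is True when candidate j for problem i is judged correct.
    Assumption: this "any candidate correct" reading of coverage is the
    common one; the paper may define the metric differently.
    """
    if not correct:
        raise ValueError("need at least one problem")
    solved = sum(1 for candidates in correct if any(candidates))
    return solved / len(correct)


# Hypothetical correctness flags: 4 problems, N = 3 candidates each.
flags = [
    [False, True, False],   # solved by the second candidate
    [False, False, False],  # not solved by any candidate
    [True, True, False],    # solved (counted once, not twice)
    [False, False, True],   # solved by the third candidate
]
coverage = best_of_n_coverage(flags)
```

Under this reading, a decoding method that produces more diverse candidates can raise coverage even when the accuracy of any single candidate is unchanged, because a problem counts as solved as soon as one candidate is correct.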