Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Mixture of Inputs: Text Generation Beyond Discrete Token Sampling

Authors: Yufan Zhuang, Liyuan Liu, Chandan Singh, Jingbo Shang, Jianfeng Gao

NeurIPS 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We evaluate MOI across a range of tasks including mathematical reasoning, code generation, and graduate-level question answering where maintaining uncertainty can play a crucial role in step-by-step inference. Across these domains, MOI brings consistent performance improvements for multiple models including QwQ-32B, Nemotron-Super-49B, Gemma-3-27B, and DAPO-Qwen-32B. [...] Table 1 reports accuracy on four reasoning-intensive benchmarks for four open-source LLMs.
Researcher Affiliation Collaboration Yufan Zhuang1, Liyuan Liu2, Chandan Singh2, Jingbo Shang1, and Jianfeng Gao2 1UC San Diego 2Microsoft Research
Pseudocode Yes Algorithm 1: Mixture of Inputs
Open Source Code Yes Code is available at: https://github.com/EvanZhuang/mixinputs.
Open Datasets Yes AIME [26] consists of complex high-school level mathematical problems... CountDown4 [27] is a synthetic numerical reasoning task... LiveCodeBench [28] is a dynamic and realistic code generation benchmark... GPQA [29] is a highly challenging multiple-choice question answering benchmark... We assembled three prompt pools: (1) binary sentiment analysis on Rotten Tomatoes [36], SST2 [37], and IMDB [38]... (2) 6-class emotion classification on the Emotion dataset [39]... and (3) 3-class financial sentiment on the Financial Phrasebank [40]...
Dataset Splits Yes AIME [26] consists of complex high-school level mathematical problems that often require multiple stages of symbolic reasoning, algebraic manipulation, and geometric insight. We use the official AIME datasets from 2022 to 2024 and evaluate models based on exact match accuracy... CountDown4 [27] is a synthetic numerical reasoning task... LiveCodeBench [28] is a dynamic and realistic code generation benchmark... GPQA [29] is a highly challenging multiple-choice question answering benchmark... We conducted additional experiments on MT-Bench [43], using the four larger models...
Hardware Specification No We implement MOI on top of the vLLM framework [47], which supports efficient tensor parallelism. Mixing weights are computed from both the output token and the associated logits after each generation step. The resulting mixed inputs are cached and used as input for the subsequent decoding step.
Software Dependencies No We implement MOI on top of the vLLM framework [47], which supports efficient tensor parallelism. [...] We fit a Random Forest Regressor from scikit-learn [44] with 100 trees that have unrestricted depth.
Experiment Setup Yes We perform 5 runs for all experiments and report the average. For AIME and CountDown4, we perform hyperparameter grid search on baselines, Direct Mixture and MOI with β ∈ {1/2, 1, 2, 4, 8}, T ∈ {0.6, 0.8, 1} and top-p ∈ {0.4, 0.6, 0.8, 0.95}. We report the mean result of the best configuration for all three methods. We investigate the importance of these hyperparameters in Section 6.2. For GPQA-Diamond and LiveCodeBench, we use the universal hyperparameter for all of them with T = 0.6, top-p = 0.95, β = 1; more details can be found in Appendix F. [...] Table A5 lists the full search space for AIME and CountDown4, along with the universal settings used for GPQA-Diamond and LiveCodeBench.
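
The search procedure quoted in the Experiment Setup row can be sketched as follows. This is a minimal illustration, not the authors' code: the grid values, the universal setting, and the 5-run average come from the quoted text, while the names `grid`, `best_config`, and the `run_once(config, seed)` callable are hypothetical placeholders for whatever evaluation harness is actually used.

```python
from itertools import product
from statistics import mean

# Search space quoted in the Experiment Setup row.
BETAS = [0.5, 1, 2, 4, 8]
TEMPERATURES = [0.6, 0.8, 1.0]
TOP_PS = [0.4, 0.6, 0.8, 0.95]
N_RUNS = 5  # "We perform 5 runs for all experiments and report the average."

# Universal setting quoted for the benchmarks that are not grid-searched.
UNIVERSAL = {"beta": 1, "temperature": 0.6, "top_p": 0.95}


def grid():
    """Yield every (beta, temperature, top_p) combination as a dict."""
    for beta, temperature, top_p in product(BETAS, TEMPERATURES, TOP_PS):
        yield {"beta": beta, "temperature": temperature, "top_p": top_p}


def best_config(run_once, n_runs=N_RUNS):
    """Return (config, mean_score) for the config with the highest mean score.

    run_once(config, seed) -> float is a hypothetical callable supplied by the
    caller; it stands in for a single evaluation run and is an assumption of
    this sketch, not part of the paper's released code.
    """
    scored = (
        (cfg, mean(run_once(cfg, seed) for seed in range(n_runs)))
        for cfg in grid()
    )
    return max(scored, key=lambda pair: pair[1])
```

The grid contains 5 × 3 × 4 = 60 configurations, so at 5 runs each a full sweep costs 300 evaluation runs per method and benchmark; the universal setting is one point of that grid.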