Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

DISC: Dynamic Decomposition Improves LLM Inference Scaling

Authors: Jonathan Light, Wei Cheng, Benjamin Riviere, Yue Wu, Masafumi Oyamada, Mengdi Wang, Yisong Yue, Santiago Paternain, Haifeng Chen

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments on benchmarks such as APPS, MATH, and LiveCodeBench demonstrate that dynamic decomposition outperforms static approaches, including token-level, sentence-level, and single-step decompositions, reducing the pass@10 error rate by 5.0%, 6.7%, and 10.5%, respectively.
Researcher Affiliation	Collaboration	Jonathan Light1,2,4, Wei Cheng2B, Benjamin Riviere4, Yue Wu3, Masafumi Oyamada5, Mengdi Wang3, Yisong Yue4, Santiago Paternain1, Haifeng Chen2; 1Rensselaer Polytechnic Institute, 2NEC Laboratories America, 3Princeton University, 4California Institute of Technology, 5NEC Corporation
Pseudocode	Yes	Algorithm 1: DISC with Greedy Search; Algorithm 2: Dynamic Decomposition; Algorithm 3: DISC: Decomposition for Plug-and-Play Search
Open Source Code	No	The code and data are not yet publicly available at submission time due to anonymity constraints. However, we commit to releasing the full implementation including scripts for running DISC, benchmark evaluation, and reproduction of all figures and tables upon publication. The code will be made available on GitHub with comprehensive instructions to ensure faithful reproduction of results.
Open Datasets	Yes	Benchmarks. We evaluate DISC on three benchmarks: APPS, MATH, and LiveCodeBench, to assess its impact on inference scaling for both coding and reasoning. APPS [18] consists of 5000 competitive programming problems... MATH [35] comprises 12,500 math problems... LiveCodeBench [38] is a continuously updated dataset from Leetcode, AtCoder, and CodeForces...
Dataset Splits	Yes	We evaluate on a 200-problem subset due to computational constraints. We test on a 500-problem subset (MATH500), identical to prior work [37, 23]. We evaluate on the 108 problems uploaded between 10/01/2024 and 12/01/2024 to prevent contamination.
Hardware Specification	Yes	For open-source model experiments (e.g., LLaMA-3.1-8B-Instruct, Mistral, Qwen), inference was conducted on a single NVIDIA A100 GPU. All such experiments were executed sequentially on this GPU, which has 80GB of memory, ensuring consistent and reproducible runtime characteristics across model families.
Software Dependencies	No	No specific software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions) are explicitly mentioned in the paper.
Experiment Setup	Yes	We use α0 = 0.15, σ = 1.0, and temperature τ = 0.2 by default for DISC unless otherwise specified.
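
The Research Type row above reports results as a pass@10 error rate. For readers unfamiliar with the metric, the following is a minimal sketch of how such a number is commonly computed, assuming the standard unbiased pass@k estimator used in code-generation evaluation. This page does not state how the paper itself computes pass@10, so the function names, the input format, and the choice of estimator are all illustrative assumptions and not the authors' implementation.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem (assumed standard unbiased estimator).

    n: number of samples generated for the problem
    c: number of those samples that are correct
    k: number of samples drawn without replacement
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) counts the draws of k samples that are all incorrect;
    # it is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_at_k_error_rate(results: list[tuple[int, int]], k: int) -> float:
    """Return 1 - mean pass@k over problems.

    results: hypothetical input format, one (n, c) pair per problem.
    """
    if not results:
        raise ValueError("results must not be empty")
    scores = [pass_at_k(n, c, k) for n, c in results]
    return 1.0 - sum(scores) / len(scores)
```

Under this sketch, a benchmark with two problems, one never solved and one always solved in 10 samples, would be passed as `[(10, 0), (10, 10)]` with `k=10`.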