Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Lost in Transmission: When and Why LLMs Fail to Reason Globally

Authors: Tobias Schnabel, Kiran Tomlinson, Adith Swaminathan, Jennifer Neville

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments corroborate our theoretical predictions: GPT-4o, Claude, and Gemini succeed on BAPO-easy tasks and fail even on relatively small BAPO-hard tasks.
Researcher Affiliation | Industry | Tobias Schnabel (Microsoft Research); Kiran Tomlinson (Microsoft Research); Adith Swaminathan (Netflix); Jennifer Neville (Microsoft Research)
Pseudocode | No | The paper describes computational steps and algorithms in prose and mathematical notation (e.g., in Theorem 8 and Definition 1) but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is available at https://github.com/microsoft/bapo.
Open Datasets | Yes | We used hotel reviews from the SPACE dataset [2]. ... Available at https://github.com/stangelid/qt under an MIT License.
Dataset Splits | No | For all problems, we generate 100 i.i.d. instances and report average accuracy along with the 95% t-test confidence interval. ... We use a grid of n ∈ {6, 50, 100, 200} but resulting list lengths might deviate slightly if a problem requires odd numbers. We generated an equal number of positive and negative instances where applicable.
Hardware Specification | No | The experiments took 1 day and $400 of API credits to run ($93 of which were for o3 alone), with preliminary experiments taking an additional $150 of API credits. The paper mentions API credits, implying cloud-based execution, but does not specify particular CPU or GPU models, memory, or other hardware details.
Software Dependencies | Yes | GPT 4o: gpt-4o-2024-11-20 ... Claude 3.5 Sonnet: claude-3-5-sonnet-20241022 ... Gemini 1.5 Pro: gemini-1.5-pro-002
Experiment Setup | Yes | All models were forced to output a pre-set JSON schema... Temperature: 0... We generated an equal number of positive and negative instances where applicable. For the chain of thought variants, we pre-pended the following instructions: Think step by step on the CoT, but stay under 250 words.
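
The Dataset Splits entry quotes the paper as reporting average accuracy over 100 i.i.d. instances together with a 95% t-test confidence interval. The sketch below shows one way such an interval could be computed; it is an illustration, not the authors' code. The function name `accuracy_with_ci`, the hard-coded critical value, and the example outcomes (80 correct out of 100) are assumptions made for this example and are not results from the paper.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t with 99 degrees of freedom
# (i.e. 100 instances). Hard-coded so the sketch needs only the standard
# library; this constant is an assumption of the example.
T_CRIT_95_DF99 = 1.9842


def accuracy_with_ci(correct):
    """Return (mean accuracy, CI half-width) for a list of 0/1 outcomes."""
    n = len(correct)
    if n != 100:
        raise ValueError("the critical value above assumes exactly 100 instances")
    mean = statistics.fmean(correct)
    # Standard error of the mean, using the sample standard deviation.
    sem = statistics.stdev(correct) / math.sqrt(n)
    return mean, T_CRIT_95_DF99 * sem


# Hypothetical outcomes: 80 correct answers out of 100 instances.
outcomes = [1] * 80 + [0] * 20
mean, half_width = accuracy_with_ci(outcomes)
print(f"accuracy = {mean:.3f} +/- {half_width:.3f}")
```

For a different number of instances, the critical value would need to be looked up for the matching degrees of freedom (n - 1), for example from a statistics library.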