Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
CodeCrash: Exposing LLM Fragility to Misleading Natural Language in Code Reasoning
Authors: Man Ho Lam, Chaozheng Wang, Jen-Tse Huang, Michael R. Lyu
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We introduce CODECRASH, a stress-testing framework with 1,279 questions from CRUXEVAL and LIVECODEBENCH, designed to evaluate reasoning reliability under structural perturbations and misleading natural language (NL) contexts. Through a systematic evaluation of 17 LLMs, we find that models often shortcut reasoning by over-relying on NL cues, leading to an average performance degradation of 23.2% in output prediction tasks. Even with Chain-of-Thought reasoning, models on average still have a 13.8% drop due to distractibility and rationalization, revealing a lack of critical reasoning capability to distinguish the actual code behaviors. While Large Reasoning Models with internal reasoning mechanisms improve robustness by fostering critical thinking, plausible yet incorrect hints can trigger pathological self-reflection, causing 2-3 times token consumption and even catastrophic cognitive dissonance in extreme cases for QwQ-32B. We refer to this phenomenon as Reasoning Collapse. CODECRASH provides a rigorous benchmark for evaluating robustness in code reasoning, guiding future research and development toward more reliable and resilient models. |
| Researcher Affiliation | Academia | Man Ho Lam The Chinese University of Hong Kong EMAIL Chaozheng Wang The Chinese University of Hong Kong EMAIL Jen-tse Huang Johns Hopkins University EMAIL Michael R. Lyu The Chinese University of Hong Kong EMAIL |
| Pseudocode | No | The paper includes 'Code' examples (e.g., Code 1, Code 2, etc.) which are actual Python snippets, and a pipeline diagram (Figure 1), but no section or figure explicitly labeled 'Pseudocode' or 'Algorithm' with structured, language-agnostic steps. The NeurIPS checklist for 'Theory assumptions and proofs' also states: 'This paper does not include theoretical analysis.' |
| Open Source Code | Yes | Answer: [Yes] Justification: Please refer to the supplementary materials. We provide the source code and have recorded all raw experimental results. |
| Open Datasets | Yes | We introduce CODECRASH, a stress-testing framework with 1,279 questions from CRUXEVAL and LIVECODEBENCH, designed to evaluate reasoning reliability under structural perturbations and misleading natural language (NL) contexts. We adopt input and output predictions from CRUXEVAL (CRUX) (Gu et al., 2024) and extend them with LIVECODEBENCH (LCB) (Jain et al., 2025) to cover real-world programs. We use the publicly released CRUXEVAL (Gu et al., 2024) and LIVECODEBENCH (Jain et al., 2025) datasets, both of which are cited appropriately in the paper and used in accordance with their licenses or terms of use. |
| Dataset Splits | No | The paper states it uses '1,279 questions from CRUXEVAL and LIVECODEBENCH', and provides the total number of problems for each (CRUX contains 800 synthetic problems, LCB contains 479 real-world coding problems). However, it does not explicitly provide any training/test/validation splits of these problems for its experiments. It evaluates pre-trained LLMs on these problem sets without defining further internal splits. |
| Hardware Specification | No | All experiments were conducted via API access (OpenAI, Anthropic, Gemini, DeepInfra, and Qwen), and thus no information regarding memory, device type, or execution time is applicable. We record all necessary details, including model versions, input prompts, and output responses. |
| Software Dependencies | No | The paper mentions using 'Python' implicitly through code examples but does not specify a version number or list any other key software components (libraries, frameworks, solvers) with their specific version numbers that would be necessary for reproducing the experimental setup. It mentions model configurations like 'nucleus sampling (temperature=0.2, top-p=0.95)' which are model parameters, not ancillary software dependencies with version numbers. |
| Experiment Setup | Yes | Model Configurations. Following LCB, we use nucleus sampling (temperature=0.2, top-p=0.95) with a maximum of 200 tokens for direct inference and 2,000 for CoT prompting, discarding excessively long outputs. Due to resource constraints, we generate three candidates for direct inference and one for CoT inference; we provide a result stability analysis under N = 3 in Appendix C. Following CRUX, we employ 2-shot prompting for direct inference and 1-shot for CoT step-by-step execution (details provided in Appendix D). |
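
The reported experiment setup can be summarized as a small configuration sketch. This is an illustration only, not the authors' code: the `InferenceConfig` class, its field names, and the `total_generations` helper are hypothetical, while the numeric values are the ones quoted in the table above (sampling parameters, token limits, candidate counts, shot counts, and the 800 CRUX plus 479 LCB problems).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InferenceConfig:
    """Hypothetical container for the settings quoted in the table."""
    mode: str
    temperature: float
    top_p: float
    max_tokens: int
    n_candidates: int
    n_shots: int


# Values as reported under "Experiment Setup".
DIRECT = InferenceConfig("direct", temperature=0.2, top_p=0.95,
                         max_tokens=200, n_candidates=3, n_shots=2)
COT = InferenceConfig("cot", temperature=0.2, top_p=0.95,
                      max_tokens=2000, n_candidates=1, n_shots=1)

# Problem counts as reported under "Dataset Splits".
BENCHMARK_SIZES = {"CRUX": 800, "LCB": 479}


def total_generations(cfg: InferenceConfig, sizes=BENCHMARK_SIZES) -> int:
    """Candidates produced in one pass over every question.

    Assumes one pass means a single model, task, and perturbation setting.
    """
    return cfg.n_candidates * sum(sizes.values())


if __name__ == "__main__":
    print(sum(BENCHMARK_SIZES.values()))   # 1279 questions in total
    print(total_generations(DIRECT))       # 3 candidates per question
    print(total_generations(COT))          # 1 candidate per question
```

Under these assumptions, one direct-inference pass produces three times as many generations as a CoT pass over the same 1,279 questions, which is consistent with the stability analysis being reported for N = 3.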