Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

When Thinking Fails: The Pitfalls of Reasoning for Instruction-Following in LLMs

Authors: Xiaomin Li, Zhou Yu, Zhiwei Zhang, Xupeng Chen, Ziji Zhang, Yingying Zhuang, Narayanan Sadagopan, Anurag Beniwal

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Evaluating 20+ models on two benchmarks: IFEval (with simple, rule-verifiable constraints) and ComplexBench (with complex, compositional constraints), we consistently observe performance drops when CoT prompting is applied.
Researcher Affiliation | Collaboration | Xiaomin Li (Harvard University); Zhou Yu (Amazon); Zhiwei Zhang (Amazon); Xupeng Chen (NYU); Ziji Zhang (Amazon); Yingying Zhuang (Amazon); Narayanan Sadagopan (Amazon); Anurag Beniwal (Amazon)
Pseudocode | No | The paper presents mathematical formulas for calculating attention scores and defines a 'Constraint Attention' metric, but it does not include any clearly labeled pseudocode or algorithm blocks. Appendix F contains 'Prompt Templates', which are textual instructions, not algorithmic pseudocode.
Open Source Code | Yes | Code and data for reproducing experiments are available at: https://github.com/amazon-science/when-thinking-fails-RLLM-if-evaluation.
Open Datasets | Yes | We use two benchmark datasets, IFEval and ComplexBench, to comprehensively evaluate the instruction-following capabilities of language models: IFEval [Zhou et al., 2023] consists of prompts with simple, independently verifiable constraints... In contrast, ComplexBench [Wen et al., 2024] includes instructions formed through compositional logic...
Dataset Splits | Yes | We split the dataset evenly, using 50% of the samples for training and the remaining 50% to evaluate downstream mitigation effectiveness.
Hardware Specification | Yes | Open-source models are run without quantization using 4 NVIDIA-H100-80GB GPUs. ... For each classifier, we perform full fine-tuning using a single NVIDIA-H100-80GB GPU.
Software Dependencies | No | The paper mentions various LLMs used for evaluation (e.g., Llama, Mixtral, Qwen, Claude, DeepSeek) and specifies Qwen2.5-7B-Instruct as the backbone for the classifier, but it does not provide specific version numbers for programming languages or libraries (such as Python or PyTorch) required to reproduce the experiments.
Experiment Setup | Yes | All model inferences use a temperature of 0. ... The model is trained for 3 epochs with a learning rate of 1e-5.
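
The Dataset Splits and Experiment Setup entries above can be read together as a small configuration plus an even train/evaluation split. The following is a minimal sketch under stated assumptions, not the authors' code: the names `ReportedSetup` and `split_evenly` are hypothetical, and the shuffle and the seed are assumptions, since the quoted text does not say how samples were assigned to each half. Only the numeric values and the backbone name are taken from the entries above.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedSetup:
    """Values quoted in the entries above; the field names are hypothetical."""
    inference_temperature: float = 0.0
    classifier_backbone: str = "Qwen2.5-7B-Instruct"
    train_epochs: int = 3
    learning_rate: float = 1e-5
    train_fraction: float = 0.5  # 50% training, 50% held out for evaluation


def split_evenly(samples, train_fraction=0.5, seed=0):
    """Shuffle a copy of `samples` and return (train, held_out).

    Assumption: a seeded shuffle is used so the split is repeatable.
    The source does not describe the assignment procedure.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    setup = ReportedSetup()
    train, held_out = split_evenly(range(10), setup.train_fraction)
    print(len(train), len(held_out))
```

With an odd number of samples, this sketch places the extra sample in the held-out half; whether the original split did the same is not stated.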