Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Comprehension Without Competence: Architectural Limits of LLMs in Symbolic Computation and Reasoning

Authors: Zheng Zhang

TMLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through controlled experiments and architectural analysis, we demonstrate that LLMs often articulate correct principles without reliably applying them, a failure rooted not in knowledge access but in computational execution. We provide empirical evidence for each architectural constraint through embedding analysis, layer-by-layer computation tracking, and systematic evaluation across arithmetic and logical tasks. (Section C: Metacognitive Assessment Experiments)
Researcher Affiliation | Industry | Zheng Zhang EMAIL This work was completed as a personal research project while employed at Amazon Web Services. The views expressed are those of the author and do not necessarily reflect those of Amazon.
Pseudocode | Yes | Algorithm 1: Idealized Metacognitive Control Loop
Open Source Code | Yes | Code repository: https://github.com/zzhang-cn/comprehension-without-competence
Open Datasets | Yes | ARC-AGI-2 details from pre-release technical reports available at https://github.com/fchollet/ARC-AGI
Dataset Splits | No | The paper does not provide traditional training/test/validation dataset splits. Instead, it evaluates pre-trained models on custom-generated problem instances. For example, Section C.1 states: "We evaluated two complexity levels: 5-digit numbers (range 10,000–99,999) and 10-digit numbers (range 1,000,000,000–9,999,999,999), with 20 problems per complexity level per model." This describes test-set generation rather than dataset splits.
Hardware Specification | No | The paper mentions evaluating models such as LLaMA2-7B-chat, Claude Sonnet 4, GPT-4o, and Gemini 2.5 Flash, but it does not specify the hardware the authors used to run their experiments.
Software Dependencies | No | The paper refers to the various large language models tested (e.g., Claude Sonnet 4, GPT-4o, Gemini 2.5 Flash, LLaMA2), but it does not specify any software dependencies with version numbers (e.g., Python, PyTorch, or CUDA versions) used to conduct the experiments.
Experiment Setup | Yes | To test the importance of metacognition, we designed a tightly controlled experiment using n-digit × n-digit multiplication tasks with place-value decomposition and simulated column-wise addition. We tested Claude Sonnet 4, GPT-4o, and Gemini 2.5 Flash across three conditions, with results summarized in Table 3: ... We evaluated two complexity levels: 5-digit numbers (range 10,000–99,999) and 10-digit numbers (range 1,000,000,000–9,999,999,999), with 20 problems per complexity level per model. A sample prompt for golden decomposition is also provided in Section C.2.
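The test-set generation described above is fully parameterized (two operand sizes, 20 problems per complexity level per model), so it is straightforward to reproduce even though the paper's own generation script is not quoted here. The following is a minimal sketch under those stated parameters; the function name, seeding scheme, and uniform sampling are assumptions of mine, not taken from the paper.

```python
import random

def generate_problems(digits: int, n_problems: int, seed: int = 0):
    """Sample n_problems multiplication instances with `digits`-digit operands.

    Assumes operands are drawn uniformly from the full n-digit range, e.g.
    10,000-99,999 for digits=5, matching the ranges quoted in Section C.1.
    A fixed seed keeps the test set reproducible across runs.
    """
    rng = random.Random(seed)
    lo, hi = 10 ** (digits - 1), 10 ** digits - 1
    return [(rng.randint(lo, hi), rng.randint(lo, hi)) for _ in range(n_problems)]

# Two complexity levels, 20 problems each, per the quoted setup.
test_set = {d: generate_problems(d, 20) for d in (5, 10)}
```

Each model would then be prompted on the same `test_set` under the three experimental conditions, so that any accuracy differences are attributable to the condition rather than to the problem instances.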