Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Comprehension Without Competence: Architectural Limits of LLMs in Symbolic Computation and Reasoning
Author: Zheng Zhang
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through controlled experiments and architectural analysis, we demonstrate that LLMs often articulate correct principles without reliably applying them, a failure rooted not in knowledge access but in computational execution. We provide empirical evidence for each architectural constraint through embedding analysis, layer-by-layer computation tracking, and systematic evaluation across arithmetic and logical tasks. (Section C, Metacognitive Assessment Experiments) |
| Researcher Affiliation | Industry | Zheng Zhang EMAIL This work was completed as a personal research project while employed at Amazon Web Services. The views expressed are those of the author and do not necessarily reflect those of Amazon. |
| Pseudocode | Yes | Algorithm 1 Idealized Metacognitive Control Loop |
| Open Source Code | Yes | Code repository: https://github.com/zzhang-cn/comprehension-without-competence. |
| Open Datasets | Yes | ARC-AGI-2 details from pre-release technical reports available at https://github.com/fchollet/ARC-AGI |
| Dataset Splits | No | The paper does not provide traditional training/validation/test dataset splits. Instead, it evaluates pre-trained models on custom-generated problem instances. For example, Section C.1 states: "We evaluated two complexity levels: 5-digit numbers (range 10,000–99,999) and 10-digit numbers (range 1,000,000,000–9,999,999,999), with 20 problems per complexity level per model." This describes test-set generation rather than dataset splits. |
| Hardware Specification | No | The paper mentions evaluating models such as LLaMA2-7B-chat, Claude Sonnet 4, GPT-4o, and Gemini 2.5 Flash, but it does not specify the hardware used by the authors to run their experiments. |
| Software Dependencies | No | The paper refers to various large language models (e.g., Claude Sonnet 4, GPT-4o, Gemini 2.5 Flash, LLaMA2) that were tested. However, it does not specify any software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions) used for conducting the experiments. |
| Experiment Setup | Yes | To test the importance of metacognition, we designed a tightly controlled experiment using n-digit × n-digit multiplication tasks with place-value decomposition and simulated column-wise addition. We tested Claude Sonnet 4, GPT-4o, and Gemini 2.5 Flash across three conditions, with results summarized in Table 3: ... We evaluated two complexity levels: 5-digit numbers (range 10,000–99,999) and 10-digit numbers (range 1,000,000,000–9,999,999,999), with 20 problems per complexity level per model. A sample prompt for golden decomposition is also provided in Section C.2. |
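Since the paper reports custom-generated test instances rather than dataset splits, its evaluation set can be regenerated from the quoted description alone. The following is a minimal sketch of that generation step, assuming uniform sampling of operand pairs; the ranges and the 20-problems-per-level count come from the quoted Section C.1, while the function name, seed, and sampling scheme are illustrative assumptions, not the authors' code.

```python
import random

# Complexity levels from Section C.1 of the paper: operand ranges
# for 5-digit and 10-digit multiplication problems.
RANGES = {
    5: (10_000, 99_999),
    10: (1_000_000_000, 9_999_999_999),
}
PROBLEMS_PER_LEVEL = 20  # per complexity level, per model (as quoted)


def generate_problems(digits, n=PROBLEMS_PER_LEVEL, seed=0):
    """Sample n operand pairs for n-digit x n-digit multiplication.

    Uniform sampling and the fixed seed are assumptions for
    reproducibility of this sketch, not details from the paper.
    """
    lo, hi = RANGES[digits]
    rng = random.Random(seed)
    return [(rng.randint(lo, hi), rng.randint(lo, hi)) for _ in range(n)]


if __name__ == "__main__":
    for digits in RANGES:
        problems = generate_problems(digits)
        a, b = problems[0]
        print(f"{digits}-digit level: {len(problems)} problems, "
              f"e.g. {a} * {b}")
```

Each generated pair would then be embedded in the paper's prompts (e.g., the golden-decomposition prompt of Section C.2) and sent to the evaluated models.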