Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Comprehension Without Competence: Architectural Limits of LLMs in Symbolic Computation and Reasoning

Authors: Zheng Zhang

TMLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through controlled experiments and architectural analysis, we demonstrate that LLMs often articulate correct principles without reliably applying them, a failure rooted not in knowledge access but in computational execution. We provide empirical evidence for each architectural constraint through embedding analysis, layer-by-layer computation tracking, and systematic evaluation across arithmetic and logical tasks. (Section C: Metacognitive Assessment Experiments)
Researcher Affiliation | Industry | Zheng Zhang EMAIL This work was completed as a personal research project while employed at Amazon Web Services. The views expressed are those of the author and do not necessarily reflect those of Amazon.
Pseudocode | Yes | Algorithm 1: Idealized Metacognitive Control Loop
Open Source Code | Yes | Code repository: https://github.com/zzhang-cn/comprehension-without-competence
Open Datasets | Yes | ARC-AGI-2 details from pre-release technical reports available at https://github.com/fchollet/ARC-AGI
Dataset Splits | No | The paper does not provide traditional training/test/validation dataset splits. Instead, it evaluates pre-trained models on custom-generated problem instances. For example, Section C.1 states: "We evaluated two complexity levels: 5-digit numbers (range 10,000–99,999) and 10-digit numbers (range 1,000,000,000–9,999,999,999), with 20 problems per complexity level per model." This describes test-set generation rather than dataset splits.
Hardware Specification | No | The paper mentions evaluating models such as LLaMA2-7B-chat, Claude Sonnet 4, GPT-4o, and Gemini 2.5 Flash, but it does not specify the hardware the authors used to run their experiments.
Software Dependencies | No | The paper refers to the various large language models tested (e.g., Claude Sonnet 4, GPT-4o, Gemini 2.5 Flash, LLaMA2), but it does not specify any software dependencies with version numbers (e.g., Python, PyTorch, or CUDA versions) used to conduct the experiments.
Experiment Setup | Yes | To test the importance of metacognition, we designed a tightly controlled experiment using n-digit × n-digit multiplication tasks with place-value decomposition and simulated column-wise addition. We tested Claude Sonnet 4, GPT-4o, and Gemini 2.5 Flash across three conditions, with results summarized in Table 3: ... We evaluated two complexity levels: 5-digit numbers (range 10,000–99,999) and 10-digit numbers (range 1,000,000,000–9,999,999,999), with 20 problems per complexity level per model. A sample prompt for golden decomposition is also provided in Section C.2.
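The test-set generation described above is fully parameterized (two operand sizes, 20 problems per complexity level per model), so it is straightforward to reproduce even though the paper's own generation script is not quoted here. The following is a minimal sketch under those stated parameters; the function name, seeding scheme, and uniform sampling are assumptions of mine, not taken from the paper.

```python
import random

def generate_problems(digits: int, n_problems: int, seed: int = 0):
    """Sample n_problems multiplication instances with `digits`-digit operands.

    Assumes operands are drawn uniformly from the full n-digit range, e.g.
    10,000-99,999 for digits=5, matching the ranges quoted in Section C.1.
    A fixed seed keeps the test set reproducible across runs.
    """
    rng = random.Random(seed)
    lo, hi = 10 ** (digits - 1), 10 ** digits - 1
    return [(rng.randint(lo, hi), rng.randint(lo, hi)) for _ in range(n_problems)]

# Two complexity levels, 20 problems each, per the quoted setup.
test_set = {d: generate_problems(d, 20) for d in (5, 10)}
```

Each model would then be prompted on the same `test_set` under the three experimental conditions, so that any accuracy differences are attributable to the condition rather than to the problem instances.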