Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Textural or Textual: How Vision-Language Models Read Text in Images

Authors: Hanzhang Wang, Qingyuan Ma

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We examine whether such models genuinely encode textual semantics or primarily rely on texture-based visual features. To disentangle orthographic form from meaning, we introduce the ToT dataset, which includes controlled word pairs that either share semantics with distinct appearances (synonyms) or share appearance with differing semantics (paronyms). A layerwise analysis of Intrinsic Dimension (ID) reveals that early layers exhibit competing dynamics between orthographic and semantic representations. In later layers, semantic accuracy increases as ID decreases, but this improvement largely stems from orthographic disambiguation. Notably, clear semantic differentiation emerges only in the final block, challenging the common assumption that semantic understanding is progressively constructed across depth.
Researcher Affiliation Academia School of Computer Engineering and Science, Shanghai University, Shanghai, China.
Pseudocode Yes Algorithm 1 Intrinsic Dimension Estimation Across Layers
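The layerwise ID analysis referenced above can be illustrated with a minimal sketch of a standard intrinsic-dimension estimator. The TwoNN estimator (Facco et al., 2017) shown here is an assumption for illustration; the paper's actual Algorithm 1 may differ in estimator and details.

```python
import numpy as np

def twonn_id(X):
    """TwoNN intrinsic-dimension estimate (MLE form).

    For each point, mu is the ratio of its second- to first-nearest-neighbor
    distance; the maximum-likelihood ID estimate is N / sum(log mu).
    """
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances (fine for small N; use a KD-tree at scale).
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    np.fill_diagonal(D, np.inf)          # exclude self-distances
    sorted_d = np.sort(D, axis=1)
    r1, r2 = sorted_d[:, 0], sorted_d[:, 1]
    mu = r2 / r1
    return len(X) / np.sum(np.log(mu))
```

In a layerwise analysis, this estimator would be applied to the activations extracted at each transformer block, yielding an ID profile across depth.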
Open Source Code Yes The code is available at: https://github.com/Ovsia/Textural-or-Textual
Open Datasets Yes We propose the ToT (Textural or Textual) dataset, derived from ImageNet-1k, which features 100 categories of common objects overlaid with texts of varying semantics. ... We perform cross-dataset evaluations using the respective test sets provided by each method. ... publicly available typographic attack datasets: Disentangle (Materzyńska et al., 2022), PAINT (Ilharco et al., 2022), and Prefix (Azuma & Matsui, 2023), which all feature handwritten text overlaid on notepads.
Dataset Splits Yes For each pair in subset 2 of the ToT dataset, we use 320 image samples for training and 80 for testing.
Hardware Specification Yes All of our experiments are conducted on a GeForce RTX 3090 GPU.
Software Dependencies No The paper does not explicitly list specific software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions).
Experiment Setup Yes We use a batch size of 512 and a learning rate of 1 × 10⁻⁴, with a weight decay of 0.2. The Adam optimizer is employed for fine-tuning.
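The reported hyperparameters map directly onto a PyTorch optimizer configuration. This is a sketch only: the `Linear` module stands in for the paper's actual model, which is not specified here.

```python
import torch

# Placeholder module standing in for the fine-tuned model (assumption).
model = torch.nn.Linear(512, 512)

# Hyperparameters as reported: lr = 1e-4, weight decay = 0.2, Adam optimizer.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=0.2)
```

Batch size 512 would be set on the data loader rather than the optimizer.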