Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Textural or Textual: How Vision-Language Models Read Text in Images

Authors: Hanzhang Wang, Qingyuan Ma

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We examine whether such models genuinely encode textual semantics or primarily rely on texture-based visual features. To disentangle orthographic form from meaning, we introduce the ToT dataset, which includes controlled word pairs that either share semantics with distinct appearances (synonyms) or share appearance with differing semantics (paronyms). A layerwise analysis of Intrinsic Dimension (ID) reveals that early layers exhibit competing dynamics between orthographic and semantic representations. In later layers, semantic accuracy increases as ID decreases, but this improvement largely stems from orthographic disambiguation. Notably, clear semantic differentiation emerges only in the final block, challenging the common assumption that semantic understanding is progressively constructed across depth.
Researcher Affiliation Academia School of Computer Engineering and Science, Shanghai University, Shanghai, China.
Pseudocode Yes Algorithm 1 Intrinsic Dimension Estimation Across Layers
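The layerwise ID analysis referenced above can be illustrated with a minimal sketch of a standard intrinsic-dimension estimator. The TwoNN estimator (Facco et al., 2017) shown here is an assumption for illustration; the paper's actual Algorithm 1 may differ in estimator and details.

```python
import numpy as np

def twonn_id(X):
    """TwoNN intrinsic-dimension estimate (MLE form).

    For each point, mu is the ratio of its second- to first-nearest-neighbor
    distance; the maximum-likelihood ID estimate is N / sum(log mu).
    """
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances (fine for small N; use a KD-tree at scale).
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    np.fill_diagonal(D, np.inf)          # exclude self-distances
    sorted_d = np.sort(D, axis=1)
    r1, r2 = sorted_d[:, 0], sorted_d[:, 1]
    mu = r2 / r1
    return len(X) / np.sum(np.log(mu))
```

In a layerwise analysis, this estimator would be applied to the activations extracted at each transformer block, yielding an ID profile across depth.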
Open Source Code Yes The code is available at: https://github.com/Ovsia/Textural-or-Textual
Open Datasets Yes We propose the ToT (Textural or Textual) dataset, derived from ImageNet-1k, which features 100 categories of common objects overlaid with texts of varying semantics. ... We perform cross-dataset evaluations using the respective test sets provided by each method. ... publicly available typographic attack datasets: Disentangle (Materzyńska et al., 2022), PAINT (Ilharco et al., 2022), and Prefix (Azuma & Matsui, 2023), which all feature handwritten text overlaid on notepads.
Dataset Splits Yes For each pair in subset 2 of the ToT dataset, we use 320 image samples for training and 80 for testing.
Hardware Specification Yes All of our experiments are conducted on a GeForce RTX 3090 GPU.
Software Dependencies No The paper does not explicitly list specific software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions).
Experiment Setup Yes We use a batch size of 512 and a learning rate of 1 × 10⁻⁴, with a weight decay of 0.2. The Adam optimizer is employed for fine-tuning.
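The reported hyperparameters map directly onto a PyTorch optimizer configuration. This is a sketch only: the `Linear` module stands in for the paper's actual model, which is not specified here.

```python
import torch

# Placeholder module standing in for the fine-tuned model (assumption).
model = torch.nn.Linear(512, 512)

# Hyperparameters as reported: lr = 1e-4, weight decay = 0.2, Adam optimizer.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=0.2)
```

Batch size 512 would be set on the data loader rather than the optimizer.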