Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Cross-modal Associations in Vision and Language Models: Revisiting the Bouba-Kiki Effect

Authors: Tom Kouwenhoven, Kiana Shahrasbi, Tessa Verhoef

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We apply two complementary methods closely modelled after human experiments: a prompt-based evaluation that uses probabilities as a measure of model preference, and we use Grad-CAM as a novel approach to interpret visual attention in shape-word matching tasks. Our findings show that these model variants do not consistently exhibit the bouba-kiki effect. Moreover, direct comparison with prior human data on the same task shows that the models' responses fall markedly short of the robust, modality-integrated behaviour characteristic of human cognition.
Researcher Affiliation	Academia	Tom Kouwenhoven, Kiana Shahrasbi, Tessa Verhoef; Leiden Institute of Advanced Computer Science, Leiden University, The Netherlands
Pseudocode	No	The paper describes the methods and analyses used in paragraph form and through figures, but it does not contain a dedicated pseudocode block or algorithm.
Open Source Code	No	Our source code and the data will be published on OSF upon publication.
Open Datasets	Yes	The source data and code are available at https://osf.io/gqsv6/
Dataset Splits	No	The paper evaluates pre-trained CLIP models on various linguistic and visual inputs. It describes the generation of some images and the sourcing of others, as well as the construction of pseudowords. However, it does not specify any training, testing, or validation splits for these inputs, as the models being evaluated are already trained.
Hardware Specification	Yes	Our experiments do not need heavy computing since we do not train new models or fine-tune anything. All the code and analyses ran on a single MacBook Air M1.
Software Dependencies	Yes	All our analyses use Bayesian Regression Models as implemented in the brms package (Bürkner, 2021) in R (R Core Team, 2024).
Experiment Setup	Yes	We fit models (using 4 chains of 4000 iterations and a warm-up of 2000) to predict the proportion of correct guesses given a Word_type with fixed effects for model, prompt, image pair, or label pair. The exact model formulas are displayed under each figure.
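
As a rough illustration of the sampler settings quoted in the Experiment Setup row, the Python sketch below computes how many posterior draws those settings would retain. It assumes that the 2000 warm-up iterations are counted within the 4000 iterations per chain, which is how the brms `iter` argument is defined; the entry itself does not state this. `SamplerSettings` and its methods are hypothetical names used only for this example and are not part of the paper's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplerSettings:
    """MCMC settings as quoted in the Experiment Setup row (hypothetical helper)."""

    chains: int = 4          # reported: 4 chains
    iterations: int = 4000   # reported: 4000 iterations per chain
    warmup: int = 2000       # reported: warm-up of 2000

    def draws_per_chain(self) -> int:
        # Assumption: warm-up iterations are included in `iterations`
        # and are discarded before inference.
        if not 0 <= self.warmup < self.iterations:
            raise ValueError("warmup must be non-negative and smaller than iterations")
        return self.iterations - self.warmup

    def total_draws(self) -> int:
        # Post-warm-up draws pooled across all chains.
        return self.chains * self.draws_per_chain()


if __name__ == "__main__":
    settings = SamplerSettings()
    print("draws per chain:", settings.draws_per_chain())
    print("total post-warm-up draws:", settings.total_draws())
```

Under that assumption, each chain keeps 2000 draws and the four chains together provide 8000 post-warm-up draws for every fitted model.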