Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Cross-modal Associations in Vision and Language Models: Revisiting the Bouba-Kiki Effect
Authors: Tom Kouwenhoven, Kiana Shahrasbi, Tessa Verhoef
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We apply two complementary methods closely modelled after human experiments: a prompt-based evaluation that uses probabilities as a measure of model preference, and we use Grad-CAM as a novel approach to interpret visual attention in shape-word matching tasks. Our findings show that these model variants do not consistently exhibit the bouba-kiki effect. Moreover, direct comparison with prior human data on the same task shows that the models' responses fall markedly short of the robust, modality-integrated behaviour characteristic of human cognition. |
| Researcher Affiliation | Academia | Tom Kouwenhoven, Kiana Shahrasbi, Tessa Verhoef; Leiden Institute of Advanced Computer Science, Leiden University, The Netherlands; EMAIL |
| Pseudocode | No | The paper describes the methods and analyses used in paragraph form and through figures, but it does not contain a dedicated pseudocode block or algorithm. |
| Open Source Code | No | Our source code and the data will be published on OSF upon publication. |
| Open Datasets | Yes | The source data and code are available at https://osf.io/gqsv6/ |
| Dataset Splits | No | The paper evaluates pre-trained CLIP models on various linguistic and visual inputs. It describes the generation of some images and the sourcing of others, as well as the construction of pseudowords. However, it does not specify any training, testing, or validation splits for these inputs, as the models being evaluated are already trained. |
| Hardware Specification | Yes | Our experiments do not need heavy computing since we do not train new models or fine-tune anything. All the code and analyses ran on a single MacBook Air M1. |
| Software Dependencies | Yes | All our analyses use Bayesian Regression Models as implemented in the brms package (Bürkner, 2021) in R (R Core Team, 2024). |
| Experiment Setup | Yes | We fit models (using 4 chains of 4000 iterations and a warm-up of 2000) to predict the proportion of correct guesses given a Word_type with fixed effects for model, prompt, image pair, or label pair. The exact model formulas are displayed under each figure. |
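
The sampler settings quoted in the Experiment Setup row can be made concrete with a little arithmetic. The sketch below is illustrative only and is not the authors' code: it assumes that the 4000 iterations per chain include the 2000 warm-up iterations (the usual convention in brms, but not stated explicitly in the quoted text), and the class name `SamplerSettings` and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplerSettings:
    """Hypothetical container for the MCMC settings quoted in the table."""

    chains: int = 4         # number of chains, as quoted
    iterations: int = 4000  # iterations per chain (assumed to include warm-up)
    warmup: int = 2000      # warm-up iterations per chain, as quoted

    def post_warmup_draws(self) -> int:
        """Return the total number of draws kept across all chains."""
        if self.warmup >= self.iterations:
            raise ValueError("warm-up must be shorter than the total iterations")
        return self.chains * (self.iterations - self.warmup)


settings = SamplerSettings()
total_draws = settings.post_warmup_draws()
print(f"Post-warm-up draws across all chains: {total_draws}")
```

Under that assumption, each fitted model would rest on 4 × (4000 − 2000) = 8000 posterior draws. If the 4000 iterations were instead counted after warm-up, the total would be 16 000.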