Kiki or Bouba? Sound Symbolism in Vision-and-Language Models

Authors: Morris Alper, Hadar Averbuch-Elor

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To test for sound symbolism in VLMs, we use zero-shot knowledge probing, which allows for evaluating the models' inherent knowledge, i.e., not new knowledge acquired during further training. We leverage the ability of CLIP to embed text and image data in a shared semantic space in order to probe discriminative and generative (text-to-image) models with the same evaluation metrics. Our tests evaluate whether these models encode pseudowords similarly to humans with respect to known symbolic associations, comparing them to adjectives indicating properties related to sharpness and roundness. To further ground our results in human cognition, we also conduct a user study testing the ability of subjects to reconstruct pseudowords used to condition text-to-image generations. Our results demonstrate that sound symbolism can indeed be observed in VLMs; the models under…
Researcher Affiliation | Academia | Morris Alper and Hadar Averbuch-Elor, Tel Aviv University
Pseudocode | No | No pseudocode or algorithm blocks are present in the paper.
Open Source Code | Yes | Our code will be made publicly available via our project page: https://kiki-bouba.github.io/
Open Datasets | Yes | Further strengthening these findings, we provide strong evidence in the supplementary material that these models have not learned specifically from items illustrating the kiki/bouba effect, by showing that this concept is not well-represented in the LAION dataset [54] upon which our VLMs were trained.
Dataset Splits | No | The paper evaluates pre-trained models in a zero-shot setting and conducts a user study. It describes the data used for evaluation (pseudowords, adjectives, human responses), but it does not specify traditional train/validation/test splits, since it probes pre-trained models rather than training a new model from scratch.
Hardware Specification | No | No specific hardware components (e.g., GPU/CPU models, memory) are mentioned.
Software Dependencies | No | The paper mentions using…
Experiment Setup | Yes | We use these models as-is and probe them in the zero-shot regime, without any further training or adjustment of their tokenizer (which is identical for both models). Prompts used: we use the following prompts to probe the models under consideration, where w is the item (word or pseudoword) to be inserted into the prompt: P1: "a 3D rendering of a w object"; P2: "a 3D rendering of a w shaped object". We manually select 20 ground-truth adjectives, split evenly between those with sharp or round associations, which we denote by A_sharp and A_round respectively. (A minimal probing sketch based on this setup follows the table.)
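
The Research Type and Experiment Setup rows above describe a zero-shot probe: insert a pseudoword into the prompts P1/P2, embed the resulting text with CLIP, and compare it against the same prompts filled with sharp- and round-associated adjectives in the shared embedding space. The sketch below illustrates that idea only; it is not the authors' released code. It assumes the Hugging Face transformers CLIP API with the openai/clip-vit-base-patch32 checkpoint, and the adjective lists and the similarity-difference score are illustrative stand-ins for the paper's 20 ground-truth adjectives and its actual evaluation metrics.

    # Illustrative zero-shot probing sketch (not the authors' code).
    # Scores a pseudoword by how much closer its CLIP text embedding is to
    # sharp-adjective prompts than to round-adjective prompts, using the
    # prompt templates P1/P2 quoted above.
    import torch
    from transformers import CLIPModel, CLIPTokenizer

    MODEL_NAME = "openai/clip-vit-base-patch32"  # assumed checkpoint, not necessarily the one probed in the paper
    model = CLIPModel.from_pretrained(MODEL_NAME)
    tokenizer = CLIPTokenizer.from_pretrained(MODEL_NAME)

    PROMPTS = [
        "a 3D rendering of a {w} object",         # P1
        "a 3D rendering of a {w} shaped object",  # P2
    ]
    SHARP_ADJS = ["sharp", "spiky", "angular"]    # stand-ins for the 10 sharp adjectives
    ROUND_ADJS = ["round", "smooth", "curvy"]     # stand-ins for the 10 round adjectives

    @torch.no_grad()
    def embed(texts):
        """Return L2-normalized CLIP text embeddings for a list of strings."""
        inputs = tokenizer(texts, padding=True, return_tensors="pt")
        feats = model.get_text_features(**inputs)
        return feats / feats.norm(dim=-1, keepdim=True)

    def sharpness_score(pseudoword: str) -> float:
        """Positive -> closer to sharp adjectives; negative -> closer to round ones."""
        word_emb = embed([p.format(w=pseudoword) for p in PROMPTS]).mean(dim=0)
        sharp_emb = embed([p.format(w=a) for p in PROMPTS for a in SHARP_ADJS]).mean(dim=0)
        round_emb = embed([p.format(w=a) for p in PROMPTS for a in ROUND_ADJS]).mean(dim=0)
        return (word_emb @ sharp_emb - word_emb @ round_emb).item()

    if __name__ == "__main__":
        for w in ["kiki", "bouba"]:
            print(w, sharpness_score(w))

In this sketch, a positive score means the pseudoword's prompt embeddings sit closer to the sharp-adjective prompts than to the round-adjective ones. The paper's discriminative and generative probes rely on the same shared text-image embedding space, but their precise scoring and adjective sets are defined in the paper itself.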