Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It

Authors: Yulu Qin, Dheeraj Varghese, Adam Dahlgren Lindström, Lucia Donatelli, Kanishka Misra, Najoung Kim

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Through comparing minimal pairs of text-only LMs and their VL-trained counterparts, we first show that the VL models often outperform their text-only counterparts on a text-only question-answering task that requires taxonomic understanding of concepts mentioned in the questions. Using an array of targeted behavioral and representational analyses, we show that the LMs and VLMs do not differ significantly in terms of their taxonomic knowledge itself, but they differ in how they represent questions that contain concepts in a taxonomic relation vs. a non-taxonomic relation.
Researcher Affiliation	Academia	Yulu Qin (1), Dheeraj Varghese (2), Adam Dahlgren Lindström (3), Lucia Donatelli (4), Kanishka Misra (5, 6), and Najoung Kim (1). Affiliations: (1) Boston University, (2) University of Amsterdam, (3) Umeå University, (4) Vrije Universiteit Amsterdam, (5) TTIC, (6) The University of Texas at Austin
Pseudocode	No	The paper only describes methods in natural language and illustrates a pipeline with a figure, but does not present any formal pseudocode or algorithm blocks.
Open Source Code	Yes	Code can be found at https://github.com/tinlaboratory/taxonomigqa
Open Datasets	Yes	To this end, we develop TaxonomiGQA, a synthetically augmented text-only version of the popular visual-question answering (VQA) dataset GQA [19], where a subset of WordNet [40] hierarchy is used to create questions that require taxonomic knowledge. ... The original GQA dataset was released under CC BY 4.0 and we downloaded the dataset from https://cs.stanford.edu/people/dorarad/gqa/download.html. We follow this and release TaxonomiGQA under the same license, CC BY 4.0. ... We used images from THINGS [15], a dataset with 26,107 high-quality, manually curated object-centric images of 1,854 diverse object concepts.
Dataset Splits	Yes	We applied a multi-stage filtering process to the validation split of GQA (10,696 images/scenes and 488,293 questions) to obtain our base questions. ... The final dataset contains 1,342 unique images/scenes, 29,604 positive sample instances (9,334 targeting leaf node concepts, 20,270 targeting hypernym-substitutions), and 4 negative samples for each positive sample, amounting to 148,020 total instances.
Hardware Specification	Yes	Vision tasks were processed on a single NVIDIA A40 GPU (48GB) over 3 hours, while text-only tasks were run on two NVIDIA L40 GPUs (48GB each) for approximately 1.5 hours. ... Static embeddings were computed in under 10 minutes on an L40 GPU. TAXOMPS, RSA on unembedding layer vectors, contextualized representational similarity analysis, and PCA analysis were conducted on a single NVIDIA RTX6000 Ada (48GB) GPU, and took a total of 1 hour, 1.5 hours, 4 hours, and 1 hour, respectively.
Software Dependencies	No	Model inference was conducted using vLLM [27]. Representation extraction and TAXOMPS behavioral analyses were performed using the minicons library [41]. All plots were produced using the ggplot2 library in the R programming language.
Experiment Setup	Yes	Since TaxonomiGQA consists of Yes/No questions, we sampled from a constrained probability distribution of Yes and No tokens from the model's output vocabulary, allowing for surface form variation such as casing and space-prefixing. ... We selected seven LM-VLM model pairs, where the LM has been reported to be the base model that the VLM has been trained on top of, following the approach of [24]. The selected pairs are: (1) Llama-3.1-8B vs. MLlama-3.2-11B [12]; (2) their instruct versions; (3) Vicuna vs. Llava-1.5-7B [33]; (4) Mistral-v0.2-I [22] vs. Llava-Next [34]; (5) Qwen2-7B [70] and Molmo-7B-D [8]; (6) Qwen2-7B-Instruct vs. Llava-OneVision [29]; and (7) Qwen2.5-7B-Instruct [71] vs. Qwen2.5-7B-VL-Instruct [4].
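
The Experiment Setup row describes sampling from a probability distribution constrained to Yes and No tokens, allowing for casing and space-prefixed variants. The following is a minimal sketch of that idea, not the authors' implementation: the surface-form lists, the function names, and the token-to-logit dictionary input are all assumptions made for illustration, since the excerpt does not list the exact token set or the inference code.

```python
import math
import random

# Hypothetical surface forms (assumption): the excerpt mentions casing and
# space-prefixing but does not enumerate the exact tokens used.
YES_FORMS = ("Yes", "yes", " Yes", " yes")
NO_FORMS = ("No", "no", " No", " no")


def yes_probability(logits):
    """Renormalize next-token logits over Yes/No surface forms only.

    `logits` maps token strings to raw logits (an assumed input format).
    Returns P(Yes) under the constrained two-way distribution.
    """
    yes_vals = [logits[t] for t in YES_FORMS if t in logits]
    no_vals = [logits[t] for t in NO_FORMS if t in logits]
    if not yes_vals and not no_vals:
        raise ValueError("no Yes/No surface form found in logits")
    # Subtract the max logit before exponentiating for numerical stability.
    shift = max(yes_vals + no_vals)
    yes_mass = sum(math.exp(v - shift) for v in yes_vals)
    no_mass = sum(math.exp(v - shift) for v in no_vals)
    return yes_mass / (yes_mass + no_mass)


def sample_answer(logits, rng):
    """Sample "Yes" or "No" from the constrained distribution."""
    return "Yes" if rng.random() < yes_probability(logits) else "No"


if __name__ == "__main__":
    # Toy logits: the high-scoring token "the" is ignored by the constraint.
    toy_logits = {" yes": 2.0, "No": 0.0, "the": 10.0}
    print(round(yes_probability(toy_logits), 4))
    print(sample_answer(toy_logits, random.Random(0)))
```

Summing the probability mass of all surface forms before renormalizing means that a model spreading its preference across "Yes" and " yes" is not penalized relative to one that concentrates on a single form.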