Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Evaluating Compositional Scene Understanding in Multimodal Generative Models

Authors: Shuhao Fu, Andrew Jun Lee, Yixin Anna Wang, Ida Momennejad, Trevor Bihl, Hongjing Lu, Taylor Whittington Webb

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we present an evaluation of the compositional visual processing capabilities in the current generation of text-to-image (DALL-E 3) and multimodal vision-language models (GPT-4V, GPT-4o, Claude Sonnet 3.5, QWEN2-VL-72B, and InternVL2.5-38B), and compare the performance of these systems to human participants.
Researcher Affiliation | Collaboration | Shuhao Fu*, Andrew Jun Lee*, Anna Wang (Department of Psychology, University of California, Los Angeles); Ida Momennejad (Microsoft Research, NYC); Trevor Bihl (Air Force Research Laboratory); Hongjing Lu (Department of Psychology, Department of Statistics, University of California, Los Angeles); Taylor Webb (Microsoft Research, NYC). The affiliations include academic institutions (University of California, Los Angeles) and industry/government research labs (Microsoft Research, Air Force Research Laboratory), indicating a collaboration.
Pseudocode | No | No pseudocode or algorithm blocks are explicitly labeled or presented in a structured format in the paper.
Open Source Code | Yes | All experimental materials, code, and data are available at: https://github.com/andrewjlee0/evaluating_compositionality_VLMs
Open Datasets | Yes | We first evaluated relational concept learning with real-world scenes, using the Bongard-HOI dataset (Jiang et al., 2022).
Dataset Splits | Yes | For Bongard-HOI: In the standard evaluation methodology, 6 labeled images from each class are presented (12 total), and the remaining positive and negative images are presented for classification. [...] evaluation was performed by presenting 9 labeled example images (randomly selecting either 5 positive examples and 4 negative examples, or vice versa) followed by a single query image. For SVRT: For each problem, we presented 1-9 few-shot examples, consisting of a mixture of positive and negative instances (in random order).
Hardware Specification | Yes | two open-source VLMs: QWEN2-VL-72B and InternVL2.5-38B (these were evaluated locally on a workstation with 4 NVIDIA GPUs).
Software Dependencies | Yes | Images were generated by prompting DALL-E 3 (Betker et al., 2023), a text-to-image model developed by OpenAI, through the Microsoft Azure API (version 2024-02-01 for all experiments). To systematically assess the impact of prompt likelihood on the validity of the generated images, we measured the plausibility of each text prompt using the GPT-3 language model from OpenAI (the davinci-002-1 engine, available through the Microsoft Azure API).
Experiment Setup | Yes | We generated 10 images for each prompt, with the following hyperparameters: quality set to standard, style set to natural, and image size set to 1024 × 1024. Temperature was set to 0 when evaluating both models, top-p was set to 1, and the detail parameter was set to high.
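
The Bongard-HOI few-shot protocol quoted under Dataset Splits (9 labeled examples, split 5/4 or 4/5 between positive and negative, followed by a single query image) can be illustrated with a minimal Python sketch. This is not the authors' code; their implementation is in the repository linked above. The function name `build_episode`, the file names, the shuffling of example order, and the choice to draw the query from the images not used as examples are all assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple

LabeledImage = Tuple[str, str]  # (image identifier, "positive" or "negative")


def build_episode(
    positives: Sequence[str],
    negatives: Sequence[str],
    rng: random.Random,
) -> Tuple[List[LabeledImage], LabeledImage]:
    """Return (nine labeled examples, one labeled query image)."""
    # Randomly pick 5 positive + 4 negative examples, or vice versa.
    n_pos, n_neg = rng.choice([(5, 4), (4, 5)])
    if len(positives) < n_pos or len(negatives) < n_neg:
        raise ValueError("not enough images to draw nine labeled examples")

    pos = rng.sample(list(positives), n_pos)
    neg = rng.sample(list(negatives), n_neg)
    examples = [(p, "positive") for p in pos] + [(n, "negative") for n in neg]
    rng.shuffle(examples)  # assumption: example order is randomized

    # Assumption: the query is one of the images not shown as an example.
    used = set(pos) | set(neg)
    remaining = [(p, "positive") for p in positives if p not in used]
    remaining += [(n, "negative") for n in negatives if n not in used]
    if not remaining:
        raise ValueError("no image left to use as the query")
    query = rng.choice(remaining)
    return examples, query


if __name__ == "__main__":
    rng = random.Random(0)
    pos_images = [f"pos_{i}.jpg" for i in range(7)]  # hypothetical file names
    neg_images = [f"neg_{i}.jpg" for i in range(7)]
    examples, query = build_episode(pos_images, neg_images, rng)
    print(len(examples), "examples; query:", query)
```

Passing a seeded `random.Random` instance keeps each episode reproducible, which is useful when the same prompts need to be replayed across several models.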