Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Measuring CLEVRness: Black-box Testing of Visual Reasoning Models
Authors: Spyridon Mouselinos, Henryk Michalewski, Mateusz Malinowski
ICLR 2022 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that CLEVR models, which otherwise could perform at a human level, can easily be fooled by our agent. Our results put in doubt whether data-driven approaches can do reasoning without exploiting the numerous biases that are often present in those datasets. Finally, we also propose a controlled experiment measuring the efficiency of such models to learn and perform reasoning. |
| Researcher Affiliation | Collaboration | Spyridon Mouselinos, University of Warsaw, Warsaw, Poland, EMAIL; Henryk Michalewski, University of Warsaw, Google, Oxford, U.K., EMAIL; Mateusz Malinowski, DeepMind, London, U.K., EMAIL |
| Pseudocode | Yes | A.5 ALGORITHMS We show pseudo-algorithms that we use to (Algorithm 1) calculate rewards, (Algorithm 2) train Adversarial Player, (Algorithm 3) and play a game. |
| Open Source Code | No | Table 3 shows the URLs to models used in our investigations (also Table 1 in the main paper). We also report if we re-trained a model from scratch (type Architecture) or used already trained models (type Model). Please note that the latter type proves that our testing procedure is fully black-box. |
| Open Datasets | Yes | CLEVR is a synthetic visual question answering dataset introduced by Johnson et al. (2017a), which consists of about 700k training and 150k validation image-question-answer triplets. |
| Dataset Splits | Yes | CLEVR is a synthetic visual question answering dataset introduced by Johnson et al. (2017a), which consists of about 700k training and 150k validation image-question-answer triplets. |
| Hardware Specification | No | All experiments were performed using the Entropy cluster funded by NVIDIA, Intel, the Polish National Science Center grant UMO-2017/26/E/ST6/00622 and ERC Starting Grant TOTAL. |
| Software Dependencies | Yes | For the generation of new images/scenes we use the open-source Blender Graphics Engine (v2.79b), and the original 3D models of the CLEVR dataset. |
| Experiment Setup | Yes | We discretize the scene where each axis has values in [-3, 3] onto N = 7 bins per axis. [...] We use the following values: dr = 1, cr = 0.1, fr = 0.1, isr = 0.8. [...] To train Adversarial Player we use the A2C algorithm with the episode length set to one... We experiment with the following Mini-game sizes: 10, 100, 1000. |
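
The discretization quoted in the Experiment Setup row can be illustrated with a minimal sketch. It assumes each axis spans -3 to 3 and that the N = 7 bins are represented by evenly spaced centers with both endpoints included, which gives a spacing of one unit. The function names `to_bin` and `to_coord` are hypothetical and are not taken from the paper or its code.

```python
# Illustrative sketch only: the paper's exact binning scheme is not quoted above.
LOW, HIGH, N_BINS = -3.0, 3.0, 7

# Assumption: bin centers are evenly spaced and include both endpoints.
STEP = (HIGH - LOW) / (N_BINS - 1)


def to_bin(coord: float) -> int:
    """Map a continuous scene coordinate to the index of the nearest bin center."""
    clipped = min(max(coord, LOW), HIGH)  # keep out-of-range values inside the scene
    return round((clipped - LOW) / STEP)


def to_coord(index: int) -> float:
    """Map a bin index back to the coordinate of its bin center."""
    if not 0 <= index < N_BINS:
        raise ValueError(f"bin index must be in [0, {N_BINS - 1}], got {index}")
    return LOW + index * STEP
```

Under these assumptions a discrete action over object positions only needs to pick one of seven indices per axis, and `to_coord` recovers the position that would be handed to the renderer.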