Measuring CLEVRness: Black-box Testing of Visual Reasoning Models

Authors: Spyridon Mouselinos, Henryk Michalewski, Mateusz Malinowski

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that CLEVR models, which otherwise could perform at a human level, can easily be fooled by our agent. Our results put in doubt whether data-driven approaches can do reasoning without exploiting the numerous biases that are often present in those datasets. Finally, we also propose a controlled experiment measuring the efficiency of such models to learn and perform reasoning.
Researcher Affiliation | Collaboration | Spyridon Mouselinos, University of Warsaw, Warsaw, Poland, s.mouselinos@uw.edu.pl; Henryk Michalewski, University of Warsaw, Google, Oxford, U.K., henrykm@google.com; Mateusz Malinowski, DeepMind, London, U.K., mateuszm@deepmind.com
Pseudocode | Yes | A.5 Algorithms: We show pseudo-algorithms that we use to (Algorithm 1) calculate rewards, (Algorithm 2) train Adversarial Player, and (Algorithm 3) play a game.
Open Source Code | No | Table 3 shows the URLs to models used in our investigations (also Table 1 in the main paper). We also report if we re-trained a model from scratch (type Architecture) or used already trained models (type Model). Please note that the latter type proves that our testing procedure is fully black-box.
Open Datasets | Yes | CLEVR is a synthetic visual question answering dataset introduced by Johnson et al. (2017a), which consists of about 700k training and 150k validation image-question-answer triplets.
Dataset Splits | Yes | CLEVR is a synthetic visual question answering dataset introduced by Johnson et al. (2017a), which consists of about 700k training and 150k validation image-question-answer triplets.
Hardware Specification | No | All experiments were performed using the Entropy cluster funded by NVIDIA, Intel, the Polish National Science Center grant UMO-2017/26/E/ST6/00622 and ERC Starting Grant TOTAL.
Software Dependencies | Yes | For the generation of new images/scenes we use the open-source Blender Graphics Engine (v2.79b), and the original 3D models of the CLEVR dataset.
Experiment Setup | Yes | We discretize the scene where each axis has values in [-3, 3] onto N = 7 bins per axis. [...] We use the following values: dr = 1, cr = 0.1, fr = 0.1, isr = 0.8. [...] To train Adversarial Player we use the A2C algorithm with the episode length set to one... We experiment with the following Mini-game sizes: 10, 100, 1000.
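
The discretization mentioned in the Experiment Setup row can be illustrated with a small sketch. It assumes equal-width bins over the closed interval [-3, 3], which the quoted text does not confirm; the paper may instead use grid points or another scheme. The function names `bin_index` and `bin_center` are hypothetical and do not come from the authors' code.

```python
# Illustrative sketch only: equal-width binning is an assumption,
# not the authors' confirmed implementation.
LOW, HIGH, N_BINS = -3.0, 3.0, 7  # axis range and bin count quoted in the setup


def bin_index(value, low=LOW, high=HIGH, n_bins=N_BINS):
    """Map a continuous coordinate to an equal-width bin index in 0..n_bins-1."""
    if not low <= value <= high:
        raise ValueError(f"value {value} lies outside [{low}, {high}]")
    width = (high - low) / n_bins
    # Clamp so that the upper boundary falls into the last bin.
    return min(int((value - low) / width), n_bins - 1)


def bin_center(index, low=LOW, high=HIGH, n_bins=N_BINS):
    """Return the continuous coordinate at the center of a bin."""
    if not 0 <= index < n_bins:
        raise ValueError(f"index {index} lies outside 0..{n_bins - 1}")
    width = (high - low) / n_bins
    return low + (index + 0.5) * width


def discretize_position(x, y):
    """Discretize a 2D scene position into a pair of bin indices."""
    return bin_index(x), bin_index(y)
```

Under this assumption, a position on a single axis becomes one of 7 discrete choices, so a 2D object placement has 7 × 7 = 49 possible cells.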