Measuring CLEVRness: Black-box Testing of Visual Reasoning Models
Authors: Spyridon Mouselinos, Henryk Michalewski, Mateusz Malinowski
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that CLEVR models, which otherwise could perform at a human level, can easily be fooled by our agent. Our results put in doubt whether data-driven approaches can do reasoning without exploiting the numerous biases that are often present in those datasets. Finally, we also propose a controlled experiment measuring the efficiency of such models to learn and perform reasoning. |
| Researcher Affiliation | Collaboration | Spyridon Mouselinos (University of Warsaw; Warsaw, Poland; s.mouselinos@uw.edu.pl); Henryk Michalewski (University of Warsaw, Google; Oxford, U.K.; henrykm@google.com); Mateusz Malinowski (DeepMind; London, U.K.; mateuszm@deepmind.com) |
| Pseudocode | Yes | A.5 ALGORITHMS: We show pseudo-algorithms that we use to (Algorithm 1) calculate rewards, (Algorithm 2) train the Adversarial Player, and (Algorithm 3) play a game. |
| Open Source Code | No | Table 3 shows the URLs to models used in our investigations (also Table 1 in the main paper). We also report if we re-trained a model from scratch (type Architecture) or used already trained models (type Model). Please note that the latter type proves that our testing procedure is fully black-box. |
| Open Datasets | Yes | CLEVR is a synthetic visual question answering dataset introduced by Johnson et al. (2017a), which consists of about 700k training and 150k validation image-question-answer triplets. |
| Dataset Splits | Yes | CLEVR is a synthetic visual question answering dataset introduced by Johnson et al. (2017a), which consists of about 700k training and 150k validation image-question-answer triplets. |
| Hardware Specification | No | All experiments were performed using the Entropy cluster funded by NVIDIA, Intel, the Polish National Science Center grant UMO-2017/26/E/ST6/00622 and ERC Starting Grant TOTAL. |
| Software Dependencies | Yes | For the generation of new images/scenes we use the open-source Blender Graphics Engine (v2.79b), and the original 3D models of the CLEVR dataset. |
| Experiment Setup | Yes | We discretize the scene, where each axis has values in [-3, 3], onto N = 7 bins per axis. [...] We use the following values: dr = 1, cr = 0.1, fr = 0.1, isr = 0.8. [...] To train Adversarial Player we use the A2C algorithm with the episode length set to one... We experiment with the following Mini-game sizes: 10, 100, 1000. |
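
The discretization quoted in the Experiment Setup row can be illustrated with a minimal sketch. It assumes one possible reading of the quoted sentence: N = 7 evenly spaced grid values per axis that include both endpoints of [-3, 3]. The function names `grid_values`, `to_bin`, and `from_bin` are hypothetical and do not come from the paper or its code.

```python
import math

# Assumed values taken from the quoted setup: axis range [-3, 3], N = 7 bins.
LOW, HIGH, N_BINS = -3.0, 3.0, 7


def grid_values(low=LOW, high=HIGH, n_bins=N_BINS):
    """Return the n_bins evenly spaced coordinates spanning [low, high]."""
    step = (high - low) / (n_bins - 1)
    return [low + i * step for i in range(n_bins)]


def to_bin(x, low=LOW, high=HIGH, n_bins=N_BINS):
    """Map a continuous coordinate to the index of the nearest grid value."""
    x = min(max(x, low), high)  # clamp coordinates that fall outside the scene
    step = (high - low) / (n_bins - 1)
    # floor(v + 0.5) rounds halves upward, avoiding Python's round-half-to-even
    return int(math.floor((x - low) / step + 0.5))


def from_bin(index, low=LOW, high=HIGH, n_bins=N_BINS):
    """Map a bin index back to its continuous coordinate."""
    step = (high - low) / (n_bins - 1)
    return low + index * step


if __name__ == "__main__":
    print(grid_values())
    print(to_bin(0.4), from_bin(to_bin(0.4)))
```

Under this assumption the grid values coincide with the integers from -3 to 3. An equal-width binning with bin centers would be an equally plausible reading of the quoted sentence and would give different coordinates.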