Bongard-LOGO: A New Benchmark for Human-Level Concept Learning and Reasoning

Authors: Weili Nie, Zhiding Yu, Lei Mao, Ankit B. Patel, Yuke Zhu, Anima Anandkumar

NeurIPS 2020

Reproducibility Assessment

Research Type: Experimental
LLM Response: "In experiments, we show that the state-of-the-art deep learning methods perform substantially worse than human subjects, implying that they fail to capture core human cognition properties. We split the 12,000 problems in BONGARD-LOGO into the disjoint train/validation/test sets, consisting of 9300, 900, and 1800 problems respectively. We report the test accuracy (Acc) of different methods on each of the four test sets respectively, and compare the results to the human performance in Table 1."

Researcher Affiliation: Collaboration
LLM Response: Weili Nie (Rice University, wn8@rice.edu); Zhiding Yu (NVIDIA, zhidingy@nvidia.com); Lei Mao (NVIDIA, lmao@nvidia.com); Ankit B. Patel (Rice University and Baylor College of Medicine, abp4@rice.edu); Yuke Zhu (UT Austin and NVIDIA, yukez@cs.utexas.edu); Animashree Anandkumar (Caltech and NVIDIA, anima@caltech.edu)

Pseudocode: No
LLM Response: The paper does not contain any pseudocode or clearly labeled algorithm blocks.

Open Source Code: Yes
LLM Response: "We have open-sourced the procedural generation code and data of BONGARD-LOGO in the following GitHub repository: https://github.com/NVlabs/Bongard-LOGO."

Open Datasets: Yes
LLM Response: "We developed the BONGARD-LOGO benchmark that shares the same purposes as the original BPs for human-level visual concept learning and reasoning. Meanwhile, it contains a large quantity of 12,000 problems and transforms concept learning into a few-shot binary classification problem." The data is open-sourced in the GitHub repository cited above.

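For concreteness, here is a minimal sketch of how a single BONGARD-LOGO problem could be represented as a few-shot binary classification episode. The 7-positive/7-negative layout with one held-out query image per side follows the paper's description of the benchmark; the class name, field names, and episode() helper are illustrative assumptions, not the repository's actual API.

    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class BongardProblem:
        # Hypothetical container; the real repository may store problems differently.
        positives: List[str]  # paths to the 7 images that contain the concept
        negatives: List[str]  # paths to the 7 images that violate it

        def episode(self) -> Tuple[list, list]:
            # 6-shot support set per class, plus 1 held-out query image per class.
            support = ([(img, 1) for img in self.positives[:6]]
                       + [(img, 0) for img in self.negatives[:6]])
            query = [(self.positives[6], 1), (self.negatives[6], 0)]
            return support, query
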
Dataset Splits: Yes
LLM Response: "We split the 12,000 problems in BONGARD-LOGO into the disjoint train/validation/test sets, consisting of 9300, 900, and 1800 problems respectively."

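As a quick arithmetic check, the reported split sizes sum to the 12,000 total problems. The snippet below verifies this and sketches a hypothetical random partition of problem IDs; the paper's actual test split is constructed deliberately (it comprises four distinct test sets), so the random shuffle here is an assumption for illustration only.

    import random

    TRAIN, VAL, TEST = 9300, 900, 1800  # split sizes reported in the paper
    assert TRAIN + VAL + TEST == 12000  # matches the 12,000 total problems

    problem_ids = list(range(12000))
    random.seed(0)                       # hypothetical seed, not from the paper
    random.shuffle(problem_ids)          # the paper's test split is curated, not random
    train_ids = problem_ids[:TRAIN]
    val_ids = problem_ids[TRAIN:TRAIN + VAL]
    test_ids = problem_ids[TRAIN + VAL:]
    assert len(test_ids) == TEST
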
Hardware Specification: No
LLM Response: The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running the experiments.

Software Dependencies: No
LLM Response: The paper does not specify software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x).

Experiment Setup: No
LLM Response: The paper states: "We put the experiment setup for training these methods to Appendix C." The detailed training setup is therefore deferred to the appendix rather than given in the main text.