Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
V-PROM: A Benchmark for Visual Reasoning Using Visual Progressive Matrices
Authors: Damien Teney, Peng Wang, Jiewei Cao, Lingqiao Liu, Chunhua Shen, Anton van den Hengel
Pages: 12071-12078
AAAI 2020 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate a range of deep learning architectures, and find that existing models, including those popular for vision-and-language tasks, are unable to solve seemingly-simple instances. Models using relational networks fare better but leave substantial room for improvement. |
| Researcher Affiliation | Academia | (1) Australian Institute for Machine Learning, The University of Adelaide, Adelaide, Australia; (2) University of Wollongong, Wollongong, Australia |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. It describes models like MLP, GRU, VQA-like architecture, and Relation Networks using mathematical formulas and text, but not in pseudocode format. |
| Open Source Code | Yes | The dataset will be publicly released to encourage the development of models with improved capabilities for abstract reasoning over visual data. |
| Open Datasets | Yes | The annotations of object counts are extracted from numbers 1-10 appearing in natural language descriptions (e.g. "five bowls of oatmeal"), manually excluding those unrelated to counts (e.g. "five o'clock" or "a 10 years old boy"). |
| Dataset Splits | Yes | We held out 8,000 instances from the training set to serve as a validation set, to select the hyperparameters and to monitor for convergence and early-stopping. |
| Hardware Specification | No | No specific hardware details (e.g., GPU models, CPU types, memory) used for running the experiments were provided. |
| Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions) were mentioned in the paper. |
| Experiment Setup | Yes | Suitable hyperparameters for each model were coarsely selected by grid search (details in supplementary material). We held out 8,000 instances from the training set to serve as a validation set, to select the hyperparameters and to monitor for convergence and early-stopping. Unless noted, the nonlinear transformations within the networks below refer to a linear layer followed by a ReLU. ... trained with a softmax cross-entropy loss over ŝ, standard backpropagation and SGD, using AdaDelta as the optimizer. |
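
The validation hold-out and the loss quoted in the Experiment Setup row can be illustrated with a minimal sketch. This is not the authors' code: the function names, the random (seeded) selection of held-out instances, and the training-set size of 100,000 are assumptions made for illustration only. The paper excerpt states only that 8,000 training instances were held out and that a softmax cross-entropy loss over the candidate scores ŝ was used.

```python
import math
import random


def hold_out_validation(num_train, num_val=8000, seed=0):
    """Split indices 0..num_train-1 into (train_idx, val_idx).

    Random selection with a fixed seed is an assumption; the quoted text
    does not say how the 8,000 validation instances were chosen.
    """
    if num_val >= num_train:
        raise ValueError("validation set must be smaller than the training set")
    indices = list(range(num_train))
    random.Random(seed).shuffle(indices)
    return indices[num_val:], indices[:num_val]


def softmax_cross_entropy(scores, target):
    """Cross-entropy of softmax(scores) against the index of the correct answer.

    Uses the log-sum-exp trick (subtracting the max score) for numerical stability.
    """
    m = max(scores)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_norm - scores[target]


# Hypothetical training-set size, chosen only to make the example concrete.
train_idx, val_idx = hold_out_validation(num_train=100_000)

# Loss for one instance with four hypothetical candidate scores.
loss = softmax_cross_entropy([2.0, 0.5, -1.0, 0.1], target=0)
```

Fixing the seed keeps the split identical across runs, which matters when the validation set is used both for hyperparameter selection and for early stopping, as the quoted setup describes.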