Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain
Authors: Arseny Moskvichev, Victor Vikram Odouard, Melanie Mitchell
TMLR 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our contributions in this paper are (1) the creation of a new concept-based evaluation benchmark for the ARC domain and (2) results from our studies using this benchmark to evaluate state-of-the-art programs that solve ARC problems, as well as human performance on this benchmark. |
| Researcher Affiliation | Academia | Arseny Moskvichev EMAIL Santa Fe Institute Victor Vikram Odouard EMAIL Santa Fe Institute Melanie Mitchell EMAIL Santa Fe Institute |
| Pseudocode | No | The paper describes methods used by other programs (e.g., heuristic search, genetic algorithm) but does not provide pseudocode or algorithm blocks for its own methodology for creating the benchmark or conducting the experiments. |
| Open Source Code | No | The paper provides a link to the ConceptARC tasks and results, which are data/benchmark instances, not the source code for the methodology (how the tasks were generated or the experimental evaluation framework) described in the paper. |
| Open Datasets | Yes | All ConceptARC tasks can be downloaded from https://github.com/victorvikram/ConceptARC. |
| Dataset Splits | No | The paper introduces a new benchmark (ConceptARC) which serves as a test set for evaluation. It describes the structure of this benchmark (16 concept groups, 10 tasks per group, 3 test inputs per task), and how human participants were exposed to tasks, but does not provide traditional training/validation/test splits for machine learning models within its own experimental framework. |
| Hardware Specification | No | The paper does not specify the hardware used to run the experiments, such as specific CPU or GPU models. It mentions using the API for GPT-4 but does not detail the local hardware used to interact with the API or run the other programs. |
| Software Dependencies | No | The paper mentions the 'psiTurk framework' and 'OpenAI's API' but does not provide specific version numbers for these or any other key software components used in the experiments. |
| Experiment Setup | Yes | To test this language-only version of GPT-4 on the tasks in ConceptARC, we used the API provided by OpenAI. In our experiments the model name was set to gpt-4, the temperature was set to 0 or 0.5, and other parameters were left at their default values. In the case where temperature was set to 0.5, we repeated each task prompt three times, and if at least one of the outputs was correct, we considered the task to be solved correctly. For human studies: Each participant was presented with a random selection of tasks (17 for most participants...), and participants were given three attempts to solve each test input. Exclusion criteria were also detailed. |
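The temperature-0.5 scoring rule described in the Experiment Setup row (repeat each task prompt three times; count the task as solved if at least one output is correct) can be sketched as follows. This is a minimal illustration, not code from the paper: `query_model` is a hypothetical stand-in for an OpenAI chat-completion call, and the attempt/temperature defaults mirror the values reported above.

```python
from typing import Callable

def task_solved(
    prompt: str,
    expected_output: str,
    query_model: Callable[[str, float], str],
    attempts: int = 3,        # three repeats per task, per the paper
    temperature: float = 0.5, # sampling temperature used in the repeated runs
) -> bool:
    """Return True if at least one of `attempts` sampled outputs is correct."""
    return any(
        query_model(prompt, temperature) == expected_output
        for _ in range(attempts)
    )

# Stub model for illustration: correct on the second of three calls.
responses = iter(["wrong", "right", "wrong"])
stub = lambda prompt, temperature: next(responses)
print(task_solved("task prompt", "right", stub))  # True
```

At temperature 0 the model is effectively deterministic, so a single attempt (`attempts=1`) suffices for that condition.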