Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain
Authors: Arseny Moskvichev, Victor Vikram Odouard, Melanie Mitchell
TMLR 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our contributions in this paper are (1) the creation of a new concept-based evaluation benchmark for the ARC domain and (2) results from our studies using this benchmark to evaluate state-of-the-art programs that solve ARC problems, as well as human performance on this benchmark. |
| Researcher Affiliation | Academia | Arseny Moskvichev EMAIL Santa Fe Institute Victor Vikram Odouard EMAIL Santa Fe Institute Melanie Mitchell EMAIL Santa Fe Institute |
| Pseudocode | No | The paper describes methods used by other programs (e.g., heuristic search, genetic algorithm) but does not provide pseudocode or algorithm blocks for its own methodology for creating the benchmark or conducting the experiments. |
| Open Source Code | No | The paper provides a link to the ConceptARC tasks and results, which are data/benchmark instances, not the source code for the methodology (how the tasks were generated or the experimental evaluation framework) described in the paper. |
| Open Datasets | Yes | All ConceptARC tasks can be downloaded from https://github.com/victorvikram/ConceptARC. |
| Dataset Splits | No | The paper introduces a new benchmark (ConceptARC) which serves as a test set for evaluation. It describes the structure of this benchmark (16 concept groups, 10 tasks per group, 3 test inputs per task), and how human participants were exposed to tasks, but does not provide traditional training/validation/test splits for machine learning models within its own experimental framework. |
| Hardware Specification | No | The paper does not specify the hardware used to run the experiments, such as specific CPU or GPU models. It mentions using the API for GPT-4 but does not detail the local hardware used to interact with the API or run the other programs. |
| Software Dependencies | No | The paper mentions the 'psiTurk framework' and 'OpenAI's API' but does not provide specific version numbers for these or any other key software components used in the experiments. |
| Experiment Setup | Yes | To test this language-only version of GPT-4 on the tasks in ConceptARC, we used the API provided by OpenAI. In our experiments the model name was set to gpt-4, the temperature was set to 0 or 0.5, and other parameters were left at their default values. In the case where temperature was set to 0.5, we repeated each task prompt three times, and if at least one of the outputs was correct, we considered the task to be solved correctly. For human studies: Each participant was presented with a random selection of tasks (17 for most participants...), and participants were given three attempts to solve each test input. Exclusion criteria were also detailed. |
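The temperature-0.5 scoring rule described in the Experiment Setup row (repeat each task prompt three times; count the task as solved if at least one output is correct) can be sketched as follows. This is a minimal illustration, not code from the paper: `query_model` is a hypothetical stand-in for an OpenAI chat-completion call, and the attempt/temperature defaults mirror the values reported above.

```python
from typing import Callable

def task_solved(
    prompt: str,
    expected_output: str,
    query_model: Callable[[str, float], str],
    attempts: int = 3,        # three repeats per task, per the paper
    temperature: float = 0.5, # sampling temperature used in the repeated runs
) -> bool:
    """Return True if at least one of `attempts` sampled outputs is correct."""
    return any(
        query_model(prompt, temperature) == expected_output
        for _ in range(attempts)
    )

# Stub model for illustration: correct on the second of three calls.
responses = iter(["wrong", "right", "wrong"])
stub = lambda prompt, temperature: next(responses)
print(task_solved("task prompt", "right", stub))  # True
```

At temperature 0 the model is effectively deterministic, so a single attempt (`attempts=1`) suffices for that condition.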