CURI: A Benchmark for Productive Concept Learning Under Uncertainty
Authors: Ramakrishna Vedantam, Arthur Szlam, Maximilian Nickel, Ari Morcos, Brenden M. Lake
ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We introduce a new benchmark, Compositional Reasoning Under Uncertainty (CURI), that instantiates a series of few-shot, meta-learning tasks in a productive concept space to evaluate different aspects of systematic generalization under uncertainty, including splits that test abstract understandings of disentangling, productive generalization, learning boolean operations, variable binding, etc. Importantly, we also contribute a model-independent compositionality gap to evaluate the difficulty of generalizing out-of-distribution along each of these axes, allowing objective comparison of the difficulty of each compositional split. Evaluations across a range of modeling choices and splits reveal substantial room for improvement on the proposed benchmark. (A sketch of the compositionality-gap idea follows the table.) |
| Researcher Affiliation | Collaboration | ¹Facebook AI Research (FAIR), USA; ²New York University (NYU), USA. Correspondence to: Ramakrishna Vedantam <ramav@fb.com>. |
| Pseudocode | No | The paper describes the model architectures and training objectives in textual form and mathematical equations, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | No | The paper does not contain an explicit statement or a link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | No | The paper describes how its custom dataset was generated (e.g., "The compositional concepts in CURI were inspired by the empirical and cognitive modeling work of Piantadosi et al. (2016).", "yields a set of 14,929 concepts H for training and evaluation") but does not provide a specific link, DOI, or repository for public access to the dataset itself. |
| Dataset Splits | Yes | Altogether, for each split, our train, validation, and test sets contain 500000, 5000, and 20000 episodes respectively. |
| Hardware Specification | No | The paper describes the model architectures (e.g., ResNet-18 encoder, transformer, relation-network) but does not specify the particular hardware (e.g., CPU, GPU models, or cloud computing instances with their specifications) used to train or run the experiments. |
| Software Dependencies | No | The paper mentions PyTorch and Hydra as frameworks used but does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | No | While training steps and some experimental conditions are mentioned ("All models are trained for 1 million steps, and are run with 3 independent training runs to report standard deviations. We sweep over 3 modalities (image, schema, sound), 4 pooling schemes (avg-pool, concat, relation-net, transformer), 2 choices of negatives (hard negatives, random negatives) and choice of language (α = 0.0, 1.0)."), specific hyperparameters such as learning rate, batch size, optimizer details, or more detailed training schedules are not provided in the main text. (A sketch of this sweep grid follows the table.) |
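
The "compositionality gap" mentioned in the Research Type row is a model-independent measure of how hard each out-of-distribution split is. The sketch below illustrates the general idea only: it assumes, as a simplification, that the gap is the difference between an oracle's accuracy when it can search the full concept space and its accuracy when restricted to training concepts, and it uses a MAP shortcut in place of a full posterior-predictive oracle. The `log_likelihood`/`predict` interface is hypothetical, not the authors' released code.

```python
import numpy as np

def oracle_accuracy(episodes, hypothesis_space):
    """For each episode, pick the concept in `hypothesis_space` that best
    explains the support set and use it to label the query set. This MAP
    choice is a simplification of a posterior-predictive oracle."""
    accuracies = []
    for support, query in episodes:
        best_h = max(hypothesis_space, key=lambda h: h.log_likelihood(support))
        accuracies.append(np.mean([best_h.predict(x) == y for x, y in query]))
    return float(np.mean(accuracies))

def compositionality_gap(test_episodes, train_concepts, all_concepts):
    """Gap = oracle accuracy with the full concept space minus oracle accuracy
    restricted to training concepts, on the same test episodes of a split.
    A larger gap suggests the split demands more compositional generalization."""
    strong = oracle_accuracy(test_episodes, all_concepts)
    weak = oracle_accuracy(test_episodes, train_concepts)
    return strong - weak
```

Because such a gap depends only on the concept space and the split, it can be computed before training any neural model, which is what enables the objective comparison of split difficulty noted above.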
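
The Experiment Setup row quotes a sweep over 3 modalities, 4 pooling schemes, 2 negative-sampling choices, and 2 language settings, with 3 independent runs each. The snippet below simply enumerates that grid to make the bookkeeping explicit (3 × 4 × 2 × 2 = 48 configurations, 144 runs with 3 seeds); the dictionary keys and seed values are assumptions, not the authors' actual Hydra configuration.

```python
from itertools import product

# Sweep axes as quoted from the paper; key names and seeds are hypothetical.
modalities = ["image", "schema", "sound"]
pooling = ["avg-pool", "concat", "relation-net", "transformer"]
negatives = ["hard", "random"]
language_alpha = [0.0, 1.0]
seeds = [0, 1, 2]  # 3 independent training runs per configuration

configs = [
    {"modality": m, "pooling": p, "negatives": n, "alpha": a, "seed": s}
    for m, p, n, a, s in product(modalities, pooling, negatives, language_alpha, seeds)
]

print(len(configs))  # 3 * 4 * 2 * 2 * 3 = 144 training runs
```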