CURI: A Benchmark for Productive Concept Learning Under Uncertainty
Authors: Ramakrishna Vedantam, Arthur Szlam, Maximilian Nickel, Ari Morcos, Brenden M. Lake
ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We introduce a new benchmark, Compositional Reasoning Under Uncertainty (CURI), that instantiates a series of few-shot, meta-learning tasks in a productive concept space to evaluate different aspects of systematic generalization under uncertainty, including splits that test abstract understandings of disentangling, productive generalization, learning boolean operations, variable binding, etc. Importantly, we also contribute a model-independent compositionality gap to evaluate the difficulty of generalizing out-of-distribution along each of these axes, allowing objective comparison of the difficulty of each compositional split. Evaluations across a range of modeling choices and splits reveal substantial room for improvement on the proposed benchmark. (A sketch of the compositionality-gap idea follows the table.) |
| Researcher Affiliation | Collaboration | ¹Facebook AI Research (FAIR), USA; ²New York University (NYU), USA. Correspondence to: Ramakrishna Vedantam <ramav@fb.com>. |
| Pseudocode | No | The paper describes the model architectures and training objectives in textual form and mathematical equations, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | No | The paper does not contain an explicit statement or a link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | No | The paper describes how its custom dataset was generated (e.g., "The compositional concepts in CURI were inspired by the empirical and cognitive modeling work of Piantadosi et al. (2016).", "yields a set of 14,929 concepts H for training and evaluation") but does not provide a specific link, DOI, or repository for public access to the dataset itself. |
| Dataset Splits | Yes | Altogether, for each split, our train, validation, and test sets contain 500000, 5000, and 20000 episodes respectively. |
| Hardware Specification | No | The paper describes the model architectures (e.g., ResNet-18 encoder, transformer, relation-network) but does not specify the particular hardware (e.g., CPU, GPU models, or cloud computing instances with their specifications) used to train or run the experiments. |
| Software Dependencies | No | The paper mentions PyTorch and Hydra as frameworks used but does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | No | While training steps and some experimental conditions are mentioned ("All models are trained for 1 million steps, and are run with 3 independent training runs to report standard deviations. We sweep over 3 modalities (image, schema, sound), 4 pooling schemes (avg-pool, concat, relation-net, transformer), 2 choices of negatives (hard negatives, random negatives) and choice of language (α = 0.0, 1.0)."), specific hyperparameters such as learning rate, batch size, optimizer details, or more detailed training schedules are not provided in the main text. (A sketch of this sweep grid follows the table.) |
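
The "compositionality gap" mentioned in the Research Type row is a model-independent measure of how hard each out-of-distribution split is. The sketch below illustrates the general idea only: it assumes, as a simplification, that the gap is the difference between an oracle's accuracy when it can search the full concept space and its accuracy when restricted to training concepts, and it uses a MAP shortcut in place of a full posterior-predictive oracle. The `log_likelihood`/`predict` interface is hypothetical, not the authors' released code.

```python
import numpy as np

def oracle_accuracy(episodes, hypothesis_space):
    """For each episode, pick the concept in `hypothesis_space` that best
    explains the support set and use it to label the query set. This MAP
    choice is a simplification of a posterior-predictive oracle."""
    accuracies = []
    for support, query in episodes:
        best_h = max(hypothesis_space, key=lambda h: h.log_likelihood(support))
        accuracies.append(np.mean([best_h.predict(x) == y for x, y in query]))
    return float(np.mean(accuracies))

def compositionality_gap(test_episodes, train_concepts, all_concepts):
    """Gap = oracle accuracy with the full concept space minus oracle accuracy
    restricted to training concepts, on the same test episodes of a split.
    A larger gap suggests the split demands more compositional generalization."""
    strong = oracle_accuracy(test_episodes, all_concepts)
    weak = oracle_accuracy(test_episodes, train_concepts)
    return strong - weak
```

Because such a gap depends only on the concept space and the split, it can be computed before training any neural model, which is what enables the objective comparison of split difficulty noted above.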
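
The Experiment Setup row quotes a sweep over 3 modalities, 4 pooling schemes, 2 negative-sampling choices, and 2 language settings, with 3 independent runs each. The snippet below simply enumerates that grid to make the bookkeeping explicit (3 × 4 × 2 × 2 = 48 configurations, 144 runs with 3 seeds); the dictionary keys and seed values are assumptions, not the authors' actual Hydra configuration.

```python
from itertools import product

# Sweep axes as quoted from the paper; key names and seeds are hypothetical.
modalities = ["image", "schema", "sound"]
pooling = ["avg-pool", "concat", "relation-net", "transformer"]
negatives = ["hard", "random"]
language_alpha = [0.0, 1.0]
seeds = [0, 1, 2]  # 3 independent training runs per configuration

configs = [
    {"modality": m, "pooling": p, "negatives": n, "alpha": a, "seed": s}
    for m, p, n, a, s in product(modalities, pooling, negatives, language_alpha, seeds)
]

print(len(configs))  # 3 * 4 * 2 * 2 * 3 = 144 training runs
```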