Identifying and Benchmarking Natural Out-of-Context Prediction Problems

Authors: David Madras, Richard Zemel

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimentally, we explore the tradeoffs between various learning approaches on these challenge sets and demonstrate how the choices made in designing OOC benchmarks can yield varying conclusions.
Researcher Affiliation | Collaboration | David Madras (University of Toronto; Vector Institute) madras@cs.toronto.edu; Richard Zemel (University of Toronto; Vector Institute; Columbia University) zemel@cs.toronto.edu
Pseudocode | No | The paper describes algorithms and methods in text, but it does not provide any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | We present NOOCH (Naturally-Occurring Out-of-context Challenge sets), a suite of challenge sets for evaluating performance on naturally-arising OOC problems, available at https://github.com/dmadras/nooch
Open Datasets | Yes | Background: COCO and COCO-Stuff. The Microsoft Common Objects in COntext dataset (COCO) [36] is a computer vision dataset... Fortunately, the COCO-Stuff dataset [7] provides labels... For all experiments we use a ResNet-50 [23], finetuned from ImageNet-pretrained features [53].
Dataset Splits | Yes | Many of the robust baselines from Sec. 4 come with a hyperparameter which aims to trade off between average performance and OOC performance; we choose the hyperparameter which minimizes the maximum loss of hard positives and hard negatives on the validation set. (This selection rule is sketched in code after the table.)
Hardware Specification | No | The paper does not specify the hardware (e.g., GPU type or compute budget) used to run its experiments.
Software Dependencies | No | The paper mentions using standard models and datasets, but it does not list specific software libraries or version numbers.
Experiment Setup | Yes | For all experiments we use a ResNet-50 [23], finetuned from ImageNet-pretrained features [53]. We train binary classifiers to minimize average NLL on each of the 171 classes in COCO-Stuff. For the environment-based methods, we follow Sagawa et al. [54] and create 4 environments: 1 for each element of the cross-product of the label and its highest-α context class. Many of the robust baselines from Sec. 4 come with a hyperparameter which aims to trade off between average performance and OOC performance; we choose the hyperparameter which minimizes the maximum loss of hard positives and hard negatives on the validation set. (The setup and selection rule are sketched in code below.)
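
For concreteness, here is a minimal sketch of the training setup quoted in the Experiment Setup row: a ResNet-50 finetuned from ImageNet-pretrained weights, with one sigmoid output per COCO-Stuff class, trained with binary cross-entropy (average NLL). This is an illustration under stated assumptions (PyTorch/torchvision; whether the paper trains one multi-output head or 171 separate binary classifiers is not specified, and all identifiers and hyperparameters below are placeholders), not the authors' released code.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 171  # number of COCO-Stuff classes, per the paper

# ResNet-50 finetuned from ImageNet-pretrained weights; the final layer is
# replaced with a multi-output binary head (one logit per class).
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

criterion = nn.BCEWithLogitsLoss()  # average NLL for 0/1 presence labels
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

def train_step(images: torch.Tensor, targets: torch.Tensor) -> float:
    """One optimization step. `targets` has shape (batch, NUM_CLASSES)
    with 0/1 presence labels; the optimizer settings are illustrative."""
    optimizer.zero_grad()
    loss = criterion(model(images), targets)
    loss.backward()
    optimizer.step()
    return loss.item()
```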
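The four-environment construction (the cross-product of the binary label with the presence of its highest-α context class) reduces to a simple indexing rule. The function below is a hypothetical illustration of that grouping, not code from the paper.

```python
def environment_id(label: int, context_present: int) -> int:
    """Map (label, context) in {0,1} x {0,1} to one of 4 environment ids,
    one per element of the cross-product described above (illustrative)."""
    return 2 * label + context_present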
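Finally, the hyperparameter-selection rule quoted in the Dataset Splits and Experiment Setup rows is a worst-case (min-max) criterion: among candidate values, pick the one minimizing the larger of the hard-positive and hard-negative validation losses. A hedged sketch, where `validation_losses` is a hypothetical helper that trains and evaluates the model for a given value:

```python
def select_hyperparameter(candidates, validation_losses):
    """Return the candidate minimizing max(hard-pos loss, hard-neg loss)
    on the validation set. `validation_losses(h)` is assumed to return a
    (loss_hard_pos, loss_hard_neg) pair for hyperparameter value `h`."""
    def worst_case(h):
        loss_pos, loss_neg = validation_losses(h)
        return max(loss_pos, loss_neg)
    return min(candidates, key=worst_case)
```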