Identifying and Benchmarking Natural Out-of-Context Prediction Problems
Authors: David Madras, Richard Zemel
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimentally, we explore the tradeoffs between various learning approaches on these challenge sets and demonstrate how the choices made in designing OOC benchmarks can yield varying conclusions. |
| Researcher Affiliation | Collaboration | David Madras University of Toronto Vector Institute madras@cs.toronto.edu Richard Zemel University of Toronto Vector Institute Columbia University zemel@cs.toronto.edu |
| Pseudocode | No | The paper describes algorithms and methods in text, but it does not provide any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We present NOOCH (Naturally-Occurring Out-of-context Challenge sets), a suite of challenge sets for evaluating performance on naturally-arising OOC problems, available at https://github.com/dmadras/nooch; |
| Open Datasets | Yes | Background: COCO and COCO-Stuff. The Microsoft Common Objects in COntext dataset (COCO) [36] is a computer vision dataset... Fortunately, the COCO-Stuff dataset [7] provides labels... For all experiments we use a ResNet-50 [23], finetuned from ImageNet-pretrained features [53]. |
| Dataset Splits | Yes | Many of the robust baselines from Sec. 4 come with a hyperparameter which aims to trade off between average performance and OOC performance; we choose the hyperparameter which minimizes the maximum loss of hard positives and hard negatives on the validation set. |
| Hardware Specification | No | The paper does not state the hardware used to run its experiments. |
| Software Dependencies | No | The paper mentions the software components it uses (e.g., a ResNet-50 finetuned from ImageNet-pretrained features) but does not list specific software dependencies or version numbers. |
| Experiment Setup | Yes | For all experiments we use a ResNet-50 [23], finetuned from ImageNet-pretrained features [53]. We train binary classifiers to minimize average NLL on each of the 171 classes in COCO-Stuff. For the environment-based methods, we follow Sagawa et al. [54] and create 4 environments: 1 for each element of the cross-product of the label and its highest-α context class. Many of the robust baselines from Sec. 4 come with a hyperparameter which aims to trade off between average performance and OOC performance; we choose the hyperparameter which minimizes the maximum loss of hard positives and hard negatives on the validation set. |
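The hyperparameter-selection rule quoted above (pick the setting that minimizes the worst-case validation loss over hard positives and hard negatives) can be sketched as follows. This is a minimal illustration, not code from the NOOCH repository; the function name `select_robust_hyperparameter` and the candidate trade-off weights are assumptions chosen for the example.

```python
def select_robust_hyperparameter(val_losses):
    """Pick the hyperparameter value with the smallest worst-case loss.

    val_losses maps each candidate hyperparameter value to a
    (hard_positive_loss, hard_negative_loss) pair measured on the
    validation set. The selected value minimizes max(pos, neg),
    matching the rule described in the paper's experiment setup.
    """
    return min(val_losses, key=lambda h: max(val_losses[h]))


# Hypothetical validation losses for three candidate trade-off weights
# of a robust baseline (values are illustrative, not from the paper).
val_losses = {
    0.1: (0.42, 0.77),   # strong on hard positives, weak on hard negatives
    1.0: (0.55, 0.58),   # balanced between the two challenge sets
    10.0: (0.81, 0.40),  # the reverse trade-off
}

best = select_robust_hyperparameter(val_losses)
print(best)  # → 1.0, since its worst-case loss (0.58) is the smallest
```

The min-max criterion favors the balanced setting even though the other candidates achieve a lower loss on one of the two challenge sets, which is exactly the trade-off between average and OOC performance the report describes.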