On the Value of Out-of-Distribution Testing: An Example of Goodhart's Law
Authors: Damien Teney, Ehsan Abbasnejad, Kushal Kafle, Robik Shrestha, Christopher Kanan, Anton van den Hengel
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that embarrassingly-simple methods, including one that generates answers at random, surpass the state of the art on some question types. We provide short- and long-term solutions to avoid these pitfalls and realize the benefits of OOD evaluation. |
| Researcher Affiliation | Collaboration | 1 Australian Institute for Machine Learning, University of Adelaide, Australia; 2 Adobe Research; 3 Rochester Institute of Technology |
| Pseudocode | No | The paper describes methods in text and equations but does not include explicit pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing code for the described methodology or a link to a code repository. |
| Open Datasets | Yes | The VQA-CP dataset (for Changing Priors) was designed to evaluate VQA models in a setting where they cannot rely on language biases. The dataset was built by reorganizing the training/test splits of VQA v2 as follows. The questions are assigned to one of 65 question types according to their prefix (first few words). The prefixes were defined in [20]. All question/image/answer triplets are then clustered according to the combination of prefix and answer. |
| Dataset Splits | Yes | On VQA-CP, we hold out 8,000 instances from the training set (VQA-CP val.) to measure in-domain performance as proposed in [22, 14, 41]. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments are provided in the paper. |
| Software Dependencies | No | No specific software dependencies or versions (e.g., programming languages, libraries, frameworks with version numbers) are mentioned in the paper. |
| Experiment Setup | Yes | The regularizer weight λ allows tuning the trade-off between in-domain and OOD performance. We plot in Fig. 4 and Fig. 5 (in the supp. mat.) the accuracy as a function of λ. |
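
The grouping step quoted under Open Datasets and the hold-out quoted under Dataset Splits can be illustrated with a minimal sketch. It is not the authors' code: the function names, the toy `PREFIXES` list (standing in for the 65 prefixes defined in [20]), the longest-match rule, and the uniform random hold-out are all assumptions made for illustration.

```python
import random
from collections import defaultdict

# Hypothetical stand-in for the 65 question prefixes defined in [20].
PREFIXES = ["what color", "what", "is the", "how many"]

def question_prefix(question, prefixes=PREFIXES):
    """Return the longest known prefix matching the question's first words, else 'none'."""
    q = question.lower().strip() + " "
    matches = [p for p in prefixes if q.startswith(p + " ")]
    return max(matches, key=len) if matches else "none"

def cluster_by_prefix_and_answer(triplets, prefixes=PREFIXES):
    """Group (question, image_id, answer) triplets by (prefix, answer)."""
    clusters = defaultdict(list)
    for question, image_id, answer in triplets:
        key = (question_prefix(question, prefixes), answer)
        clusters[key].append((question, image_id, answer))
    return dict(clusters)

def hold_out(train, n_val, seed=0):
    """Split off n_val instances from the training set (assumed uniform at random)."""
    rng = random.Random(seed)
    shuffled = list(train)
    rng.shuffle(shuffled)
    return shuffled[n_val:], shuffled[:n_val]

toy = [
    ("What color is the cat?", 1, "black"),
    ("What color is the dog?", 2, "black"),
    ("What is this?", 3, "cat"),
    ("Is the sky blue?", 4, "yes"),
]
clusters = cluster_by_prefix_and_answer(toy)
train_rest, val = hold_out(toy, n_val=1)
```

For VQA-CP val., `n_val` would be 8,000 applied to the VQA-CP training set. How the resulting clusters are then distributed between training and test is not covered by the text quoted here, so the sketch stops at the grouping.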