Learning from others' mistakes: Avoiding dataset biases without modeling them
Authors: Victor Sanh, Thomas Wolf, Yonatan Belinkov, Alexander M Rush
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our approach in various settings ranging from toy datasets up to large crowd-sourced benchmarks: controlled synthetic bias setup (He et al., 2019; Clark et al., 2019), natural language inference (McCoy et al., 2019b), extractive question answering (Jia & Liang, 2017) and fact verification (Schuster et al., 2019). |
| Researcher Affiliation | Collaboration | Victor Sanh1, Thomas Wolf1, Yonatan Belinkov2, Alexander M. Rush1 1Hugging Face, 2Technion - Israel Institute of Technology |
| Pseudocode | No | The paper describes its methods verbally but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | No | The paper states 'Our code is based on the Hugging Face Transformers library (Wolf et al., 2019)', but does not provide a link or explicit statement about releasing the source code for the methodology described in this paper. |
| Open Datasets | Yes | MNLI (Williams et al., 2018) is the canonical large-scale English dataset to study this problem with 433K labeled examples. |
| Dataset Splits | Yes | For evaluation, it features matched sets (examples from domains encountered in training) and mismatched sets (domains not seen during training). |
| Hardware Specification | Yes | All of our experiments are conducted on a single 16GB V100 using half-precision training for speed. |
| Software Dependencies | No | The paper mentions 'Our code is based on the Hugging Face Transformers library (Wolf et al., 2019)' but does not specify its version number or versions for other key software components like Python, PyTorch, or TensorFlow. |
| Experiment Setup | Yes | We use the following hyper-parameters: 3 epochs of training with a learning rate of 3e-5, and a batch size of 32. The learning rate is linearly increased for 2000 warming steps and linearly decreased to 0 afterward. We use an Adam optimizer with β = (0.9, 0.999), ϵ = 1e-8 and add a weight decay of 0.1. |
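
The matched/mismatched evaluation protocol quoted in the Dataset Splits row maps directly onto the public MNLI release. Below is a minimal sketch, assuming the Hugging Face `datasets` library and its `multi_nli` dataset on the Hub (not something the paper itself references), of how those splits can be loaded:

```python
# Minimal sketch (assumption: the Hub's "multi_nli" dataset), not the authors' code.
from datasets import load_dataset

mnli = load_dataset("multi_nli")

train = mnli["train"]                           # ~393K labeled training examples
dev_matched = mnli["validation_matched"]        # domains seen during training
dev_mismatched = mnli["validation_mismatched"]  # domains not seen during training

# Each example carries a premise, a hypothesis, and a 3-way label.
example = train[0]
print(example["premise"], example["hypothesis"], example["label"])
```

Reporting accuracy separately on the matched and mismatched validation sets follows the standard MNLI evaluation convention referenced in the table above.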
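
The hyper-parameters quoted in the Experiment Setup row translate into a standard fine-tuning configuration. The sketch below wires them up with PyTorch and the Transformers scheduler helper; the base checkpoint name and the MNLI training-set size are assumptions for illustration, not values stated in the table, and this does not include the paper's debiasing method itself.

```python
# A minimal sketch under stated assumptions, not the authors' released code.
import torch
from transformers import (
    AutoModelForSequenceClassification,
    get_linear_schedule_with_warmup,
)

# Assumed base model; MNLI is a 3-class task.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3
)

# Adam with beta = (0.9, 0.999), eps = 1e-8, weight decay 0.1, lr 3e-5.
optimizer = torch.optim.AdamW(
    model.parameters(), lr=3e-5, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.1
)

# 3 epochs at batch size 32; ~393K MNLI training examples is an assumption here.
num_epochs, batch_size, num_train_examples = 3, 32, 392_702
total_steps = num_epochs * (num_train_examples // batch_size)

# Linear warm-up for 2000 steps, then linear decay to 0.
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=2000, num_training_steps=total_steps
)

# Inside the training loop, call scheduler.step() after each optimizer.step().
# Half-precision training (torch.cuda.amp) keeps this within a 16GB V100.
```
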