Learning from others' mistakes: Avoiding dataset biases without modeling them

Authors: Victor Sanh, Thomas Wolf, Yonatan Belinkov, Alexander M. Rush

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our approach in various settings ranging from toy datasets up to large crowd-sourced benchmarks: controlled synthetic bias setup (He et al., 2019; Clark et al., 2019), natural language inference (McCoy et al., 2019b), extractive question answering (Jia & Liang, 2017) and fact verification (Schuster et al., 2019).
Researcher Affiliation | Collaboration | Victor Sanh¹, Thomas Wolf¹, Yonatan Belinkov², Alexander M. Rush¹ (¹Hugging Face, ²Technion - Israel Institute of Technology)
Pseudocode | No | The paper describes its methods verbally but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | No | The paper states 'Our code is based on the Hugging Face Transformers library (Wolf et al., 2019)' but does not provide a link to, or an explicit statement about releasing, the source code for the methodology described in this paper.
Open Datasets | Yes | MNLI (Williams et al., 2018) is the canonical large-scale English dataset to study this problem with 433K labeled examples.
Dataset Splits | Yes | For evaluation, it features matched sets (examples from domains encountered in training) and mismatched sets (domains not seen during training). A data-loading sketch follows the table.
Hardware Specification | Yes | All of our experiments are conducted on a single 16GB V100 using half-precision training for speed.
Software Dependencies | No | The paper mentions 'Our code is based on the Hugging Face Transformers library (Wolf et al., 2019)' but does not specify its version number or the versions of other key software components such as Python, PyTorch, or TensorFlow.
Experiment Setup | Yes | We use the following hyper-parameters: 3 epochs of training with a learning rate of 3e-5, and a batch size of 32. The learning rate is linearly increased for 2000 warming steps and linearly decreased to 0 afterward. We use an Adam optimizer (β = (0.9, 0.999), ϵ = 1e-8) and add a weight decay of 0.1.
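
To make the matched/mismatched evaluation setup quoted in the Dataset Splits row concrete, the sketch below loads MNLI with the Hugging Face `datasets` library. The use of this library and the hub split names are assumptions for illustration only; the paper does not state how the data was loaded.

```python
# Minimal sketch, assuming the Hugging Face `datasets` library (not stated in the paper).
# Split names follow the hub's MNLI dataset card.
from datasets import load_dataset

mnli = load_dataset("multi_nli")
train = mnli["train"]                        # labeled training examples
matched = mnli["validation_matched"]         # evaluation: domains seen during training
mismatched = mnli["validation_mismatched"]   # evaluation: domains not seen during training

print(len(train), len(matched), len(mismatched))
```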
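
The Experiment Setup and Hardware Specification rows together specify a full optimizer and scheduler configuration. A hedged sketch using Hugging Face `TrainingArguments` is shown below; the `Trainer`-style API and the `output_dir` value are illustrative assumptions, while the numeric values are the ones quoted above.

```python
# Sketch of the quoted hyper-parameters expressed as Hugging Face TrainingArguments.
# The TrainingArguments API and output_dir are assumptions; only the numbers come from the paper.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="out",                  # placeholder path
    num_train_epochs=3,
    learning_rate=3e-5,
    per_device_train_batch_size=32,
    lr_scheduler_type="linear",        # linear decay to 0 after warm-up
    warmup_steps=2000,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    weight_decay=0.1,
    fp16=True,                         # half-precision, as run on a single 16GB V100
)
```

Whether the paper uses plain Adam or the decoupled AdamW variant is not stated; the sketch simply inherits the `Trainer` default while matching the quoted β, ϵ, and weight-decay values.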