Understanding Dataset Difficulty with $\mathcal{V}$-Usable Information
Authors: Kawin Ethayarajh, Yejin Choi, Swabha Swayamdipta
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Estimating the difficulty of a dataset typically involves comparing state-of-the-art models to humans; the bigger the performance gap, the harder the dataset is said to be. However, this comparison provides little understanding of how difficult each instance in a given distribution is, or what attributes make the dataset difficult for a given model. To address these questions, we frame dataset difficulty w.r.t. a model $\mathcal{V}$ as the lack of $\mathcal{V}$-usable information (Xu et al., 2019)... Figure 2 shows the $\mathcal{V}$-information estimate for all four, as well as their accuracy on the SNLI train and heldout (test) sets, across 10 training epochs. ... Table 1 shows the most difficult (lowest PVI) instances from CoLA. (The quantities named here are restated formally after the table.) |
| Researcher Affiliation | Collaboration | ¹Stanford University, ²Allen Institute for Artificial Intelligence, ³Paul G. Allen School of Computer Science, University of Washington. |
| Pseudocode | Yes | Algorithm 1: after finetuning on a dataset of size $n$, the $\mathcal{V}$-information and PVI can be calculated in $O(n)$ time. Input: training data $D_{\text{train}} = \{(\text{input } x_i, \text{gold label } y_i)\}_{i=1}^{m}$, held-out data $D_{\text{test}} = \{(\text{input } x_i, \text{gold label } y_i)\}_{i=1}^{n}$, model $\mathcal{V}$. Finetune $g'$ from $\mathcal{V}$ on $D_{\text{train}}$; let $\varnothing$ denote the empty string (null input) and finetune $g$ from $\mathcal{V}$ on $\{(\varnothing, y_i) \mid (x_i, y_i) \in D_{\text{train}}\}$. Initialize $H_{\mathcal{V}}(Y), H_{\mathcal{V}}(Y \mid X) \leftarrow 0, 0$. For each $(x_i, y_i) \in D_{\text{test}}$: $H_{\mathcal{V}}(Y) \leftarrow H_{\mathcal{V}}(Y) - \frac{1}{n} \log_2 g[\varnothing](y_i)$; $H_{\mathcal{V}}(Y \mid X) \leftarrow H_{\mathcal{V}}(Y \mid X) - \frac{1}{n} \log_2 g'[x_i](y_i)$; $\mathrm{PVI}(x_i \to y_i) \leftarrow -\log_2 g[\varnothing](y_i) + \log_2 g'[x_i](y_i)$. Finally, $\hat{I}_{\mathcal{V}}(X \to Y) = \frac{1}{n} \sum_i \mathrm{PVI}(x_i \to y_i) = H_{\mathcal{V}}(Y) - H_{\mathcal{V}}(Y \mid X)$. (A runnable sketch of this algorithm appears after the table.) |
| Open Source Code | Yes | Our code and data are available here. |
| Open Datasets | Yes | We consider the natural language inference (NLI) task, which involves predicting whether a text hypothesis entails, contradicts or is neutral to a text premise. We first apply the $\mathcal{V}$-information framework to estimate the difficulty of a large-scale NLI dataset, Stanford NLI (SNLI; Bowman et al., 2015)... We consider the MultiNLI dataset (Williams et al., 2018)... Also shown is CoLA (Warstadt et al., 2018)... as well as DWMW17 (Davidson et al., 2017), a dataset for hate speech detection... |
| Dataset Splits | No | The paper mentions 'train' and 'held-out (test)' sets (e.g., 'Figure 2 shows the V-information estimate for all four, as well as their accuracy on the SNLI train and heldout (test) sets, across 10 training epochs.'), but it does not explicitly describe a separate 'validation' split. |
| Hardware Specification | No | The paper does not provide specific details regarding the hardware (e.g., CPU, GPU models, or cloud computing instances with specifications) used to run the experiments. |
| Software Dependencies | No | The paper mentions specific models like 'GPT2-small', 'BERT-base-cased', 'DistilBERT-base-uncased', 'BART-base', and 'RoBERTa-large', and also 'spaCy’s built-in sentiment classifier', but it does not provide specific version numbers for the software libraries, frameworks, or programming languages used (e.g., PyTorch 1.x, Python 3.x). |
| Experiment Setup | Yes | Figure 2 shows the $\mathcal{V}$-information estimate for all four, as well as their accuracy on the SNLI train and heldout (test) sets, across 10 training epochs. Training with the cross-entropy loss finds the $f \in \mathcal{V}$ that maximizes the log-likelihood of $Y$ given $X$ (Xu et al., 2019). |
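
For reference, the quantities quoted in the rows above are related as follows. This restates the definitions used in Algorithm 1, in the paper's notation, where $g'$ is $\mathcal{V}$ finetuned on $(x_i, y_i)$ pairs and $g$ is $\mathcal{V}$ finetuned on null-input pairs $(\varnothing, y_i)$:

```latex
% PVI: how many more bits of confidence the finetuned model g' places
% on the gold label y_i given the input x_i than the null model g does.
\mathrm{PVI}(x_i \to y_i) = -\log_2 g[\varnothing](y_i) + \log_2 g'[x_i](y_i)

% The V-information estimate is the mean PVI over the n held-out
% instances, i.e. the gap between the two empirical V-entropies.
\hat{I}_{\mathcal{V}}(X \to Y)
  = \frac{1}{n} \sum_{i=1}^{n} \mathrm{PVI}(x_i \to y_i)
  = H_{\mathcal{V}}(Y) - H_{\mathcal{V}}(Y \mid X)
```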
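
The Pseudocode row promises a runnable sketch of Algorithm 1; here is a minimal one in Python. This is not the authors' released implementation: the function name `pvi_and_v_information` and the model interface (callables that map an input to a dict of label probabilities, standing in for the two finetuned models) are assumptions made for illustration.

```python
import math

def pvi_and_v_information(g_prime, g, test_data, null_input=""):
    """Sketch of Algorithm 1 (Ethayarajh et al., 2022).

    g_prime   -- hypothetical model finetuned on (x, y) pairs;
                 g_prime(x) returns {label: probability}.
    g         -- hypothetical model finetuned on (null_input, y) pairs,
                 with the same interface.
    test_data -- list of held-out (x, y) pairs.

    Returns the per-instance PVI values and the V-information estimate.
    """
    n = len(test_data)
    h_y = 0.0          # empirical V-entropy H_V(Y)
    h_y_given_x = 0.0  # empirical conditional V-entropy H_V(Y | X)
    pvis = []
    for x, y in test_data:
        p_null = g(null_input)[y]  # null model's probability of the gold label
        p_cond = g_prime(x)[y]     # finetuned model's probability given x
        h_y -= math.log2(p_null) / n
        h_y_given_x -= math.log2(p_cond) / n
        # PVI(x -> y) = -log2 g[null](y) + log2 g'[x](y)
        pvis.append(-math.log2(p_null) + math.log2(p_cond))
    # V-information: mean PVI, equivalently H_V(Y) - H_V(Y | X)
    v_info = h_y - h_y_given_x
    return pvis, v_info
```

As in the paper, the lowest-PVI held-out instances are the ones the model finds hardest (cf. the CoLA examples in Table 1), and the mean PVI is the dataset-level difficulty estimate.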