Understanding Dataset Difficulty with $\mathcal{V}$-Usable Information
Authors: Kawin Ethayarajh, Yejin Choi, Swabha Swayamdipta
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Estimating the difficulty of a dataset typically involves comparing state-of-the-art models to humans; the bigger the performance gap, the harder the dataset is said to be. However, this comparison provides little understanding of how difficult each instance in a given distribution is, or what attributes make the dataset difficult for a given model. To address these questions, we frame dataset difficulty w.r.t. a model $\mathcal{V}$ as the lack of $\mathcal{V}$-usable information (Xu et al., 2019)... Figure 2 shows the $\mathcal{V}$-information estimate for all four, as well as their accuracy on the SNLI train and heldout (test) sets, across 10 training epochs. ... Table 1 shows the most difficult (lowest PVI) instances from CoLA. (The quantities named here are restated formally after the table.) |
| Researcher Affiliation | Collaboration | ¹Stanford University, ²Allen Institute for Artificial Intelligence, ³Paul G. Allen School of Computer Science, University of Washington. |
| Pseudocode | Yes | Algorithm 1: after finetuning on a dataset of size $n$, the $\mathcal{V}$-information and PVI can be calculated in $O(n)$ time. Input: training data $D_{\text{train}} = \{(\text{input } x_i, \text{gold label } y_i)\}_{i=1}^{m}$, held-out data $D_{\text{test}} = \{(\text{input } x_i, \text{gold label } y_i)\}_{i=1}^{n}$, model $\mathcal{V}$. Finetune $g'$ from $\mathcal{V}$ on $D_{\text{train}}$; let $\varnothing$ denote the empty string (null input) and finetune $g$ from $\mathcal{V}$ on $\{(\varnothing, y_i) \mid (x_i, y_i) \in D_{\text{train}}\}$. Initialize $H_{\mathcal{V}}(Y), H_{\mathcal{V}}(Y \mid X) \leftarrow 0, 0$. For each $(x_i, y_i) \in D_{\text{test}}$: $H_{\mathcal{V}}(Y) \leftarrow H_{\mathcal{V}}(Y) - \frac{1}{n} \log_2 g[\varnothing](y_i)$; $H_{\mathcal{V}}(Y \mid X) \leftarrow H_{\mathcal{V}}(Y \mid X) - \frac{1}{n} \log_2 g'[x_i](y_i)$; $\mathrm{PVI}(x_i \to y_i) \leftarrow -\log_2 g[\varnothing](y_i) + \log_2 g'[x_i](y_i)$. Finally, $\hat{I}_{\mathcal{V}}(X \to Y) = \frac{1}{n} \sum_i \mathrm{PVI}(x_i \to y_i) = H_{\mathcal{V}}(Y) - H_{\mathcal{V}}(Y \mid X)$. (A runnable sketch of this algorithm appears after the table.) |
| Open Source Code | Yes | Our code and data are available here. |
| Open Datasets | Yes | We consider the natural language inference (NLI) task, which involves predicting whether a text hypothesis entails, contradicts or is neutral to a text premise. We first apply the $\mathcal{V}$-information framework to estimate the difficulty of a large-scale NLI dataset, Stanford NLI (SNLI; Bowman et al., 2015)... We consider the MultiNLI dataset (Williams et al., 2018)... Also shown is CoLA (Warstadt et al., 2018)... as well as DWMW17 (Davidson et al., 2017), a dataset for hate speech detection... |
| Dataset Splits | No | The paper mentions 'train' and 'held-out (test)' sets (e.g., 'Figure 2 shows the V-information estimate for all four, as well as their accuracy on the SNLI train and heldout (test) sets, across 10 training epochs.'), but it does not explicitly describe a separate 'validation' split. |
| Hardware Specification | No | The paper does not provide specific details regarding the hardware (e.g., CPU, GPU models, or cloud computing instances with specifications) used to run the experiments. |
| Software Dependencies | No | The paper mentions specific models like 'GPT2-small', 'BERT-base-cased', 'DistilBERT-base-uncased', 'BART-base', and 'RoBERTa-large', and also 'spaCy’s built-in sentiment classifier', but it does not provide specific version numbers for the software libraries, frameworks, or programming languages used (e.g., PyTorch 1.x, Python 3.x). |
| Experiment Setup | Yes | Figure 2 shows the $\mathcal{V}$-information estimate for all four, as well as their accuracy on the SNLI train and heldout (test) sets, across 10 training epochs. Training with the cross-entropy loss finds the $f \in \mathcal{V}$ that maximizes the log-likelihood of $Y$ given $X$ (Xu et al., 2019). |
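
For reference, the quantities quoted in the rows above are related as follows. This restates the definitions used in Algorithm 1, in the paper's notation, where $g'$ is $\mathcal{V}$ finetuned on $(x_i, y_i)$ pairs and $g$ is $\mathcal{V}$ finetuned on null-input pairs $(\varnothing, y_i)$:

```latex
% PVI: how many more bits of confidence the finetuned model g' places
% on the gold label y_i given the input x_i than the null model g does.
\mathrm{PVI}(x_i \to y_i) = -\log_2 g[\varnothing](y_i) + \log_2 g'[x_i](y_i)

% The V-information estimate is the mean PVI over the n held-out
% instances, i.e. the gap between the two empirical V-entropies.
\hat{I}_{\mathcal{V}}(X \to Y)
  = \frac{1}{n} \sum_{i=1}^{n} \mathrm{PVI}(x_i \to y_i)
  = H_{\mathcal{V}}(Y) - H_{\mathcal{V}}(Y \mid X)
```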
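
The Pseudocode row promises a runnable sketch of Algorithm 1; here is a minimal one in Python. This is not the authors' released implementation: the function name `pvi_and_v_information` and the model interface (callables that map an input to a dict of label probabilities, standing in for the two finetuned models) are assumptions made for illustration.

```python
import math

def pvi_and_v_information(g_prime, g, test_data, null_input=""):
    """Sketch of Algorithm 1 (Ethayarajh et al., 2022).

    g_prime   -- hypothetical model finetuned on (x, y) pairs;
                 g_prime(x) returns {label: probability}.
    g         -- hypothetical model finetuned on (null_input, y) pairs,
                 with the same interface.
    test_data -- list of held-out (x, y) pairs.

    Returns the per-instance PVI values and the V-information estimate.
    """
    n = len(test_data)
    h_y = 0.0          # empirical V-entropy H_V(Y)
    h_y_given_x = 0.0  # empirical conditional V-entropy H_V(Y | X)
    pvis = []
    for x, y in test_data:
        p_null = g(null_input)[y]  # null model's probability of the gold label
        p_cond = g_prime(x)[y]     # finetuned model's probability given x
        h_y -= math.log2(p_null) / n
        h_y_given_x -= math.log2(p_cond) / n
        # PVI(x -> y) = -log2 g[null](y) + log2 g'[x](y)
        pvis.append(-math.log2(p_null) + math.log2(p_cond))
    # V-information: mean PVI, equivalently H_V(Y) - H_V(Y | X)
    v_info = h_y - h_y_given_x
    return pvis, v_info
```

As in the paper, the lowest-PVI held-out instances are the ones the model finds hardest (cf. the CoLA examples in Table 1), and the mean PVI is the dataset-level difficulty estimate.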