First is Better Than Last for Language Data Influence

Authors: Chih-Kuan Yeh, Ankur Taly, Mukund Sundararajan, Frederick Liu, Pradeep Ravikumar

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that TracIn-WE significantly outperforms other data influence methods applied on the last layer in the case deletion evaluation on three language classification tasks for different models. In addition, TracIn-WE can produce scores not just at the level of the overall training input, but also at the level of words within the training input, a further aid in debugging.
Researcher Affiliation | Collaboration | Chih-Kuan Yeh, Google Inc., chihkuanyeh@google.com; Ankur Taly, Google Inc., ataly@google.com; Mukund Sundararajan, Google Inc., mukunds@google.com; Frederick Liu, Google Inc., frederickliu@google.com; Pradeep Ravikumar, Carnegie Mellon University, Department of Machine Learning, pradeepr@cs.cmu.edu
Pseudocode | No | The paper does not contain any sections or figures explicitly labeled 'Pseudocode' or 'Algorithm', nor does it present structured steps formatted like code blocks.
Open Source Code | Yes | Code is available at https://github.com/chihkuanyeh/TracIn-WE.
Open Datasets | Yes | We first experiment on the toxicity comment classification dataset (Kaggle.com, 2018), which contains sentences that are labeled toxic or non-toxic. [...] We next experiment on the AG-news-subset (Gulli, 2015; Zhang et al., 2015), which contains a corpus of news with 4 different classes. [...] Finally, we test on a larger scale dataset, Multi-Genre Natural Language Inference (MultiNLI) (Williams et al., 2018), which consists of 433k sentence pairs with textual entailment information, including entailment, neutral, and contradiction.
Dataset Splits | Yes | We randomly choose 50,000 training samples and 20,000 validation samples. [...] We follow our setting in toxicity and choose 50,000 training samples, 20,000 validation samples [...] In this experiment, we use the full training and validation set.
Hardware Specification | Yes | All experiments were run on a server with 1 NVIDIA V100 GPU and 60GB of RAM.
Software Dependencies | No | The paper mentions using BERT models, a RoBERTa model, and PyTorch for TracIn, but it does not specify exact version numbers for any of these software components or libraries.
Experiment Setup | Yes | We then fine-tune a BERT-small model on our training set, which leads to 96% accuracy. [...] For each example x0 in the test set, we remove the top-k proponents and the top-k opponents in the training set respectively, and retrain the model to obtain DEL+(x0, k, I) and DEL−(x0, k, I) for each influence method I. We vary k over {10, 20, ..., 100}. For each k, we retrain the model 10 times and take the average result, and then average over the 40 test points.
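
The case deletion protocol quoted in the Experiment Setup row can be summarized with a short sketch. The helpers train_model(examples) and example_loss(model, example) below are hypothetical stand-ins for the authors' fine-tuning and evaluation code (not taken from the released repository), and the deletion score is approximated here as the change in test-point loss after retraining without the removed examples; averaging the resulting curves over the 40 test points would give per-method deletion curves.

```python
import numpy as np

def case_deletion_curves(train_set, test_point, influence_scores,
                         train_model, example_loss,
                         ks=range(10, 101, 10), n_retrain=10):
    """DEL+ / DEL- style curves for a single test point.

    influence_scores[i] is the influence of train_set[i] on test_point under
    some method I (e.g. TracIn-WE); large positive values mark proponents,
    large negative values mark opponents.
    """
    influence_scores = np.asarray(influence_scores)
    proponents = np.argsort(-influence_scores)   # most positive influence first
    opponents = np.argsort(influence_scores)     # most negative influence first

    # Reference loss of the model trained on the full training set.
    base_loss = example_loss(train_model(train_set), test_point)

    del_plus, del_minus = [], []
    for k in ks:
        drop_p, drop_o = set(proponents[:k]), set(opponents[:k])
        plus_runs, minus_runs = [], []
        for _ in range(n_retrain):               # average over repeated retrainings
            kept = [ex for i, ex in enumerate(train_set) if i not in drop_p]
            plus_runs.append(example_loss(train_model(kept), test_point) - base_loss)
            kept = [ex for i, ex in enumerate(train_set) if i not in drop_o]
            minus_runs.append(example_loss(train_model(kept), test_point) - base_loss)
        del_plus.append(float(np.mean(plus_runs)))
        del_minus.append(float(np.mean(minus_runs)))
    return del_plus, del_minus
```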