First is Better Than Last for Language Data Influence
Authors: Chih-Kuan Yeh, Ankur Taly, Mukund Sundararajan, Frederick Liu, Pradeep Ravikumar
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that TracIn-WE significantly outperforms other data influence methods applied on the last layer on the case deletion evaluation on three language classification tasks for different models. In addition, TracIn-WE can produce scores not just at the level of the overall training input, but also at the level of words within the training input, a further aid in debugging. |
| Researcher Affiliation | Collaboration | Chih-Kuan Yeh (Google Inc., chihkuanyeh@google.com); Ankur Taly (Google Inc., ataly@google.com); Mukund Sundararajan (Google Inc., mukunds@google.com); Frederick Liu (Google Inc., frederickliu@google.com); Pradeep Ravikumar (Carnegie Mellon University, Department of Machine Learning, pradeepr@cs.cmu.edu) |
| Pseudocode | No | The paper does not contain any sections or figures explicitly labeled 'Pseudocode' or 'Algorithm', nor does it present structured steps formatted like code blocks. |
| Open Source Code | Yes | Code is in https://github.com/chihkuanyeh/TracIn-WE. |
| Open Datasets | Yes | We first experiment on the toxicity comment classification dataset (Kaggle.com, 2018), which contains sentences that are labeled toxic or non-toxic. [...] We next experiment on the AG-news-subset (Gulli, 2015; Zhang et al., 2015), which contains a corpus of news with 4 different classes. [...] Finally, we test on a larger scale dataset, Multi-Genre Natural Language Inference (MultiNLI; Williams et al., 2018), which consists of 433k sentence pairs with textual entailment information, including entailment, neutral, and contradiction. |
| Dataset Splits | Yes | We randomly choose 50,000 training samples and 20,000 validation samples. [...] We follow our setting in toxicity and choose 50,000 training samples, 20,000 validation samples [...] In this experiment, we use the full training and validation set (see the subsampling sketch after the table) |
| Hardware Specification | Yes | All experiments were run on a server with 1 NVIDIA V100 GPU and 60GB of RAM. |
| Software Dependencies | No | The paper mentions using BERT and RoBERTa models, and that PyTorch is used for TracIn, but it does not specify exact version numbers for any of these software components or libraries. |
| Experiment Setup | Yes | We then fine-tune a BERT-small model on our training set, which leads to 96% accuracy. [...] For each example x0 in the test set, we remove the top-k proponents and top-k opponents in the training set respectively, and retrain the model to obtain DEL+(x0, k, I) and DEL−(x0, k, I) for each influence method I. We vary k over {10, 20, ..., 100}. For each k, we retrain the model 10 times and take the average result, and then average over the 40 test points. (See the evaluation sketch below.) |
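The Dataset Splits row reports random subsampling to 50,000 training and 20,000 validation examples. Below is a minimal sketch of that subsampling, assuming the Hugging Face `datasets` library and AG News as the example corpus; the seed and the loading code are assumptions, since the paper does not publish them in this table.

```python
# Hypothetical subsampling sketch: 50,000 train / 20,000 validation examples,
# as stated in the Dataset Splits row. The dataset choice (AG News via the
# Hugging Face hub) and the seed are assumptions, not from the paper.
from datasets import load_dataset

SEED = 0  # assumed seed; the paper does not state one

ds = load_dataset("ag_news")
shuffled = ds["train"].shuffle(seed=SEED)            # AG News has 120,000 training examples
train_subset = shuffled.select(range(50_000))        # random 50,000 for training
val_subset = shuffled.select(range(50_000, 70_000))  # disjoint 20,000 for validation
```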
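The Experiment Setup row outlines the paper's case deletion evaluation. The sketch below illustrates that loop under stated assumptions: `train_model` and `loss_on` are hypothetical helpers (the actual training and influence code is in the linked repository), and the loss-change metric is a stand-in for the paper's deletion score.

```python
# Sketch of the case deletion evaluation: remove the top-k proponents or
# opponents of a test point under an influence method, retrain, and measure
# the change on that test point. `train_model` and `loss_on` are hypothetical.
import numpy as np

def case_deletion_curve(train_set, x_test, influence_scores, k_values,
                        train_model, loss_on, n_retrain=10, proponents=True):
    """DEL+ (proponents=True) removes top-k proponents of x_test;
    DEL- (proponents=False) removes top-k opponents."""
    base_loss = loss_on(train_model(train_set), x_test)
    scores = np.asarray(influence_scores)
    # Proponents have the largest influence scores on x_test, opponents the smallest.
    order = np.argsort(-scores) if proponents else np.argsort(scores)
    curve = []
    for k in k_values:  # the paper varies k over {10, 20, ..., 100}
        removed = {int(i) for i in order[:k]}
        kept = [ex for i, ex in enumerate(train_set) if i not in removed]
        # Retrain several times (the paper uses 10 runs) and average
        # to smooth out training noise.
        deltas = [loss_on(train_model(kept), x_test) - base_loss
                  for _ in range(n_retrain)]
        curve.append(float(np.mean(deltas)))
    return curve
```

Per the setup described above, these per-test-point curves are then averaged over the 40 test points; a better influence method should produce larger changes when its top-ranked proponents or opponents are deleted.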