First is Better Than Last for Language Data Influence
Authors: Chih-Kuan Yeh, Ankur Taly, Mukund Sundararajan, Frederick Liu, Pradeep Ravikumar
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that TracIn-WE significantly outperforms other data influence methods applied on the last layer on the case deletion evaluation on three language classification tasks for different models. In addition, TracIn-WE can produce scores not just at the level of the overall training input, but also at the level of words within the training input, a further aid in debugging. |
| Researcher Affiliation | Collaboration | Chih-Kuan Yeh (Google Inc., chihkuanyeh@google.com); Ankur Taly (Google Inc., ataly@google.com); Mukund Sundararajan (Google Inc., mukunds@google.com); Frederick Liu (Google Inc., frederickliu@google.com); Pradeep Ravikumar (Carnegie Mellon University, Department of Machine Learning, pradeepr@cs.cmu.edu) |
| Pseudocode | No | The paper does not contain any sections or figures explicitly labeled 'Pseudocode' or 'Algorithm', nor does it present structured steps formatted like code blocks. |
| Open Source Code | Yes | Code is in https://github.com/chihkuanyeh/TracIn-WE. |
| Open Datasets | Yes | We first experiment on the toxicity comment classification dataset (Kaggle.com, 2018), which contains sentences that are labeled toxic or non-toxic. [...] We next experiment on the AG-news-subset (Gulli, 2015; Zhang et al., 2015), which contains a corpus of news with 4 different classes. [...] Finally, we test on a larger scale dataset, Multi-Genre Natural Language Inference (MultiNLI; Williams et al., 2018), which consists of 433k sentence pairs with textual entailment information, including entailment, neutral, and contradiction. |
| Dataset Splits | Yes | We randomly choose 50,000 training samples and 20,000 validation samples. [...] We follow our setting in toxicity and choose 50,000 training samples, 20,000 validation samples [...] In this experiment, we use the full training and validation set (see the subsampling sketch after the table) |
| Hardware Specification | Yes | All experiments were run on a server with 1 NVIDIA V100 GPU and 60GB of RAM. |
| Software Dependencies | No | The paper mentions using BERT and RoBERTa models, and that PyTorch is used for TracIn, but it does not specify exact version numbers for any of these software components or libraries. |
| Experiment Setup | Yes | We then fine-tune a BERT-small model on our training set, which leads to 96% accuracy. [...] For each example x0 in the test set, we remove the top-k proponents and top-k opponents in the training set respectively, and retrain the model to obtain DEL+(x0, k, I) and DEL−(x0, k, I) for each influence method I. We vary k over {10, 20, ..., 100}. For each k, we retrain the model 10 times and take the average result, and then average over the 40 test points. (See the evaluation sketch below.) |
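The Dataset Splits row reports random subsampling to 50,000 training and 20,000 validation examples. Below is a minimal sketch of that subsampling, assuming the Hugging Face `datasets` library and AG News as the example corpus; the seed and the loading code are assumptions, since the paper does not publish them in this table.

```python
# Hypothetical subsampling sketch: 50,000 train / 20,000 validation examples,
# as stated in the Dataset Splits row. The dataset choice (AG News via the
# Hugging Face hub) and the seed are assumptions, not from the paper.
from datasets import load_dataset

SEED = 0  # assumed seed; the paper does not state one

ds = load_dataset("ag_news")
shuffled = ds["train"].shuffle(seed=SEED)            # AG News has 120,000 training examples
train_subset = shuffled.select(range(50_000))        # random 50,000 for training
val_subset = shuffled.select(range(50_000, 70_000))  # disjoint 20,000 for validation
```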
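The Experiment Setup row outlines the paper's case deletion evaluation. The sketch below illustrates that loop under stated assumptions: `train_model` and `loss_on` are hypothetical helpers (the actual training and influence code is in the linked repository), and the loss-change metric is a stand-in for the paper's deletion score.

```python
# Sketch of the case deletion evaluation: remove the top-k proponents or
# opponents of a test point under an influence method, retrain, and measure
# the change on that test point. `train_model` and `loss_on` are hypothetical.
import numpy as np

def case_deletion_curve(train_set, x_test, influence_scores, k_values,
                        train_model, loss_on, n_retrain=10, proponents=True):
    """DEL+ (proponents=True) removes top-k proponents of x_test;
    DEL- (proponents=False) removes top-k opponents."""
    base_loss = loss_on(train_model(train_set), x_test)
    scores = np.asarray(influence_scores)
    # Proponents have the largest influence scores on x_test, opponents the smallest.
    order = np.argsort(-scores) if proponents else np.argsort(scores)
    curve = []
    for k in k_values:  # the paper varies k over {10, 20, ..., 100}
        removed = {int(i) for i in order[:k]}
        kept = [ex for i, ex in enumerate(train_set) if i not in removed]
        # Retrain several times (the paper uses 10 runs) and average
        # to smooth out training noise.
        deltas = [loss_on(train_model(kept), x_test) - base_loss
                  for _ in range(n_retrain)]
        curve.append(float(np.mean(deltas)))
    return curve
```

Per the setup described above, these per-test-point curves are then averaged over the 40 test points; a better influence method should produce larger changes when its top-ranked proponents or opponents are deleted.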