Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Towards Minimal Targeted Updates of Language Models with Targeted Negative Training
Authors: Lily H Zhang, Rajesh Ranganath, Arya Tafvizi
TMLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We consider two use cases for targeted negative training: reducing hallucinations and toxicity. All experiments utilize T5 base (220M parameters). First, we finetune T5 on the original training set. Then, we generate from the model given training and validation inputs and annotate the generations. Next, we use the annotated generations to update the model. To evaluate, we compute the prevalence of the unwanted behavior among the new model's generations on the test inputs, as well as similarity between the old and new models' generations. |
| Researcher Affiliation | Collaboration | Lily H. Zhang, New York University; Rajesh Ranganath, New York University; Arya Tafvizi, Google |
| Pseudocode | Yes | Algorithm 1 Targeted Negative Training. 1: Input: initial model p_o (already trained), inputs {c}_1^n, model outputs {x}_1^n, token annotations {a}_1^n denoting whether x_t ∈ supp(p_neg(· \| c, x_{<t})). 2: p_m ← p_o. 3: for each iteration do 4: Get p_m(· \| c, x_{<t}) for all c, x_{<t} in batch (forward pass of p_m). 5: Get p_o(· \| c, x_{<t}) for all c, x_{<t} in batch (forward pass of p_o). 6: Compute p_new(· \| c, x_{<t}) for all c, x_{<t} in batch (Equation (2)). 7: Calculate TNT loss (Equation (4)). 8: Calculate gradients for weights in p_m and update p_m. 9: end for 10: Return p_m |
| Open Source Code | Yes | Code for tnt can be found at https://github.com/google/t5patches. |
| Open Datasets | Yes | We use the XSUM dataset (Narayan et al., 2018) for the reducing hallucination task and Civil Comments (Borkan et al., 2019) for the reducing offensive phrases task. ... To label text spans as toxic, we train a token-level toxicity classifier on the Civil Comments Spans dataset (Pavlopoulos et al., 2021). |
| Dataset Splits | Yes | For the hallucination experiment, we use the XSUM train, validation, and test splits. The dataset sizes for train, validation, and test are 203,577, 11,305, and 11,301. ... The resulting train, validation, and test (unused) sets are of size 175,754, 21,974, and 22,009. |
| Hardware Specification | Yes | For all experiments, we use Google Cloud v4 TPU pods. |
| Software Dependencies | No | The paper mentions using "Spacy's CNN-based named entity recognition (NER) model" but does not provide a specific version number. No other specific software dependencies with version numbers are mentioned. |
| Experiment Setup | Yes | For all runs, we use a batch size of 32, dropout rate of 0.1, and no label smoothing. For all runs, the cross entropy loss includes the square of the logsumexp of the logits as a penalty, scaled by a factor of 0.0001. ... For the initial finetuning, we train a base T5 model with learning rate 1e-3 and select the best checkpoint every 10,000 steps based on validation loss. Our resulting models are finetuned for 30,000 steps on XSUM and 40,000 steps on Civil Comments. For the updates and alternative finetuning, we run a sweep across four learning rates (1e-3, 1e-4, 1e-5, 1e-6) and choose the best model per every 1,000 steps based on validation loss. We run updates for a total of 100,000 steps for the T5 model, and 200,000 steps for the PaLM-2 1B model. The learning rates used for the various methods are as follows: (Table 2 lists the specific learning rates per method). |
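The Algorithm 1 excerpt above can be sketched as a loss function. The following is a minimal NumPy illustration, not the authors' t5patches implementation; it assumes Equation (2) forms p_new by zeroing the annotated negative tokens in p_o and renormalizing, and that Equation (4) is the cross entropy between p_new and p_m. All function and variable names are hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def tnt_loss(logits_m, logits_o, neg_mask):
    """Sketch of the TNT objective in Algorithm 1.

    logits_m: (B, T, V) logits of the model being updated, p_m
    logits_o: (B, T, V) logits of the frozen initial model, p_o
    neg_mask: (B, T, V) boolean, True for vocabulary entries annotated
              as unwanted at that position (x_t in supp(p_neg))
    """
    p_o = softmax(logits_o)
    p_new = p_o * (~neg_mask)                       # zero out negative tokens (Eq. 2)
    p_new = p_new / p_new.sum(-1, keepdims=True)    # renormalize to a distribution
    log_p_m = np.log(softmax(logits_m) + 1e-12)
    # cross entropy between the target p_new and the updated model p_m (Eq. 4)
    return -(p_new * log_p_m).sum(-1).mean()
```

In a real training loop, the gradient of this loss with respect to the parameters producing `logits_m` would be taken (steps 7–8 of the algorithm), while `logits_o` stays frozen.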
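The Experiment Setup row mentions that the cross entropy loss adds "the square of the logsumexp of the logits as a penalty, scaled by a factor of 0.0001". This auxiliary term is commonly called the z-loss in T5-style training; a small NumPy sketch (function name hypothetical):

```python
import numpy as np

def logit_penalty(logits, scale=1e-4):
    """Auxiliary penalty: scale * logsumexp(logits)**2, averaged
    over positions, as described in the experiment setup."""
    m = logits.max(axis=-1, keepdims=True)
    lse = m.squeeze(-1) + np.log(np.exp(logits - m).sum(axis=-1))
    return scale * np.mean(lse ** 2)
```

The penalty discourages the logits from drifting to large magnitudes, which helps keep softmax computations numerically stable during long finetuning runs.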