Post Hoc Explanations of Language Models Can Improve Language Models
Authors: Satyapriya Krishna, Jiaqi Ma, Dylan Slack, Asma Ghandeharioun, Sameer Singh, Himabindu Lakkaraju
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experimentation with real-world datasets demonstrates that our framework, AMPLIFY, leads to prediction accuracy improvements of about 10-25% over a wide range of tasks, including those where prior approaches which rely on human-annotated rationales such as Chain-of-Thought prompting fall short. |
| Researcher Affiliation | Collaboration | Satyapriya Krishna1, Jiaqi Ma2, Dylan Slack3, Asma Ghandeharioun4, Sameer Singh3, and Himabindu Lakkaraju1 1Harvard University 2University of Illinois Urbana-Champaign 3University of California, Irvine 4Google Inc |
| Pseudocode | No | The paper describes a four-step approach in text and a flowchart (Figure 1), but does not present formal pseudocode or an algorithm block. |
| Open Source Code | No | The paper mentions using open-sourced models like GPT-2 and BERT, but there is no explicit statement or link indicating that the authors have open-sourced the code for their AMPLIFY framework. |
| Open Datasets | Yes | We evaluate our framework AMPLIFY on some of the popular datasets from the Big-Bench Hard [29] benchmark. More specifically, we experiment with: (1) The Snarks [29] dataset which gauges a model's proficiency in discerning sarcastic sentences from a selection of alternatives; (2) The Causal Judgment [29] dataset, designed to evaluate a model's ability in accurately deducing the causative factors of an event from a detailed summary; (3) The Ruin Names [29] task, which involves the identification of comical modifications to artist or movie names; (4) The Formal Fallacies [29] task, where machine learning models are put to the test to distinguish between logically sound arguments and those with logical discrepancies; (5) The Salient Translation Error Detection [29] task, engineered to train models in identifying one out of six predetermined translation errors given a pair of translations; (6) The Commonsense QA [32] dataset, a multiple-choice question answering platform that necessitates a wide variety of commonsense knowledge for accurately determining the correct responses; (7) Lastly, the Coin Flip [35] dataset, a synthetically generated dataset used for assessing the symbolic reasoning capacity of LLMs. |
| Dataset Splits | Yes | To this end, we first identify instances from the validation set that are misclassified by the LLM. We then rank these instances using a metric we introduce called the Misclassification Confidence Score (MCS). (A hedged sketch of this MCS-based ranking appears below the table.) |
| Hardware Specification | No | The paper discusses the scale of the language models used and their computational intensity, but it does not specify the particular hardware (e.g., GPU models, CPU types) used for conducting their experiments. |
| Software Dependencies | No | The paper mentions various models and explanation methods (GPT-2, BERT, Gradient x Input), but does not provide specific version numbers for the software dependencies (e.g., Python, deep learning frameworks, or libraries) used for implementation. |
| Experiment Setup | Yes | In the case of AMPLIFY, we employed GPT-2 [22] fine-tuned for the target task as the proxy model for step 1, unless stated otherwise. We utilized a rationale template with k = 5, which is of the form: 'The key words: word1, word2, ...and word5 are important clues to predict [ground truth label] as the correct answer'. To compute these attribution scores, we used Gradient x Input as the default post hoc explanation method for generating explanations. Recall that AMPLIFY has two other primary hyper-parameters apart from the rationale template choice discussed in our empirical findings, namely, s, which is the size of the few-shot prompt created for LLMs, and k, which is the number of most important tokens identified by the post hoc explanation. (A hedged sketch of this attribution-to-rationale step appears below the table.) |
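
The MCS-based selection quoted in the Dataset Splits row can be illustrated with a short sketch. The excerpt above does not spell out the formula, so the definition used here (softmax probability of the incorrectly predicted label minus that of the ground-truth label) is an assumption inferred from the description; `predict_logits` and `validation_samples` are hypothetical stand-ins, not the authors' code.

```python
import torch
import torch.nn.functional as F

def misclassification_confidence_score(logits: torch.Tensor, true_label: int) -> float:
    """Assumed definition: softmax probability of the (incorrect) predicted label
    minus the softmax probability of the ground-truth label. Higher values mean
    the model was more confidently wrong on this validation example."""
    probs = F.softmax(logits, dim=-1)
    predicted = int(torch.argmax(probs))
    return float(probs[predicted] - probs[true_label])

def rank_misclassified(validation_samples, predict_logits):
    """Keep only misclassified validation samples and sort them by MCS, highest first.
    `validation_samples` is a hypothetical list of (text, true_label) pairs and
    `predict_logits` a hypothetical callable returning a 1-D logits tensor."""
    scored = []
    for text, true_label in validation_samples:
        logits = predict_logits(text)
        if int(torch.argmax(logits)) != true_label:
            scored.append((misclassification_confidence_score(logits, true_label), text, true_label))
    return sorted(scored, key=lambda item: item[0], reverse=True)
```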
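
The Experiment Setup row describes turning Gradient x Input attributions from the proxy model into the k = 5 rationale template. Below is a minimal sketch of that step, assuming a HuggingFace-style sequence classifier (e.g., a fine-tuned GPT-2 or BERT head) whose forward pass accepts `inputs_embeds`; the helper names are illustrative and exact interfaces vary by library version.

```python
import torch

def gradient_x_input_top_k(model, tokenizer, text, target_label, k=5):
    """Gradient x Input attribution: per-token importance = sum over the embedding
    dimension of (input embedding * gradient of the target-label logit).
    `model` is assumed to be a HuggingFace-style sequence classifier."""
    enc = tokenizer(text, return_tensors="pt")
    embeds = model.get_input_embeddings()(enc["input_ids"]).detach().requires_grad_(True)
    logits = model(inputs_embeds=embeds, attention_mask=enc["attention_mask"]).logits
    logits[0, target_label].backward()                       # gradients w.r.t. the embeddings
    scores = (embeds * embeds.grad).sum(dim=-1).squeeze(0)   # one attribution score per token
    top_ids = scores.topk(min(k, scores.numel())).indices
    return [tokenizer.convert_ids_to_tokens(int(enc["input_ids"][0, i])) for i in top_ids]

def rationale(words, label):
    """Fill the paper's rationale template with the k most important words."""
    return (f"The key words: {', '.join(words[:-1])} and {words[-1]} "
            f"are important clues to predict {label} as the correct answer")
```

In the full pipeline described by the paper, rationales produced this way for the s selected validation samples would be assembled into the few-shot prompt handed to the LLM.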