Overthinking the Truth: Understanding how Language Models Process False Demonstrations
Authors: Danny Halawi, Jean-Stanislas Denain, Jacob Steinhardt
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We study harmful imitation through the lens of a model's internal representations, and identify two related phenomena: overthinking and false induction heads. The first phenomenon, overthinking, appears when we decode predictions from intermediate layers, given correct vs. incorrect few-shot demonstrations. [...] To investigate this, we set up a contrast task, where models are provided either correct or incorrect labels for few-shot classification (Figure 1, left). We study the difference between these two settings by decoding from successively later layers of the residual stream (Nostalgebraist, 2020) (Figure 1, center). |
| Researcher Affiliation | Academia | Danny Halawi, Jean-Stanislas Denain, and Jacob Steinhardt; UC Berkeley; {dhalawi,js_denain,jsteinhardt}@berkeley.edu |
| Pseudocode | No | The paper does not include any pseudocode or algorithm blocks. It describes the methods in narrative form and with mathematical equations. |
| Open Source Code | Yes | All code needed to reproduce our results can be found at https://github.com/dannyallover/overthinking_the_truth |
| Open Datasets | Yes | We consider fourteen text classification datasets: SST-2 (Socher et al., 2013), Poem Sentiment (Sheng & Uthus, 2020), Financial Phrasebank (Malo et al., 2014), Ethos (Mollas et al., 2020), TweetEval-Hate, -Atheism, and -Feminist (Barbieri et al., 2020), Medical Question Pairs (McCreery et al., 2020), MRPC (Wang et al., 2019), SICK (Marelli et al., 2014), RTE (Wang et al., 2019), AGNews (Zhang et al., 2015), TREC (Voorhees & Tice, 2000), and DBpedia (Zhang et al., 2015). |
| Dataset Splits | No | The paper describes its few-shot evaluation procedure (e.g., 'calibrated classification accuracy' and sampling demonstration labels with equal probability), but it does not specify explicit training, validation, or test dataset splits (e.g., percentages or counts), nor does it refer to standard predefined splits for the experiments performed. |
| Hardware Specification | No | The paper does not specify the hardware used for its experiments (e.g., specific GPU or CPU models, memory sizes). It mentions evaluating models like GPT-J-6B, GPT2-XL-1.5B, etc., but not the computational resources used to run them. |
| Software Dependencies | No | The paper mentions evaluating models like GPT-J-6B, GPT2-XL, and Pythia models. However, it does not specify any software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow, CUDA versions). |
| Experiment Setup | No | The paper mentions 'sampling k demonstrations', varying k from 0 to 40, and sampling 1000 sequences. It also describes the 'logit lens' method and ablating attention heads. However, it does not provide specific hyperparameter values such as learning rates, batch sizes, number of epochs, or optimizer settings, which are typically found in experiment setup details. |
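
The "Research Type" row above quotes the paper's contrast-task setup: the same few-shot prompt is evaluated with correct vs. incorrect demonstration labels, and predictions are decoded from intermediate layers of the residual stream with the logit lens. The sketch below illustrates that decoding step on a toy sentiment prompt. It is a minimal approximation, not the authors' released pipeline (which lives at the GitHub repository listed under "Open Source Code"); the `gpt2` model choice, the toy SST-2-style prompts, and the single-token label handling are assumptions made for illustration.

```python
# Minimal logit-lens sketch for the correct-vs-incorrect contrast task.
# Assumptions: gpt2 as a stand-in model, toy sentiment prompts, and the
# first token of " positive" as the label token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def label_prob_per_layer(prompt: str, label: str) -> list[float]:
    """Decode the next-token distribution from each residual-stream layer
    (logit lens) and return the probability of `label` at every depth."""
    label_id = tok(" " + label)["input_ids"][0]
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    probs = []
    for h in out.hidden_states:  # embedding output + one entry per layer
        # Apply the final layer norm, then the unembedding, to the last position.
        # (HF already applies ln_f to the final entry; re-applying it there is a
        # harmless simplification for this sketch.)
        h_last = model.transformer.ln_f(h[:, -1, :])
        logits = model.lm_head(h_last)
        probs.append(torch.softmax(logits, dim=-1)[0, label_id].item())
    return probs

# Contrast task: identical demonstrations with correct vs. flipped labels.
correct = ("Review: great movie\nSentiment: positive\n"
           "Review: awful plot\nSentiment: negative\n"
           "Review: wonderful acting\nSentiment:")
flipped = ("Review: great movie\nSentiment: negative\n"
           "Review: awful plot\nSentiment: positive\n"
           "Review: wonderful acting\nSentiment:")

p_correct = label_prob_per_layer(correct, "positive")
p_flipped = label_prob_per_layer(flipped, "positive")
for layer, (pc, pf) in enumerate(zip(p_correct, p_flipped)):
    print(f"layer {layer:2d}  P(positive | correct)={pc:.3f}  P(positive | flipped)={pf:.3f}")
```

Comparing the two probability curves layer by layer is the shape of the paper's "overthinking" measurement; the released code additionally calibrates accuracy and averages over many sampled prompts per dataset.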
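The "Experiment Setup" row also notes that the paper ablates attention heads (the "false induction heads" analysis). Below is a minimal zero-ablation sketch that masks one head's contribution via a forward pre-hook on GPT-2's attention output projection. The layer and head indices are hypothetical placeholders, not the heads identified in the paper, and the authors' released code may implement ablation differently (e.g., mean ablation or a different hook point).

```python
# Minimal attention-head zero-ablation sketch via a forward pre-hook on c_proj.
# Assumptions: gpt2 as a stand-in model; layer=9, head=6 are placeholder indices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

n_heads = model.config.n_head
head_dim = model.config.n_embd // n_heads

def ablate_head(layer: int, head: int):
    """Zero the given head's contribution by masking its slice of the input
    to the attention output projection (attn.c_proj)."""
    def pre_hook(module, args):
        (x,) = args  # (batch, seq, n_embd): concatenated per-head outputs
        x = x.clone()
        x[..., head * head_dim:(head + 1) * head_dim] = 0.0
        return (x,)
    return model.transformer.h[layer].attn.c_proj.register_forward_pre_hook(pre_hook)

prompt = "Review: wonderful acting\nSentiment:"
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    base_logits = model(**inputs).logits[0, -1]

handle = ablate_head(layer=9, head=6)  # placeholder head, not from the paper
with torch.no_grad():
    ablated_logits = model(**inputs).logits[0, -1]
handle.remove()  # restore the unablated model

top_base = tok.decode(base_logits.argmax().item())
top_ablated = tok.decode(ablated_logits.argmax().item())
print(f"top next token: base={top_base!r}  ablated={top_ablated!r}")
```

In this style of experiment, the ablation is run over many false-demonstration prompts and the change in late-layer accuracy is compared against ablating randomly chosen heads.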