Overthinking the Truth: Understanding how Language Models Process False Demonstrations

Authors: Danny Halawi, Jean-Stanislas Denain, Jacob Steinhardt

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We study harmful imitation through the lens of a model's internal representations, and identify two related phenomena: overthinking and false induction heads. The first phenomenon, overthinking, appears when we decode predictions from intermediate layers, given correct vs. incorrect few-shot demonstrations. [...] To investigate this, we set up a contrast task, where models are provided either correct or incorrect labels for few-shot classification (Figure 1, left). We study the difference between these two settings by decoding from successively later layers of the residual stream (Nostalgebraist, 2020) (Figure 1, center). (A minimal early-decoding sketch is given in the first code example after this table.)
Researcher Affiliation | Academia | Danny Halawi, Jean-Stanislas Denain, and Jacob Steinhardt, UC Berkeley, {dhalawi,js_denain,jsteinhardt}@berkeley.edu
Pseudocode | No | The paper does not include any pseudocode or algorithm blocks. It describes the methods in narrative form and with mathematical equations.
Open Source Code | Yes | All code needed to reproduce our results can be found at https://github.com/dannyallover/overthinking_the_truth
Open Datasets | Yes | We consider fourteen text classification datasets: SST-2 (Socher et al., 2013), Poem Sentiment (Sheng & Uthus, 2020), Financial Phrasebank (Malo et al., 2014), Ethos (Mollas et al., 2020), TweetEval-Hate, -Atheism, and -Feminist (Barbieri et al., 2020), Medical Questions Pairs (McCreery et al., 2020), MRPC (Wang et al., 2019), SICK (Marelli et al., 2014), RTE (Wang et al., 2019), AGNews (Zhang et al., 2015), TREC (Voorhees & Tice, 2000), and DBpedia (Zhang et al., 2015). (A dataset-loading sketch is given in the second code example after this table.)
Dataset Splits | No | The paper describes using few-shot learning and evaluation metrics like 'calibrated classification accuracy' and 'sampling demonstration labels with equal probability', but it does not specify explicit training, validation, or test dataset splits (e.g., percentages or counts) or refer to standard predefined splits for the experiments performed.
Hardware Specification | No | The paper does not specify the hardware used for its experiments (e.g., specific GPU or CPU models, memory sizes). It mentions evaluating models like GPT-J-6B, GPT2-XL-1.5B, etc., but not the computational resources used to run them.
Software Dependencies | No | The paper mentions evaluating models like GPT-J-6B, GPT2-XL, and Pythia models. However, it does not specify any software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow, CUDA versions).
Experiment Setup | No | The paper mentions 'sampling k demonstrations', 'vary k from 0 to 40', and 'sample 1000 sequences'. It also describes the 'logit lens' method and ablating attention heads. However, it does not provide specific hyperparameter values such as learning rates, batch sizes, number of epochs, or optimizer settings, which are typically found in experimental setup details. (A head-ablation sketch is given in the third code example after this table.)
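
The contrast task and early-decoding procedure quoted in the Research Type row can be illustrated with a short logit-lens sketch. The snippet below is a minimal sketch assuming a GPT-2-style Hugging Face causal LM; the prompt, label words, and helper function are illustrative choices, not the authors' code (the paper evaluates larger models such as GPT-J-6B).

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Small stand-in model; the paper evaluates GPT-J-6B, GPT2-XL, and Pythia models.
    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    def per_layer_label_probs(prompt, labels):
        """Decode P(label) from each layer's residual stream at the final position."""
        inputs = tok(prompt, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
        # First sub-token of each label word, with a leading space (GPT-2 tokenization).
        label_ids = [tok(" " + w, add_special_tokens=False).input_ids[0] for w in labels]
        probs = []
        for h in out.hidden_states[1:]:                 # residual stream after each block
            h_last = model.transformer.ln_f(h[0, -1])   # logit lens: final LayerNorm ...
            logits = model.lm_head(h_last)              # ... then the unembedding matrix
            p = torch.softmax(logits, dim=-1)[label_ids]
            probs.append((p / p.sum()).tolist())        # renormalize over the label set
        return probs

    # Contrast task: identical demonstrations with correct vs. flipped labels.
    correct = ("Review: a great movie\nSentiment: positive\n"
               "Review: an awful film\nSentiment: negative\n"
               "Review: I loved it\nSentiment:")
    flipped = ("Review: a great movie\nSentiment: negative\n"
               "Review: an awful film\nSentiment: positive\n"
               "Review: I loved it\nSentiment:")
    for name, prompt in [("correct", correct), ("incorrect", flipped)]:
        print(name, per_layer_label_probs(prompt, ["positive", "negative"]))

Comparing the two per-layer probability curves is the correct-vs-incorrect contrast the table row refers to.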
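All fourteen datasets in the Open Datasets row are publicly available. The sketch below loads a few of them with the Hugging Face datasets library; the hub identifiers and configuration names are assumptions, not something the paper specifies.

    from datasets import load_dataset

    # Hub identifiers below are assumptions; the paper does not give loading code.
    sst2 = load_dataset("glue", "sst2")                              # SST-2
    poem = load_dataset("poem_sentiment")                            # Poem Sentiment
    finance = load_dataset("financial_phrasebank", "sentences_allagree")
    agnews = load_dataset("ag_news")                                 # AGNews
    trec = load_dataset("trec")                                      # TREC

    print(sst2["train"][0])   # e.g. {'sentence': ..., 'label': ..., 'idx': ...}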
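The Experiment Setup row also mentions ablating attention heads. A simple stand-in for such an experiment is zero-ablation through the head_mask argument of Hugging Face GPT-2 models, sketched below; the layer/head indices are hypothetical, and the paper's actual ablation procedure and choice of heads may differ.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    # Build a (num_layers x num_heads) mask and zero out a few hypothetical heads.
    head_mask = torch.ones(model.config.n_layer, model.config.n_head)
    for layer, head in [(9, 3), (10, 7)]:     # placeholder indices, not from the paper
        head_mask[layer, head] = 0.0          # zero-ablate this head's attention pattern

    prompt = ("Review: a great movie\nSentiment: negative\n"   # one flipped demonstration
              "Review: I loved it\nSentiment:")
    inputs = tok(prompt, return_tensors="pt")
    label_ids = [tok(" " + w, add_special_tokens=False).input_ids[0]
                 for w in ("positive", "negative")]

    with torch.no_grad():
        baseline = model(**inputs).logits[0, -1]
        ablated = model(**inputs, head_mask=head_mask).logits[0, -1]

    for name, logits in [("baseline", baseline), ("ablated", ablated)]:
        p = torch.softmax(logits, dim=-1)[label_ids]
        print(name, (p / p.sum()).tolist())   # P(positive), P(negative) over the label set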