Confidence Regulation Neurons in Language Models

Authors: Alessandro Stolfo, Ben Wu, Wes Gurnee, Yonatan Belinkov, Xingyi Song, Mrinmaya Sachan, Neel Nanda

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This study investigates two critical components believed to influence this uncertainty: the recently discovered entropy neurons and a new set of components that we term token frequency neurons. ... We observe the presence of entropy neurons across a range of models, up to 7 billion parameters. On the other hand, token frequency neurons, which we discover and describe here for the first time, boost or suppress each token's logit proportionally to its log frequency, thereby shifting the output distribution towards or away from the unigram distribution. Finally, we present a detailed case study where entropy neurons actively manage confidence in the setting of induction, i.e. detecting and continuing repeated subsequences.
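The quoted abstract describes the token frequency mechanism abstractly; the minimal sketch below illustrates the described effect with a toy vocabulary. The counts, logits, and scaling coefficient `alpha` are hypothetical values chosen for illustration, not taken from the paper.

```python
import torch

# Toy vocabulary with hypothetical unigram counts (illustrative only).
counts = torch.tensor([500_000.0, 120_000.0, 30_000.0, 5_000.0, 500.0])
log_freq = torch.log(counts / counts.sum())   # log unigram frequencies

# Hypothetical next-token logits from a model.
logits = torch.tensor([1.2, 0.4, 2.0, -0.5, 0.3])

def shift_by_log_frequency(logits, log_freq, alpha):
    """Boost or suppress each token's logit proportionally to its log frequency.

    alpha > 0 moves the output distribution toward the unigram distribution;
    alpha < 0 moves it away from it.
    """
    return logits + alpha * log_freq

for alpha in (-0.5, 0.0, 0.5):
    probs = torch.softmax(shift_by_log_frequency(logits, log_freq, alpha), dim=-1)
    print(f"alpha={alpha:+.1f} ->", [round(p, 3) for p in probs.tolist()])
```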
Researcher Affiliation | Academia | Alessandro Stolfo (ETH Zürich), Ben Wu (University of Sheffield), Wes Gurnee (MIT), Yonatan Belinkov (Technion), Xingyi Song (University of Sheffield), Mrinmaya Sachan (ETH Zürich), Neel Nanda
Pseudocode | No | The paper describes various procedures and analyses, such as the calculation of total and direct effects or the SVD of the unembedding matrix, but it does not include any formally structured blocks or sections labeled “Pseudocode” or “Algorithm.”
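Although no pseudocode is provided, the SVD-based analysis mentioned in this row can be sketched: project a neuron's MLP output weight onto the lowest singular directions of the unembedding matrix and measure how much of its norm falls into that effective null space. This is a sketch assuming TransformerLens; the null-space cutoff `k` and the layer/neuron indices are illustrative assumptions, not values from the paper.

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")   # GPT-2 Small

W_U = model.W_U                  # [d_model, d_vocab] unembedding matrix
W_out = model.W_out              # [n_layers, d_mlp, d_model] MLP output weights

# Singular directions of the unembedding; columns of U with the smallest
# singular values span the "effective null space" on the residual-stream side.
U, S, Vh = torch.linalg.svd(W_U, full_matrices=False)
k = 30                           # assumed null-space size (not from the paper)
null_dirs = U[:, -k:]            # singular values are sorted in descending order

layer, neuron = 11, 584          # hypothetical candidate neuron (illustrative)
w_out = W_out[layer, neuron]     # the neuron's output direction in the residual stream
frac_in_null = (null_dirs.T @ w_out).norm() ** 2 / w_out.norm() ** 2
print(f"Fraction of w_out norm in the bottom-{k} singular directions: {frac_in_null.item():.3f}")
```

A neuron with a large output-weight norm but most of that norm concentrated in the low-singular-value directions would have little direct effect on the logits, which matches the entropy-neuron signature the paper describes.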
Open Source Code | Yes | Our code and data are available at https://github.com/bpwu1/confidence-regulation-neurons.
Open Datasets | Yes | We carry out all our experiments using data from the C4 Corpus [63], which is publicly available through an ODC-BY license. ... [63] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020. URL https://www.jmlr.org/papers/v21/20-074.html.
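The paper points to the public C4 corpus but does not record a specific loading pipeline; one common way to stream it is through the Hugging Face hub copy, shown below as an assumed, not confirmed, setup.

```python
from datasets import load_dataset

# Stream the English portion of C4 from the Hugging Face hub (assumed access
# path; the paper only states that the data come from the public C4 corpus).
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

for i, example in enumerate(c4):
    print(example["text"][:80])
    if i >= 2:
        break
```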
Dataset Splits | Yes | The experiments to measure the total and direct effects, both for entropy and token frequency neurons, were carried out on 256-token input sequences. We used 100 sequences for GPT-2 and Pythia 410M, 50 for Phi-2 and GPT-2 Medium, and 30 for LLaMA2 7B and Gemma 2B. The scatter plot in Figure 3a was obtained on 10,000 tokens. The results on the induction case study presented in Section 6.1 were obtained on 500 input sequences; the shaded area around each line represents the standard error.
Hardware Specification | Yes | The static analyses of the model weights were conducted on a MacBook Pro with 32GB of memory. The experiments carried out to quantify the neurons' effects were run on a single 80GB Nvidia A100.
Software Dependencies | No | Our experiments were carried out using PyTorch [59] and the TransformerLens library [53]. We performed our data analysis using NumPy [30] and Pandas [77]. Our figures were made using Plotly [37]. The paper lists several software tools and libraries used in the research but does not specify their exact version numbers (e.g., PyTorch 1.x, NumPy 1.x).
Experiment Setup | Yes | For each neuron, we measure the total and direct effects when its activation value is set to the mean across a dataset of 25,600 tokens from the C4 Corpus. ... Our initial step is to identify which neurons are entropy neurons. We do this by searching for neurons with a high weight norm and a minimal impact on the logits. To detect minimal effect on the logits, we follow the heuristic used by Gurnee et al. [28], analyzing the variance in the effect of neurons on the logits. ... We compare these metrics for six selected entropy neurons against those from 100 randomly selected neurons. ... To analyze this phenomenon, we create input sequences by selecting 100 tokens from C4 [63] and duplicating them to form a 200-token input sequence. Across 100 such sequences, we measure GPT-2 Small's performance and observe a significant decrease in both average loss and entropy during the second occurrence of the sequence (solid lines in Figure 5a).
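A minimal sketch, assuming TransformerLens (which the paper reports using), of the mean-ablation step described above: one neuron's activation is overwritten with its dataset mean, and the resulting change in next-token entropy is compared against the clean run. The layer/neuron indices, the mean value, and the prompt are placeholders rather than values from the paper, and only the simpler total-effect-style measurement is shown (the direct-effect variant, which restricts the intervention's influence to the path through the unembedding, is omitted).

```python
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")   # GPT-2 Small

layer, neuron = 11, 584    # hypothetical neuron indices (illustrative)
mean_act = 0.15            # placeholder for the mean activation over ~25,600 C4 tokens

def entropy(logits):
    # Shannon entropy of the next-token distribution at every position.
    log_probs = torch.log_softmax(logits, dim=-1)
    return -(log_probs.exp() * log_probs).sum(dim=-1)

def mean_ablate(acts, hook):
    # acts has shape [batch, pos, d_mlp]; overwrite one neuron with its mean.
    acts[:, :, neuron] = mean_act
    return acts

tokens = model.to_tokens("The quick brown fox jumps over the lazy dog")
clean_logits = model(tokens)
ablated_logits = model.run_with_hooks(
    tokens,
    fwd_hooks=[(utils.get_act_name("post", layer), mean_ablate)],
)

delta = (entropy(ablated_logits) - entropy(clean_logits)).mean().item()
print(f"Change in mean next-token entropy under mean ablation: {delta:.4f}")
```

For the induction case study, the same measurement would be run on sequences built by duplicating a 100-token C4 snippet into a 200-token input and comparing loss and entropy between the first and second occurrence.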