Language Models Represent Beliefs of Self and Others

Authors: Wentao Zhu, Zhining Zhang, Yizhou Wang

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this study, we discover that it is possible to linearly decode the belief status from the perspectives of various agents through neural activations of language models, indicating the existence of internal representations of self and others' beliefs. By manipulating these representations, we observe dramatic changes in the model's ToM performance, underscoring their pivotal role in the social reasoning process. Additionally, our findings extend to diverse social reasoning tasks that involve different causal inference patterns, suggesting the potential generalizability of these representations. We evaluate the ToM capabilities of language models using the BigToM (Gandhi et al., 2023) benchmark. (Illustrative probing and intervention sketches follow the table.)
Researcher Affiliation | Academia | Wentao Zhu (1), Zhining Zhang (1), Yizhou Wang (1,2,3,4). 1: Center on Frontiers of Computing Studies, School of Computer Science, Peking University; 2: Inst. for Artificial Intelligence, Peking University; 3: Nat'l Eng. Research Center of Visual Technology, Peking University; 4: Nat'l Key Lab of General Artificial Intelligence, Peking University.
Pseudocode | No | No pseudocode or algorithm block was found.
Open Source Code | Yes | Project page: https://walter0807.github.io/RepBelief/
Open Datasets | Yes | We utilize the BigToM dataset (Gandhi et al., 2023), which is constructed with a causal template and an example scenario including prior desires, actions, beliefs, and a causal event that changes the state of the environment. In addition to the stories in BigToM (Gandhi et al., 2023), we explore whether our findings could generalize to other narratives. Following (Wilf et al., 2023), we extend our study to the ToMi benchmark (Le et al., 2019).
Dataset Splits | No | We train and evaluate the probes on a held-out subset without access to the stories in the test set of the benchmark. Figure 2 (A) and (B) display the validation accuracies of the linear probes. While the paper mentions a "held-out subset" and "validation accuracies", it does not specify the split sizes or percentages needed for reproduction (the 80/20 split in the probe sketch below the table is therefore an assumption).
Hardware Specification | No | No specific hardware details (such as GPU/CPU models, memory, or cloud instance types) used for running the experiments were mentioned in the paper.
Software Dependencies | No | We thank the awesome open-source toolbox nnsight (Fiotto-Kaufman), which is used to extract the Transformer internal representations. No version was specified for nnsight, and the only other software mentioned consists of models (Mistral-7B-Instruct, DeepSeek-LLM-7B-Chat) rather than general software dependencies with versions. (An illustrative nnsight extraction sketch follows the table.)
Experiment Setup | Yes | Both models are tested using the most deterministic setting with a temperature of 0, following (Gandhi et al., 2023). We set K and α with grid search following previous works, and present the ablations in Appendix G. We provide intervention results with different (K, α) combinations in Figure 18. K = 16 yields the most steady performance and is used in our experiments for the model. (Illustrative intervention and greedy-decoding sketches follow the table.)
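
The sketches below are illustrative, not the paper's released code. First, extracting internal representations with nnsight: a minimal sketch assuming nnsight's LanguageModel tracing API (circa v0.3) and Mistral's Hugging Face module layout; the layer index, token position, and story text are placeholders.

```python
# Minimal sketch: pull a hidden state with nnsight (tracing API, circa v0.3).
# Layer index, token position, and story are illustrative placeholders.
from nnsight import LanguageModel

model = LanguageModel("mistralai/Mistral-7B-Instruct-v0.2", device_map="auto")

story = "Noor puts the oat milk in the fridge. While she is away, a barista moves it."
with model.trace(story):
    # Residual-stream output of a middle layer at the final token position.
    hidden = model.model.layers[15].output[0][:, -1, :].save()

print(hidden.shape)  # (1, hidden_dim), e.g. (1, 4096) for Mistral-7B
```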
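
Given activations cached this way, a linear probe trained and evaluated on a held-out split could look like the following sketch. The data is synthetic and the 80/20 split is an assumption, since the paper does not report split sizes.

```python
# Sketch of a linear belief probe on cached activations. X and y are synthetic
# stand-ins; in practice X holds model activations per story and y the belief
# status (e.g., 1 = the agent believes the initial state still holds).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4096)).astype(np.float32)
y = rng.integers(0, 2, size=1000)

# Hold out stories so probe accuracy is measured on unseen data (80/20 assumed).
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("held-out accuracy:", probe.score(X_te, y_te))
# probe.coef_[0] supplies a belief direction for the intervention sketch below.
```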
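
The reported interventions manipulate belief representations with strength α on K selected components. The sketch below shows only the general idea of shifting activations along a probe direction; the shapes, the α value, and the single-direction simplification are assumptions, not the paper's exact procedure.

```python
# Sketch: steer hidden states along a unit-normalized belief direction.
# In the paper, K components are intervened on with strength alpha; here a
# single residual-stream shift stands in for that procedure.
import torch

def intervene(hidden: torch.Tensor, direction: torch.Tensor, alpha: float = 8.0) -> torch.Tensor:
    d = direction / direction.norm()
    return hidden + alpha * d  # broadcasts over batch and sequence dims

hidden = torch.randn(1, 24, 4096)   # (batch, seq_len, hidden_dim), illustrative
direction = torch.randn(4096)       # e.g., probe.coef_[0] from the probe above
steered = intervene(hidden, direction, alpha=8.0)
print(steered.shape)
```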
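
Finally, the temperature-0 setting corresponds to greedy decoding; with Hugging Face transformers that is do_sample=False, as in this sketch (model ID and prompt are placeholders):

```python
# Sketch: deterministic (temperature-0) evaluation via greedy decoding.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder model ID
tok = AutoTokenizer.from_pretrained(name)
lm = AutoModelForCausalLM.from_pretrained(name)

prompt = "Question: Does Noor believe the oat milk is in the fridge? Answer:"
inputs = tok(prompt, return_tensors="pt")
out = lm.generate(**inputs, max_new_tokens=16, do_sample=False)  # greedy
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```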