Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Emergence of Linear Truth Encodings in Language Models

Authors: Shauli Ravfogel, Gilad Yehudai, Tal Linzen, Joan Bruna, Alberto Bietti

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We corroborate this pattern with experiments in pretrained language models. We test the truth-co-occurrence hypothesis (TCH) in a minimal transformer-like model with a single self-attention layer, one head, a normalization layer, and no MLP. Training examples are four-token sequences x y x′ y′ with subjects x, x′ ("The capital city of France"; "Churchill's nationality") and attributes y, y′ ("Paris"; "British"); with probability ρ, the attributes y, y′ are both the correct attribute; with probability 1 − ρ, they are replaced with a random one. When we train LMs on such a dataset, we find that after the key-value lookup circuit forms, gradient descent pushes hidden states toward a linear separator that clusters true vs. false contexts, and the model uses it to modify its confidence when predicting the attribute. Training shows two phases: rapid key-value acquisition followed by slower emergence of linear encoding. Although our toy model is far simpler than natural training data (see Section 6), it predicts the observed sensitivity to false context (Section 5.3), where false prefixes bias later predictions (supporting TCH), and reproduces the way normalization layers regulate confidence [Stolfo et al., 2024]. Taken together, we show that linear truth encoding can arise without any built-in semantics.
Researcher Affiliation | Academia | Shauli Ravfogel¹, Gilad Yehudai¹, Tal Linzen¹, Joan Bruna¹, Alberto Bietti²; ¹New York University, ²Flatiron Institute
Pseudocode | No | The paper describes the model architecture and training dynamics using mathematical equations and textual descriptions (e.g., in Section 4 and Appendix E.1), but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | We release the code in https://github.com/shauli-ravfogel/truth-encoding-neurips.
Open Datasets | Yes | To quantify that, we use the MAVEN-FACT corpus [Li et al., 2024a], where annotators assign a FactBank-style factuality label to every event mention inside a news article. We evaluate on the CounterFact dataset [Meng et al., 2022], a collection of simple factual assertions spanning relations such as SPEAKSLANGUAGE and BORNIN.
Dataset Splits | Yes | We fit individual classifiers both on the first attribute position (y) and on the second subject position (x′), from which the second attribute y′ is predicted. While in training the LM we use a varying true-attribute rate ρ, the linear classifiers are always trained and evaluated on a balanced set, containing 50% true sequences. We use the train split of MAVEN-FACT v1.0 (73,939 event mentions drawn from 2,913 news articles).
Hardware Specification | Yes | We run all experiments on 4 NVIDIA GeForce GTX 1080 GPUs.
Software Dependencies | No | The model is trained for 50,000 batches of size 128 and is optimized with the Adam optimizer [Kingma and Ba, 2015] with a learning rate of 1e-4 and a weight decay of 1e-5. We do not include biases in the attention modules, and use RMSNorm as layer normalization. While the Adam optimizer is mentioned, specific version numbers for software libraries or the programming language used are not provided.
Experiment Setup | Yes | Our experiments use β = d, due to the use of RMS norm in layer-norm over embeddings of dimension d. The model is trained for 50,000 batches of size 128 and is optimized with the Adam optimizer [Kingma and Ba, 2015] with a learning rate of 1e-4 and a weight decay of 1e-5. We experiment with true-attribute rates ρ, and with l ∈ {1, 2, 3} layers, and assume a perfect correlation between the truthfulness of the first and second attributes. Unless specified otherwise, we present here results for l = 1 and ρ = 0.99, |A| = |S| = 512 and d_model = 256. For the natural language experiments, we train a small transformer with RMS normalization, 2 attention heads and a single MLP module per layer, hidden size d = 256, and depth l ∈ {2, 5, 9} on this corpus. We use ρ = 0.99. We train on data from a single relation at a time, and report mean and standard deviations over 5 random relations.
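
The synthetic data described in the Research Type and Experiment Setup rows (four-token sequences, true-attribute rate ρ, perfectly correlated truthfulness of the two attributes, and a balanced 50% true set for the linear classifiers) could be generated roughly as in the sketch below. This is an illustration only, not the authors' released code. The function names, the integer encoding of subjects and attributes, the one-correct-attribute-per-subject knowledge base, and the choice to draw false attributes so that they always differ from the correct one are all assumptions made for this example.

```python
import random

NUM_SUBJECTS = 512    # |S| from the reported setup
NUM_ATTRIBUTES = 512  # |A| from the reported setup


def make_knowledge_base(rng):
    """Assumption: every subject has exactly one correct attribute."""
    return [rng.randrange(NUM_ATTRIBUTES) for _ in range(NUM_SUBJECTS)]


def wrong_attribute(rng, correct):
    """Draw a uniformly random attribute that differs from the correct one.

    Assumption: the source only says "a random one"; excluding the
    correct attribute keeps the true/false labels unambiguous here.
    """
    y = rng.randrange(NUM_ATTRIBUTES - 1)
    return y if y < correct else y + 1


def make_sequence(rng, kb, is_true):
    """Build one sequence (x, y, x', y').

    Both attributes share the same truth value, mirroring the perfect
    correlation between the first and second attributes.
    """
    x1 = rng.randrange(NUM_SUBJECTS)
    x2 = rng.randrange(NUM_SUBJECTS)
    if is_true:
        y1, y2 = kb[x1], kb[x2]
    else:
        y1 = wrong_attribute(rng, kb[x1])
        y2 = wrong_attribute(rng, kb[x2])
    return (x1, y1, x2, y2)


def sample_training_example(rng, kb, rho=0.99):
    """Sample an LM training sequence that is true with probability rho."""
    is_true = rng.random() < rho
    return make_sequence(rng, kb, is_true), is_true


def balanced_probe_set(rng, kb, n):
    """Return n labelled sequences, half true and half false, shuffled."""
    examples = [(make_sequence(rng, kb, i % 2 == 0), i % 2 == 0)
                for i in range(n)]
    rng.shuffle(examples)
    return examples
```

With a seeded generator such as `rng = random.Random(0)` and `kb = make_knowledge_base(rng)`, `sample_training_example(rng, kb)` gives sequences for LM training at ρ = 0.99, while `balanced_probe_set(rng, kb, 1000)` gives a 50% true set of the kind the linear classifiers are reported to use.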