Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models

Authors: Javier Ferrando, Oscar Obeso, Senthooran Rajamanoharan, Neel Nanda

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Using sparse autoencoders as an interpretability tool, we discover that a key part of these mechanisms is entity recognition, where the model detects if an entity is one it can recall facts about. ... We use Gemma Scope (Lieberum et al., 2024) ... and find internal representations that suggest to encode knowledge awareness in Gemma 2 2B and 9B. ... We use a test set sample of 100 questions about unknown entities, and measure the number of times the model refuses by steering (as in Equation (4)). ... To do the analysis, we perform activation patching (Geiger et al., 2020; Vig et al., 2020; Meng et al., 2022a) on the residual streams and attention heads outputs..."
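The activation-patching procedure quoted above (cache activations from a clean run, then replay them at a chosen site during a corrupted run) can be sketched as follows. `run_model` and its `overrides` argument are hypothetical stand-ins for illustration, not the paper's actual code:

```python
def activation_patch(run_model, clean_input, corrupt_input, site):
    """Minimal sketch of activation patching.

    `run_model(x, overrides)` is a hypothetical interface that returns
    (logits, activations_dict) and substitutes any activation listed in
    `overrides`. We cache the clean activation at `site`, then re-run on
    the corrupted input with that activation patched in; comparing the
    result to the unpatched corrupted run localizes the causal effect.
    """
    _, clean_acts = run_model(clean_input, overrides={})
    patched_logits, _ = run_model(
        corrupt_input, overrides={site: clean_acts[site]}
    )
    return patched_logits
```

The same scaffold works for residual streams or attention-head outputs; only the `site` key changes.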
Researcher Affiliation | Academia | Javier Ferrando (1,2), Oscar Obeso (3), Senthooran Rajamanoharan, Neel Nanda. Affiliations: 1 U. Politècnica de Catalunya; 2 Barcelona Supercomputing Center; 3 ETH Zürich
Pseudocode | No | The paper describes methods through mathematical equations (e.g., Equations (1), (2), (4), (6), (9), and (12)) and narrative text, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | "We make the codebase available at https://github.com/javiferran/sae_entities."
Open Datasets | Yes | "To study how language models reflect knowledge awareness about entities, we build a dataset with four different entity types: (basketball) players, movies, cities, and songs from Wikidata (Vrandečić & Krötzsch, 2024)."
Dataset Splits | Yes | "Finally, we split the entities into train/validation/test (50%, 10%, 40%) sets."
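The quoted 50%/10%/40% split amounts to a shuffle followed by proportional slicing. A minimal sketch, assuming nothing about the released codebase (the function name and seeding are illustrative):

```python
import random

def split_entities(entities, seed=0):
    """Shuffle and split entities into train/validation/test sets with
    the 50%/10%/40% proportions quoted above. Splitting at the entity
    level (rather than per example) avoids leaking the same entity
    across splits."""
    rng = random.Random(seed)
    shuffled = list(entities)
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(0.5 * n)
    n_val = int(0.1 * n)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test
```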
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processors, or memory specifications used for running the experiments. It only mentions using the Gemma 2 2B, 9B, and Llama 3.1 8B models.
Software Dependencies | No | The paper mentions using "Gemma Scope" and "Llama Scope", which are suites of sparse autoencoders, and references "Neuronpedia". However, it does not provide specific version numbers for these tools or for other key software libraries such as PyTorch, TensorFlow, or Python.
Experiment Setup | Yes | "We use a validation set to select an appropriate steering coefficient α. ... We select α ∈ [400, 550], which corresponds to around two times the norm of the residual stream in the layers where the entity recognition latents are present (Appendix E)."
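The steering setup quoted above (add a scaled latent direction to the residual stream, with α chosen at roughly twice the residual-stream norm) can be sketched as follows. Shapes, function names, and the use of NumPy are illustrative assumptions, not the paper's actual hook code:

```python
import numpy as np

def steer(resid, direction, alpha):
    """Add a steering vector alpha * (unit direction) to the residual
    stream, as in the latent-steering intervention described above.
    resid: array of shape (..., d_model); direction: (d_model,)."""
    unit = direction / np.linalg.norm(direction)
    return resid + alpha * unit

def pick_alpha(resid, factor=2.0):
    """Heuristic mirroring the quoted selection rule: choose alpha at
    around `factor` times the typical residual-stream norm in the
    steered layers (the exact value was tuned on a validation set)."""
    return factor * float(np.linalg.norm(resid, axis=-1).mean())
```

In the paper's setting, refusal rate on unknown-entity questions is then measured with and without this intervention.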