Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models
Authors: Javier Ferrando, Oscar Obeso, Senthooran Rajamanoharan, Neel Nanda
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Using sparse autoencoders as an interpretability tool, we discover that a key part of these mechanisms is entity recognition, where the model detects if an entity is one it can recall facts about. ... We use Gemma Scope (Lieberum et al., 2024)...and find internal representations that suggest to encode knowledge awareness in Gemma 2 2B and 9B. ... We use a test set sample of 100 questions about unknown entities, and measure the number of times the model refuses by steering (as in Equation (4)). ... To do the analysis, we perform activation patching (Geiger et al., 2020; Vig et al., 2020; Meng et al., 2022a) on the residual streams and attention heads outputs... |
| Researcher Affiliation | Academia | Javier Ferrando (1, 2), Oscar Obeso (3), Senthooran Rajamanoharan, Neel Nanda; (1) U. Politècnica de Catalunya, (2) Barcelona Supercomputing Center, (3) ETH Zürich |
| Pseudocode | No | The paper describes methods through mathematical equations (e.g., Equation (1), (2), (4), (6), (9), (12)) and narrative text but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | We make the codebase available at https://github.com/javiferran/sae_entities. |
| Open Datasets | Yes | To study how language models reflect knowledge awareness about entities, we build a dataset with four different entity types: (basketball) players, movies, cities, and songs from Wikidata (Vrandečić & Krötzsch, 2024). |
| Dataset Splits | Yes | Finally, we split the entities into train/validation/test (50%, 10%, 40%) sets. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processors, or memory specifications used for running the experiments. It only mentions using Gemma 2 2B, 9B, and Llama 3.1 8B models. |
| Software Dependencies | No | The paper mentions using "Gemma Scope" and "Llama Scope" which are suites of Sparse Autoencoders, and references "Neuronpedia". However, it does not provide specific version numbers for these tools or any other key software libraries like PyTorch, TensorFlow, or Python. |
| Experiment Setup | Yes | We use a validation set to select an appropriate steering coefficient α. ... We select α ∈ [400, 550], which corresponds to around two times the norm of the residual stream in the layers where the entity recognition latents are present (Appendix E). |
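The steering setup quoted above (adding a latent direction to the residual stream with a coefficient α of roughly twice the residual-stream norm) can be sketched as follows. This is a minimal, self-contained illustration, not the paper's implementation: the function name `steer`, the random vectors, and the standalone NumPy setting are all assumptions for demonstration.

```python
import numpy as np

def steer(resid: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    """Add a steering vector to a residual-stream activation.

    resid:     residual-stream activation at one layer, shape (d_model,)
    direction: SAE latent decoder direction, shape (d_model,);
               normalized to unit length before scaling
    alpha:     steering coefficient (the paper selects it on a
               validation set, at ~2x the residual-stream norm)
    """
    unit = direction / np.linalg.norm(direction)
    return resid + alpha * unit

rng = np.random.default_rng(0)
d_model = 2304  # hidden size of Gemma 2 2B
resid = rng.standard_normal(d_model)
direction = rng.standard_normal(d_model)

# Choose alpha as about two times the residual-stream norm.
alpha = 2.0 * np.linalg.norm(resid)
steered = steer(resid, direction, alpha)

# The intervention moves the activation by exactly alpha along the
# (unit-norm) latent direction.
print(np.isclose(np.linalg.norm(steered - resid), alpha))
```

In practice this addition would be applied via a forward hook at the layers where the entity-recognition latents are active, and the refusal rate on held-out unknown-entity questions measured before and after.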