Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Looking Inward: Language Models Can Learn About Themselves by Introspection
Authors: Felix Jedidja Binder, James Chua, Tomek Korbak, Henry Sleight, John Hughes, Robert Long, Ethan Perez, Miles Turpin, Owain Evans
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In experiments with GPT-4, GPT-4o, and Llama-3 models, we find that the model M1 outperforms M2 in predicting itself, providing evidence for privileged access. Further experiments and ablations provide additional evidence. |
| Researcher Affiliation | Collaboration | Felix J. Binder (UCSD, Stanford); James Chua (Truthful AI); Tomek Korbak (Independent); Henry Sleight (MATS Program); John Hughes (Speechmatics); Robert Long (Eleos AI); Ethan Perez (Anthropic); Miles Turpin (Scale AI, NYU); Owain Evans (UC Berkeley, Truthful AI) |
| Pseudocode | No | The paper describes methods and experiments narratively and with diagrams, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | Code: We will make our code for data processing, model finetuning, and evaluation publicly available on GitHub after the review process. This includes implementations of our self-prediction and cross-prediction training procedures. |
| Open Datasets | Yes | We use publicly available datasets such as Wikipedia and MMLU. We augment existing datasets with additional hypothetical questions. We will release all augmented datasets, along with the prompts used to create them. ... Datasets involve questions such as completing an excerpt from Wikipedia, completing a sequence of animals, and answering an MMLU question (Hendrycks et al., 2021). |
| Dataset Splits | Yes | We train on 6 datasets and hold out the remaining 6 for testing to distinguish true introspection from mere memorization of training data. See Section A.4.3 for the full set of datasets. |
| Hardware Specification | No | For our experiments with OpenAI models, we used a batch size of 20... For finetuning the Llama models, we utilized the Fireworks API with default settings... |
| Software Dependencies | No | For finetuning the Llama models, we utilized the Fireworks API (Fireworks.ai, 2024)... For experiments with OpenAI models (GPT-4o, GPT-4 (OpenAI et al., 2024), and GPT-3.5 (OpenAI et al., 2024)), we use OpenAI's finetuning API (OpenAI, 2024c). |
| Experiment Setup | Yes | For our experiments with OpenAI models, we used a batch size of 20, 1 epoch, and a learning rate of 2... For finetuning the Llama models, we utilized the Fireworks API with default settings: a batch size of 16, LoRA rank of 32, 1 epoch, and a learning rate of 2.00E-05. |