Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Looking Inward: Language Models Can Learn About Themselves by Introspection
Authors: Felix Jedidja Binder, James Chua, Tomek Korbak, Henry Sleight, John Hughes, Robert Long, Ethan Perez, Miles Turpin, Owain Evans
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In experiments with GPT-4, GPT-4o, and Llama-3 models, we find that the model M1 outperforms M2 in predicting itself, providing evidence for privileged access. Further experiments and ablations provide additional evidence. |
| Researcher Affiliation | Collaboration | Felix J. Binder (UCSD, Stanford); James Chua (Truthful AI); Tomek Korbak (Independent); Henry Sleight (MATS Program); John Hughes (Speechmatics); Robert Long (Eleos AI); Ethan Perez (Anthropic); Miles Turpin (Scale AI, NYU); Owain Evans (UC Berkeley, Truthful AI) |
| Pseudocode | No | The paper describes methods and experiments narratively and with diagrams, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | Code: We will make our code for data processing, model finetuning, and evaluation publicly available on GitHub after the review process. This includes implementations of our self-prediction and cross-prediction training procedures. |
| Open Datasets | Yes | We use publicly available datasets such as Wikipedia and MMLU. We augment existing datasets with additional hypothetical questions. We will release all augmented datasets, along with the prompts used to create them. ... Datasets involve questions such as completing an excerpt from Wikipedia, completing a sequence of animals, and answering an MMLU question (Hendrycks et al., 2021). |
| Dataset Splits | Yes | We train on 6 datasets and hold out the remaining 6 for testing to distinguish true introspection from mere memorization of training data. See Section A.4.3 for the full set of datasets. |
| Hardware Specification | No | For our experiments with OpenAI models, we used a batch size of 20... For finetuning the Llama models, we utilized the Fireworks API with default settings... |
| Software Dependencies | No | For finetuning the Llama models, we utilized the Fireworks API (Fireworks.ai, 2024)... For experiments with OpenAI models (GPT-4o, GPT-4 (OpenAI et al., 2024), and GPT-3.5 (OpenAI et al., 2024)), we use OpenAI's finetuning API (OpenAI, 2024c). |
| Experiment Setup | Yes | For our experiments with OpenAI models, we used a batch size of 20, 1 epoch, and a learning rate of 2... For finetuning the Llama models, we utilized the Fireworks API with default settings: a batch size of 16, LoRA rank of 32, 1 epoch, and a learning rate of 2.00E-05. |