Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Extractive Structures Learned in Pretraining Enable Generalization on Finetuned Facts
Authors: Jiahai Feng, Stuart Russell, Jacob Steinhardt
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically show these effects in the OLMo-7b, Llama 3-8b, Gemma 2-9b, and Qwen 2-7b models. Of independent interest, our results also indicate that fact learning can occur at both early and late layers, which lead to different forms of generalization. |
| Researcher Affiliation | Academia | UC Berkeley. Correspondence to: Jiahai Feng <EMAIL>. |
| Pseudocode | No | The paper describes a framework and causal metrics using equations and conceptual diagrams but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | All code is available at https://github.com/jiahai-feng/extractive-structures/ |
| Open Datasets | Yes | The list of cities and names are generated from Claude-3.5-sonnet. Names: Grace Miller, Ethan Parker, Olivia Hughes, Jacob Turner, Ava Stewart, Noah Clark, Emma Howard, Liam Bennett, Mia Sanders, Lucas Foster, Sophia Hayes, Mason Brooks, Lily Cooper, Jackson Bell, Amelia Ward, Caleb Bryant, Chloe Campbell, Henry Morgan, Ella Adams, Owen Foster. Cities (city, country, language, landmark): Tokyo, Japan, Japanese, Senso-ji Temple; Beijing, China, Mandarin, Forbidden City; Mumbai, India, Marathi, Gateway of India; Paris, France, French, Eiffel Tower; Berlin, Germany, German, Brandenburg Gate; Moscow, Russia, Russian, St. Basil's Cathedral; Cairo, Egypt, Arabic, Great Pyramid of Giza; Bangkok, Thailand, Thai, Wat Arun; Istanbul, Turkey, Turkish, Blue Mosque; Sao Paulo, Brazil, Portuguese, Ibirapuera Park; Seoul, South Korea, Korean, N Seoul Tower; Rome, Italy, Italian, Colosseum; London, United Kingdom, English, Tower Bridge; Madrid, Spain, Spanish, Plaza Mayor; Athens, Greece, Greek, Acropolis; Hanoi, Vietnam, Vietnamese, Ho Chi Minh Mausoleum; Addis Ababa, Ethiopia, Amharic, Meskel Square; Jakarta, Indonesia, Indonesian, Istiqlal Mosque; Tehran, Iran, Persian, Azadi Tower; Nairobi, Kenya, Swahili, Uhuru Gardens. Below are the 100 names and the 20 animals. The first 80 names are used for training, and the last 20 are used for testing. |
| Dataset Splits | Yes | In two-hop reasoning, the novel fact can be in either the first hop (a → b) or the second hop (b → c), and we construct synthetic datasets, FIRST-HOP and SECOND-HOP, to study each (Table 1). In Sec. 6, we introduced a new dataset with fictitious relations. This requires a pairing between cities and a list of animals. We use the same set of cities from before, and use the set of 20 animals that Zhang et al. (2024) generated. Below are the 100 names and the 20 animals. The first 80 names are used for training, and the last 20 are used for testing. |
| Hardware Specification | No | The paper mentions using models like OLMo-7b, Llama 3-8b, Gemma 2-9b, and Qwen 2-7b, but does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions the Adam optimizer (Kingma & Ba, 2014) and models such as OLMo-7b-0424, Llama 3-8b, Gemma 2-9b, and Qwen 2-7b, but does not specify any general software dependencies or libraries with version numbers (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | Throughout the paper, we finetune the model using the standard cross-entropy loss. We only include the loss on answer tokens. We freeze the embedding and unembedding layers. We use the Adam optimizer (Kingma & Ba, 2014) for 8 epochs at a 3 × 10⁻⁶ learning rate, momentum (0.9, 0.999), and batch size 8. |