Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

From Style to Facts: Mapping the Boundaries of Knowledge Injection with Finetuning

Authors: Eric Zhao, Pranjal Awasthi, Nika Haghtalab

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	To this end, we conduct a large-scale experimental study of finetuning the frontier Gemini v1.5 model family on a spectrum of datasets that are artificially engineered to interpolate between the strengths and failure modes of finetuning.
Researcher Affiliation	Collaboration	Eric Zhao Google Research University of California, Berkeley Pranjal Awasthi Google Research Nika Haghtalab University of California, Berkeley
Pseudocode	No	The paper describes experimental setups and methodologies in detail but does not present any explicitly labeled pseudocode or algorithm blocks. The methods are described narratively within the text.
Open Source Code	Yes	Full source code is released at https://github.com/ericzhao28/finegrained_finetuning_analysis.
Open Datasets	Yes	We first collect a pool of Wikipedia articles about events occurring in 2024, after the model's training data cutoff.
Dataset Splits	Yes	Each finetuning dataset comprises 200,000 characters (roughly 500–700 interactions) subsampled from the corresponding tone pool. For evaluation, we use a holdout set of 100 prompts from the initial 1,500, ensuring no overlap with finetuning data.
Hardware Specification	No	The paper mentions finetuning on 'Gemini v1.5 Pro and Gemini v1.5 Flash models' and specifies 'LORA finetuning [Hu et al., 2022]', but it does not provide specific details on the underlying hardware (e.g., GPU models, CPU types, or TPU versions) used for these experiments.
Software Dependencies	No	The paper mentions using 'LORA finetuning [Hu et al., 2022]' for finetuning, but it does not specify version numbers for LORA, or any other software components like programming languages, machine learning frameworks (e.g., PyTorch, TensorFlow), or operating systems.
Experiment Setup	Yes	For each tone, we train three Gemini v1.5 Pro models [Gemini Team, 2024], with a fresh dataset for each seed, for 40 epochs using LORA [Hu et al., 2022] finetuning.
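
The split quoted in the Dataset Splits row (a 100-prompt holdout drawn from 1,500 prompts, and a finetuning set capped at 200,000 characters) can be illustrated with a minimal sketch. This is not the authors' code; their implementation is in the repository linked in the Open Source Code row. The function name `split_and_subsample`, the representation of each interaction as a single string, the shuffle-based sampling, and the synthetic pool are all assumptions made for illustration.

```python
import random


def split_and_subsample(interactions, holdout_size=100, char_budget=200_000, seed=0):
    """Hypothetical helper: hold out evaluation items, then fill a
    finetuning set from the remainder up to a character budget."""
    rng = random.Random(seed)
    shuffled = list(interactions)
    rng.shuffle(shuffled)

    # Reserve the holdout first so it cannot overlap with finetuning data.
    holdout = shuffled[:holdout_size]
    remainder = shuffled[holdout_size:]

    # Add interactions until the next one would exceed the budget.
    train, used = [], 0
    for item in remainder:
        if used + len(item) > char_budget:
            break
        train.append(item)
        used += len(item)
    return train, holdout


# Synthetic stand-in for a tone pool: 1,500 unique strings of 350 characters
# each (assumed sizes, chosen only to make the example concrete).
pool = [f"interaction {i:04d} ".ljust(350, "x") for i in range(1500)]
train, holdout = split_and_subsample(pool)
print(len(train), len(holdout))
```

With interactions of a few hundred characters each, a 200,000-character budget yields a finetuning set in the range of several hundred interactions, consistent with the "roughly 500–700" figure quoted in the table.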