Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
From Style to Facts: Mapping the Boundaries of Knowledge Injection with Finetuning
Authors: Eric Zhao, Pranjal Awasthi, Nika Haghtalab
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To this end, we conduct a large-scale experimental study of finetuning the frontier Gemini v1.5 model family on a spectrum of datasets that are artificially engineered to interpolate between the strengths and failure modes of finetuning. |
| Researcher Affiliation | Collaboration | Eric Zhao Google Research University of California, Berkeley Pranjal Awasthi Google Research Nika Haghtalab University of California, Berkeley |
| Pseudocode | No | The paper describes experimental setups and methodologies in detail but does not present any explicitly labeled pseudocode or algorithm blocks. The methods are described narratively within the text. |
| Open Source Code | Yes | Full source code is released at https://github.com/ericzhao28/finegrained_finetuning_analysis. |
| Open Datasets | Yes | We first collect a pool of Wikipedia articles about events occurring in 2024, after the model's training data cutoff. |
| Dataset Splits | Yes | Each finetuning dataset comprises 200,000 characters (roughly 500 – 700 interactions) subsampled from the corresponding tone pool. For evaluation, we use a holdout set of 100 prompts from the initial 1,500, ensuring no overlap with finetuning data. |
| Hardware Specification | No | The paper mentions finetuning on 'Gemini v1.5 Pro and Gemini v1.5 Flash models' and specifies 'LORA finetuning [Hu et al., 2022]', but it does not provide specific details on the underlying hardware (e.g., GPU models, CPU types, or TPU versions) used for these experiments. |
| Software Dependencies | No | The paper mentions using 'LORA finetuning [Hu et al., 2022]' for finetuning, but it does not specify version numbers for LORA, or any other software components like programming languages, machine learning frameworks (e.g., PyTorch, TensorFlow), or operating systems. |
| Experiment Setup | Yes | For each tone, we train three Gemini v1.5 Pro models [Gemini Team, 2024], with a fresh dataset for each seed, for 40 epochs using LORA [Hu et al., 2022] finetuning. |
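
The split described in the Dataset Splits row can be illustrated with a minimal sketch. This is not the authors' code: the function names, the greedy character-budget rule, and the prompt-to-response dictionary layout are assumptions made for illustration only. It holds out a fixed set of prompts for evaluation, then draws a fresh character-budgeted finetuning set for each seed from the remaining prompts.

```python
import random


def make_holdout(prompts, holdout_size=100, seed=0):
    """Split prompts into (train_prompts, holdout_prompts) with no overlap.

    Hypothetical helper: the holdout is drawn once with a fixed seed so that
    every finetuning run is evaluated on the same prompts.
    """
    rng = random.Random(seed)
    shuffled = list(prompts)
    rng.shuffle(shuffled)
    return shuffled[holdout_size:], shuffled[:holdout_size]


def subsample_by_chars(train_prompts, pool, char_budget=200_000, seed=0):
    """Draw interactions from one tone pool until the character budget is hit.

    `pool` maps each prompt to its response in a given tone (assumed layout).
    Interactions are added in a seed-dependent random order, and sampling
    stops before the first interaction that would exceed the budget.
    """
    rng = random.Random(seed)
    order = list(train_prompts)
    rng.shuffle(order)
    dataset, used = [], 0
    for prompt in order:
        response = pool[prompt]
        size = len(prompt) + len(response)
        if used + size > char_budget:
            break
        dataset.append((prompt, response))
        used += size
    return dataset
```

Calling `subsample_by_chars` with a different `seed` for each training run yields a different dataset per run, while the holdout prompts stay fixed and never appear in any finetuning set.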