Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Copyright-Protected Language Generation via Adaptive Model Fusion
Authors: Javier Abad, Konstantin Donhauser, Francesco Pinto, Fanny Yang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments, we show that CP-Fuse significantly reduces the reproduction of protected material without compromising the quality of text and code generation. Moreover, its post-hoc nature allows seamless integration with other protective measures, further enhancing copyright safeguards. Lastly, we show that CP-Fuse is robust against common techniques for extracting training data. 1 INTRODUCTION Large Language Models (LLMs), such as GPT-4 (Achiam et al., 2023) and Gemini (Team et al., 2023), have achieved undeniable success. ... 4 EXPERIMENTS We conduct our experiments using language models that are commonly employed in practical applications. |
| Researcher Affiliation | Academia | Javier Abad ETH Zurich Konstantin Donhauser ETH Zurich Francesco Pinto University of Chicago Fanny Yang ETH Zurich |
| Pseudocode | No | The paper provides mathematical formulations (e.g., Equation (2) and Lemma 3.2) but does not include any clearly labeled pseudocode or algorithm blocks with structured steps. |
| Open Source Code | Yes | See our GitHub repository: https://github.com/jaabmar/cp_fuse. |
| Open Datasets | Yes | We use four code-based (Python instructions, APPS, MBPP, HumanEval) and two text-based (Math Abstracts, Writing Prompts) datasets in our experiments, all downloadable from Hugging Face. The first code-based dataset is an instructional dataset for Python (Python instructions) ... The APPS dataset is a benchmark for code generation ... Both the MBPP and the HumanEval datasets are standard for assessing code generation ... For the text-based experiments, we use the Auto Math Text dataset (Zhang et al., 2024b), referred to as Math Abstracts. ... Finally, the Writing Prompts dataset (Fan et al., 2018) contains amateur-level stories from a Reddit forum. |
| Dataset Splits | Yes | Each dataset (details provided below) is partitioned into two non-overlapping subsets of 3,000 samples each, and a separate model is fine-tuned on each subset. ... We include additional results on a test set comprising 500 prompts. |
| Hardware Specification | Yes | All experiments were conducted using NVIDIA A40 GPUs. For each token generated by CP-Fuse, the method involves two steps: (1) a forward pass through the two models, and (2) solving an optimization problem via grid search. Forward Passes: The base models perform a forward pass at each decoding step. In our experiments, both models were run on a single NVIDIA A40 GPU. ... The training was performed on A100 GPUs. |
| Software Dependencies | No | The paper mentions software tools and packages like 'AdamW (8-bit)', 'finetuning-harness', 'bigcode-evaluation-harness', 'nltk package', 'spacy package', 'JPlag', and 'Dolos', but it does not specify explicit version numbers for these components to ensure reproducibility. |
| Experiment Setup | Yes | D.2 FINE-TUNING DETAILS: We fine-tuned our models using a setup inspired by the repository finetuning-harness, available under the MIT License. The training was performed on A100 GPUs. The main hyperparameters for our fine-tuning process are listed in Table 16: Sequence Length 2048; Batch Size 1; Learning Rate 5e-5; Gradient Accumulation Steps 1; Optimizer AdamW (8-bit); Warmup Steps 50; NEFTune Noise α = 5.0. |
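The Hardware Specification evidence describes CP-Fuse's per-token procedure: a forward pass through both fine-tuned models, followed by a grid search over fusion weights. The sketch below illustrates that decoding loop shape only; the function name `fuse_next_token`, the geometric-mixture form, and the balance criterion are illustrative assumptions, not the paper's actual objective (Equation (2) in the paper), which we do not reproduce here.

```python
import numpy as np

def fuse_next_token(logp1, logp2, grid=np.linspace(0.0, 1.0, 21)):
    """Hypothetical sketch of per-token adaptive model fusion.

    logp1, logp2: next-token log-probability vectors from the two
    fine-tuned models (one forward pass each, step (1) in the paper).
    A grid search over the mixture weight `a` (step (2)) picks the
    geometric mixture p(a) ~ p1^a * p2^(1-a) that best balances the
    two models, here using a toy balance criterion: minimize the
    expected log-probability gap between the models under the fused
    distribution. The real CP-Fuse objective differs.
    """
    best_a, best_gap, best_fused = None, np.inf, None
    for a in grid:
        fused = a * logp1 + (1.0 - a) * logp2          # log of unnormalized geometric mixture
        fused -= np.log(np.sum(np.exp(fused)))         # renormalize to a distribution
        gap = abs(np.sum(np.exp(fused) * (logp1 - logp2)))
        if gap < best_gap:
            best_a, best_gap, best_fused = a, gap, fused
    return best_fused, best_a
```

With symmetric toy distributions (e.g. `logp1 = log([0.7, 0.2, 0.1])`, `logp2 = log([0.1, 0.2, 0.7])`), the search returns a weight near 0.5; in actual decoding this loop would run once per generated token, which is why the report notes both models occupy a GPU throughout generation.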