Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Copyright-Protected Language Generation via Adaptive Model Fusion
Authors: Javier Abad, Konstantin Donhauser, Francesco Pinto, Fanny Yang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments, we show that CP-Fuse significantly reduces the reproduction of protected material without compromising the quality of text and code generation. Moreover, its post-hoc nature allows seamless integration with other protective measures, further enhancing copyright safeguards. Lastly, we show that CP-Fuse is robust against common techniques for extracting training data. 1 INTRODUCTION Large Language Models (LLMs), such as GPT-4 (Achiam et al., 2023) and Gemini (Team et al., 2023), have achieved undeniable success. ... 4 EXPERIMENTS We conduct our experiments using language models that are commonly employed in practical applications. |
| Researcher Affiliation | Academia | Javier Abad ETH Zurich Konstantin Donhauser ETH Zurich Francesco Pinto University of Chicago Fanny Yang ETH Zurich |
| Pseudocode | No | The paper provides mathematical formulations (e.g., Equation (2) and Lemma 3.2) but does not include any clearly labeled pseudocode or algorithm blocks with structured steps. |
| Open Source Code | Yes | See our GitHub repository: https://github.com/jaabmar/cp_fuse. |
| Open Datasets | Yes | We use four code-based (Python instructions, APPS, MBPP, HumanEval) and two text-based (Math Abstracts, Writing Prompts) datasets in our experiments, all downloadable from Hugging Face. The first code-based dataset is an instructional dataset for Python (Python instructions) ... The APPS dataset is a benchmark for code generation ... Both the MBPP and the HumanEval datasets are standard for assessing code generation ... For the text-based experiments, we use the Auto Math Text dataset (Zhang et al., 2024b), referred to as Math Abstracts. ... Finally, the Writing Prompts dataset (Fan et al., 2018) contains amateur-level stories from a Reddit forum. |
| Dataset Splits | Yes | Each dataset (details provided below) is partitioned into two non-overlapping subsets of 3,000 samples each, and a separate model is fine-tuned on each subset. ... We include additional results on a test set comprising 500 prompts. |
| Hardware Specification | Yes | All experiments were conducted using NVIDIA A40 GPUs. For each token generated by CP-Fuse, the method involves two steps: (1) a forward pass through the two models, and (2) solving an optimization problem via grid search. Forward Passes: The base models perform a forward pass at each decoding step. In our experiments, both models were run on a single NVIDIA A40 GPU. ... The training was performed on A100 GPUs. |
| Software Dependencies | No | The paper mentions software tools and packages like 'AdamW (8-bit)', 'finetuning-harness', 'bigcode-evaluation-harness', 'nltk package', 'spacy package', 'JPlag', and 'Dolos', but it does not specify explicit version numbers for these components to ensure reproducibility. |
| Experiment Setup | Yes | D.2 FINE-TUNING DETAILS: We fine-tuned our models using a setup inspired by the repository finetuning-harness, available under the MIT License. The training was performed on A100 GPUs. The main hyperparameters for our fine-tuning process are listed in Table 16: Sequence Length 2048; Batch Size 1; Learning Rate 5e-5; Gradient Accumulation Steps 1; Optimizer AdamW (8-bit); Warmup Steps 50; NEFTune Noise α = 5.0. |
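The Hardware Specification evidence describes CP-Fuse's per-token procedure: a forward pass through both fine-tuned models, followed by a grid search over fusion weights. The sketch below illustrates that decoding loop shape only; the function name `fuse_next_token`, the geometric-mixture form, and the balance criterion are illustrative assumptions, not the paper's actual objective (Equation (2) in the paper), which we do not reproduce here.

```python
import numpy as np

def fuse_next_token(logp1, logp2, grid=np.linspace(0.0, 1.0, 21)):
    """Hypothetical sketch of per-token adaptive model fusion.

    logp1, logp2: next-token log-probability vectors from the two
    fine-tuned models (one forward pass each, step (1) in the paper).
    A grid search over the mixture weight `a` (step (2)) picks the
    geometric mixture p(a) ~ p1^a * p2^(1-a) that best balances the
    two models, here using a toy balance criterion: minimize the
    expected log-probability gap between the models under the fused
    distribution. The real CP-Fuse objective differs.
    """
    best_a, best_gap, best_fused = None, np.inf, None
    for a in grid:
        fused = a * logp1 + (1.0 - a) * logp2          # log of unnormalized geometric mixture
        fused -= np.log(np.sum(np.exp(fused)))         # renormalize to a distribution
        gap = abs(np.sum(np.exp(fused) * (logp1 - logp2)))
        if gap < best_gap:
            best_a, best_gap, best_fused = a, gap, fused
    return best_fused, best_a
```

With symmetric toy distributions (e.g. `logp1 = log([0.7, 0.2, 0.1])`, `logp2 = log([0.1, 0.2, 0.7])`), the search returns a weight near 0.5; in actual decoding this loop would run once per generated token, which is why the report notes both models occupy a GPU throughout generation.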