Copyright Traps for Large Language Models
Authors: Matthieu Meeus, Igor Shilov, Manuel Faysse, Yves-Alexandre de Montjoye
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We carefully design a randomized controlled experimental setup, inserting traps into original content (books) and train a 1.3B LLM from scratch. We first validate that the use of content in our target model would be undetectable using existing methods. We then show, contrary to intuition, that even medium-length trap sentences repeated a significant number of times (100) are not detectable using existing methods. However, we show that longer sequences repeated a large number of times can be reliably detected (AUC=0.75) and used as copyright traps. |
| Researcher Affiliation | Academia | 1 Department of Computing, Imperial College London, United Kingdom; 2 MICS, CentraleSupélec, Université Paris-Saclay, Paris, France. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code used for trap sequence generation and analysis is available on GitHub. |
| Open Datasets | Yes | More specifically, we use the open-source library (Pully, 2020) to collect 9,542 books made available in the public domain on Project Gutenberg (Hart, 1971) |
| Dataset Splits | Yes | As members, we consider the trap sequences, both M_{D,synth} and M_{D,real}, which we created and injected as described in Sec. 4.1 and Sec. 4.3, as they all have been included in the training dataset of LM. As non-members, we repeat the exact same generation process to create a similar set of sequences that we exclude from the training dataset. (A hedged membership-inference sketch follows the table.) |
| Hardware Specification | Yes | It is trained with Microsoft DeepSpeed on a distributed compute cluster, with 30 nodes of 8 x Nvidia A100 GPUs during 17 days. |
| Software Dependencies | No | The paper mentions 'Microsoft DeepSpeed' and a 'BPE SentencePiece tokenizer' but does not provide specific version numbers for these or other key software dependencies. |
| Experiment Setup | Yes | The model is a 1.3 billion parameter LLaMA model (Touvron et al., 2023b) with 24 layers, a hidden size of 2,048, an intermediate size of 5,504 and 16 key-value heads. Training is done with batches of 7,680 sequences of length 2,048, which means that over 15 million tokens are seen at each training step. (A hedged configuration sketch follows the table.) |
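
The evaluation quoted in the Research Type and Dataset Splits rows compares trap sequences injected into the training data (members) against sequences generated the same way but held out (non-members), and reports an AUC for detectability. The sketch below illustrates that kind of membership-inference scoring; it is not the authors' released code, and the checkpoint path, placeholder trap-sequence lists, and the use of negative mean token loss as the membership score are illustrative assumptions.

```python
# Minimal membership-inference sketch (assumptions noted above), not the authors' code.
import torch
from sklearn.metrics import roc_auc_score
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "path/to/1.3b-target-model"  # hypothetical checkpoint path
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH).eval()

@torch.no_grad()
def mean_token_loss(text: str) -> float:
    """Mean cross-entropy of the target model on one trap sequence."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    return model(ids, labels=ids).loss.item()

# members: trap sequences that were injected into the training corpus
# non_members: sequences generated the same way but excluded from training
members = ["..."]       # placeholder lists of trap sequences
non_members = ["..."]

# A memorized sequence tends to receive a lower loss, so negate it as the score.
scores = [-mean_token_loss(s) for s in members + non_members]
labels = [1] * len(members) + [0] * len(non_members)
print("membership AUC:", roc_auc_score(labels, scores))
```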
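
The Experiment Setup row quotes the target model's hyperparameters (24 layers, hidden size 2,048, intermediate size 5,504, 16 heads, training sequences of length 2,048). Below is a hedged sketch of such a configuration using Hugging Face's `LlamaConfig`; the vocabulary size and the use of the `transformers` library are assumptions not stated in the excerpt.

```python
# Hedged sketch of a ~1.3B LLaMA-style configuration matching the quoted hyperparameters.
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=32_000,              # assumption: vocabulary size not given in the excerpt
    hidden_size=2_048,
    intermediate_size=5_504,
    num_hidden_layers=24,
    num_attention_heads=16,
    num_key_value_heads=16,         # "16 key-value heads" from the excerpt
    max_position_embeddings=2_048,  # training sequence length of 2,048
)
model = LlamaForCausalLM(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e9:.2f}B parameters")
```

With a 32k vocabulary this instantiates roughly 1.3 billion parameters, and the quoted batch arithmetic checks out: 7,680 sequences × 2,048 tokens ≈ 15.7 million tokens per training step.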