Copyright Traps for Large Language Models
Authors: Matthieu Meeus, Igor Shilov, Manuel Faysse, Yves-Alexandre de Montjoye
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We carefully design a randomized controlled experimental setup, inserting traps into original content (books) and train a 1.3B LLM from scratch. We first validate that the use of content in our target model would be undetectable using existing methods. We then show, contrary to intuition, that even medium-length trap sentences repeated a significant number of times (100) are not detectable using existing methods. However, we show that longer sequences repeated a large number of times can be reliably detected (AUC=0.75) and used as copyright traps. |
| Researcher Affiliation | Academia | 1 Department of Computing, Imperial College London, United Kingdom; 2 MICS, CentraleSupélec, Université Paris-Saclay, Paris, France. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code used for trap sequence generation and analysis is available on GitHub. |
| Open Datasets | Yes | More specifically, we use the open-source library (Pully, 2020) to collect 9,542 books made available in the public domain on Project Gutenberg (Hart, 1971) |
| Dataset Splits | Yes | As members, we consider the trap sequences, both M_{D,synth} and M_{D,real}, which we created and injected as described in Sec. 4.1 and Sec. 4.3, as they all have been included in the training dataset of LM. As non-members, we repeat the exact same generation process to create a similar set of sequences that we exclude from the training dataset. (A hedged membership-inference sketch follows the table.) |
| Hardware Specification | Yes | It is trained with Microsoft DeepSpeed on a distributed compute cluster, with 30 nodes of 8 x Nvidia A100 GPUs during 17 days. |
| Software Dependencies | No | The paper mentions 'Microsoft DeepSpeed' and a 'BPE SentencePiece tokenizer' but does not provide specific version numbers for these or other key software dependencies. |
| Experiment Setup | Yes | The model is a 1.3 billion parameter LLaMA model (Touvron et al., 2023b) with 24 layers, a hidden size of 2,048, an intermediate size of 5,504 and 16 key-value heads. Training is done with batches of 7,680 sequences of length 2,048, which means that over 15 million tokens are seen at each training step. (A hedged configuration sketch follows the table.) |
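
The evaluation quoted in the Research Type and Dataset Splits rows compares trap sequences injected into the training data (members) against sequences generated the same way but held out (non-members), and reports an AUC for detectability. The sketch below illustrates that kind of membership-inference scoring; it is not the authors' released code, and the checkpoint path, placeholder trap-sequence lists, and the use of negative mean token loss as the membership score are illustrative assumptions.

```python
# Minimal membership-inference sketch (assumptions noted above), not the authors' code.
import torch
from sklearn.metrics import roc_auc_score
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "path/to/1.3b-target-model"  # hypothetical checkpoint path
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH).eval()

@torch.no_grad()
def mean_token_loss(text: str) -> float:
    """Mean cross-entropy of the target model on one trap sequence."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    return model(ids, labels=ids).loss.item()

# members: trap sequences that were injected into the training corpus
# non_members: sequences generated the same way but excluded from training
members = ["..."]       # placeholder lists of trap sequences
non_members = ["..."]

# A memorized sequence tends to receive a lower loss, so negate it as the score.
scores = [-mean_token_loss(s) for s in members + non_members]
labels = [1] * len(members) + [0] * len(non_members)
print("membership AUC:", roc_auc_score(labels, scores))
```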
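
The Experiment Setup row quotes the target model's hyperparameters (24 layers, hidden size 2,048, intermediate size 5,504, 16 heads, training sequences of length 2,048). Below is a hedged sketch of such a configuration using Hugging Face's `LlamaConfig`; the vocabulary size and the use of the `transformers` library are assumptions not stated in the excerpt.

```python
# Hedged sketch of a ~1.3B LLaMA-style configuration matching the quoted hyperparameters.
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=32_000,              # assumption: vocabulary size not given in the excerpt
    hidden_size=2_048,
    intermediate_size=5_504,
    num_hidden_layers=24,
    num_attention_heads=16,
    num_key_value_heads=16,         # "16 key-value heads" from the excerpt
    max_position_embeddings=2_048,  # training sequence length of 2,048
)
model = LlamaForCausalLM(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e9:.2f}B parameters")
```

With a 32k vocabulary this instantiates roughly 1.3 billion parameters, and the quoted batch arithmetic checks out: 7,680 sequences × 2,048 tokens ≈ 15.7 million tokens per training step.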