Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Fantastic Copyrighted Beasts and How (Not) to Generate Them
Authors: Luxi He, Yangsibo Huang, Weijia Shi, Tinghao Xie, Haotian Liu, Yue Wang, Luke Zettlemoyer, Chiyuan Zhang, Danqi Chen, Peter Henderson
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To address these questions, we introduce a novel evaluation framework with metrics that assess both the generated image's similarity to copyrighted characters and its consistency with user intent, grounded in a set of popular copyrighted characters from diverse studios and regions. We show that state-of-the-art image and video generation models can still generate characters... We also introduce semi-automatic techniques to identify such keywords or descriptions that trigger character generation. Using this framework, we evaluate mitigation strategies... This section presents our empirical results... |
| Researcher Affiliation | Academia | Princeton University, University of Washington, University of Wisconsin-Madison, University of Southern California. Emails for leading authors: EMAIL, EMAIL, and EMAIL. |
| Pseudocode | Yes | Algorithm 1 EMBEDDINGSIM Ranking. Input: character name C, n candidate words W = {w_i}, text encoder g. 1: for each w_i in W do 2: encode w_i to g(w_i) using g 3: s_{w_i} ← g(C) · g(w_i) / (‖g(C)‖ ‖g(w_i)‖) 4: end for 5: sort W by s_{w_i} in descending order 6: return sorted W. Algorithm 2 CO-OCCURRENCE Ranking. Input: character name C, n candidate words W = {w_i}, training corpora D. 1: for each document d in D do 2: if C and w_i co-occur in d then s_{w_i} ← s_{w_i} + 1 3: end if 4: end for 5: sort W by s_{w_i} in descending order 6: return sorted W |
| Open Source Code | Yes | Our code is available at https://github.com/princeton-nlp/CopyCat. |
| Open Datasets | Yes | We examine common training corpora, including captions from image-captioning datasets (LAION-2B (Schuhmann et al., 2022)) as well as text-only datasets (C4 (Raffel et al., 2020), OpenWebText (Radford et al., 2019), and The Pile (Gao et al., 2020))... We curated a dataset comprising 200 descriptions of copyrighted characters and 200 standard prompts unlikely to cause copyright issues, selected from the MJHQ benchmark: huggingface.co/datasets/playgroundai/MJHQ-30K |
| Dataset Splits | No | The paper uses publicly available datasets (LAION-2B, C4, OpenWebText, and The Pile) for keyword ranking, and curates a dataset of 200 copyrighted-character descriptions and 200 standard prompts for detector evaluation. However, it does not provide explicit training/validation/test splits (e.g., percentages, sample counts, or citations to predefined splits) for these datasets in the context of its own experiments, making it difficult to reproduce the exact data partitioning. |
| Hardware Specification | Yes | All experiments are conducted on 2 NVIDIA A100 GPU cards, each with 80GB of memory. |
| Software Dependencies | No | The paper mentions using GPT-4, GPT-4V, and CLIP-Flan T5 as evaluators and for prompt generation, but it does not provide specific version numbers for these models or any other software libraries used (e.g., Python, PyTorch, CUDA versions), which is necessary for a reproducible description of software dependencies. |
| Experiment Setup | Yes | For Playground v2.5, Stable Diffusion XL (SDXL), and PixArt-α, we use 50 iterative steps to progressively refine the image from noise to a coherent output. We set guidance_scale to 3 for the strength of the conditioning signal. For DeepFloyd IF, we use the standard 3-stage setup. Models for the 3 stages are DeepFloyd's IF-I-XL-v1.0, IF-II-L-v1.0, and Stability AI's stable-diffusion-x4-upscaler, respectively. All generation configurations are the models' defaults. For video generation on VideoFusion, we use the model's default parameters to generate a 16-frame video, and take the first, middle, and last frames for detailed study. |
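The EMBEDDINGSIM pseudocode (Algorithm 1) quoted in the table can be sketched as a short Python function. This is an illustrative reconstruction only: the function name `embeddingsim_rank`, the plain-list embeddings, and the toy vectors are assumptions; the paper's actual implementation embeds words with a CLIP-style text encoder g.

```python
from math import sqrt

def embeddingsim_rank(character_vec, candidates):
    """Rank candidate words by cosine similarity between each word's
    embedding g(w_i) and the character-name embedding g(C).

    character_vec: embedding of the character name C (list of floats)
    candidates: dict mapping word -> embedding vector g(w_i)
    Returns the candidate words sorted by similarity, highest first.
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = sqrt(sum(x * x for x in a))
        norm_b = sqrt(sum(x * x for x in b))
        return dot / (norm_a * norm_b)

    scores = {w: cosine(character_vec, v) for w, v in candidates.items()}
    return sorted(candidates, key=scores.get, reverse=True)
```

With a toy 2-D example, a candidate whose embedding points in nearly the same direction as the character embedding ranks above an orthogonal one.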
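The CO-OCCURRENCE pseudocode (Algorithm 2) can likewise be sketched in Python. Again a hedged reconstruction: `cooccurrence_rank` and the substring-based co-occurrence test are assumptions for illustration; the paper applies this counting over large corpora such as LAION-2B captions, C4, OpenWebText, and The Pile.

```python
def cooccurrence_rank(character, candidates, corpus):
    """Rank candidate words by the number of documents in which
    they co-occur with the character name.

    character: character name C (string)
    candidates: list of candidate words W
    corpus: iterable of documents D (strings)
    Returns candidates sorted by co-occurrence count, highest first.
    """
    scores = {w: 0 for w in candidates}
    for doc in corpus:
        if character in doc:           # document mentions the character
            for w in candidates:
                if w in doc:           # ... and the candidate word
                    scores[w] += 1
    return sorted(candidates, key=scores.get, reverse=True)
```

On a toy three-document corpus, a word that appears alongside the character name outranks one that never does.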