Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Data-Copying in Generative Models: A Formal Framework
Authors: Robi Bhattacharjee, Sanjoy Dasgupta, Kamalika Chaudhuri
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We include an empirical comparison between our methods in Section 5.2, where we demonstrate that ours is able to capture certain forms of data-copying that theirs is not. We now return to the example presented in Figure 3 and empirically investigate the following question: is our algorithm able to outperform the one given in (Meehan et al., 2020) over this example? |
| Researcher Affiliation | Academia | Robi Bhattacharjee¹, Sanjoy Dasgupta¹, Kamalika Chaudhuri¹. *Equal contribution. ¹UCSD. Correspondence to: Robi Bhattacharjee <EMAIL>. |
| Pseudocode | Yes | Algorithm 1: Est(x, r, S); Algorithm 2: Data Copy Detect(S, q, m); Algorithm 3: Estimate k(S) |
| Open Source Code | No | The paper does not provide any explicit statements about open-sourcing code or links to a code repository. |
| Open Datasets | Yes | Our data distribution, p, is the Halfmoon dataset with Gaussian noise (σ = 0.1). Our generated distribution, q, is trained from an i.i.d. sample of 2000 points from p, $S \sim p^{2000}$. |
| Dataset Splits | No | The paper describes the generation of the distribution q and the use of a training sample S for q, but it does not provide explicit training, validation, or test dataset splits for the models or detector evaluated in the experiments. |
| Hardware Specification | No | The paper does not provide any specific hardware details such as GPU/CPU models or other computing specifications used for the experiments. |
| Software Dependencies | No | The paper does not specify software dependencies with version numbers, such as programming languages or libraries used for implementation. |
| Experiment Setup | Yes | Our data distribution, p, is the Halfmoon dataset with Gaussian noise (σ = 0.1). To construct $q_{\mathrm{copy}}$... with a small amount of spherical noise (with radius 0.02). To construct $q_{\mathrm{underfit}}$... with a moderate amount of spherical noise (with radius 0.25). We fix λ = 20 and γ = 0.00025 as constants for data-copy detection. We directly set m = 200,000. For Est(x, r, S), we set b = 400. We set λ = 20 and γ = 1/4000. |
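
The setup quoted in the table can be approximated with a short sketch. This is not the authors' code (the report notes that none was released). It assumes the standard two-moons construction for the "Halfmoon dataset" and reads "spherical noise (with radius r)" as an offset drawn uniformly from a disc of radius r. The function names `sample_halfmoon` and `add_spherical_noise` are hypothetical, and because the quoted text elides how the two generated distributions are assembled, the last lines only illustrate the two noise radii on resampled training points.

```python
import math
import random


def sample_halfmoon(n, sigma=0.1, rng=None):
    """Draw n 2-D points from a two-moons shape plus Gaussian noise.

    Assumption: the usual two interleaved half circles; the source only
    says "Halfmoon dataset with Gaussian noise (sigma = 0.1)".
    """
    rng = rng or random.Random(0)
    points = []
    for _ in range(n):
        t = rng.uniform(0.0, math.pi)
        if rng.random() < 0.5:
            x, y = math.cos(t), math.sin(t)              # upper moon
        else:
            x, y = 1.0 - math.cos(t), 0.5 - math.sin(t)  # lower moon
        points.append((x + rng.gauss(0.0, sigma), y + rng.gauss(0.0, sigma)))
    return points


def add_spherical_noise(point, radius, rng):
    """Shift a 2-D point by an offset drawn uniformly from a disc.

    Assumption: "spherical noise" means uniform inside a ball of the
    given radius; the quoted text does not define it.
    """
    r = radius * math.sqrt(rng.random())  # sqrt gives area-uniform radii
    theta = rng.uniform(0.0, 2.0 * math.pi)
    return (point[0] + r * math.cos(theta), point[1] + r * math.sin(theta))


rng = random.Random(0)
S = sample_halfmoon(2000, sigma=0.1, rng=rng)  # training sample of 2000 points

# Illustration only: resampled training points with the two quoted radii.
near_copies = [add_spherical_noise(rng.choice(S), 0.02, rng) for _ in range(5)]
blurred = [add_spherical_noise(rng.choice(S), 0.25, rng) for _ in range(5)]
```

Points perturbed with radius 0.02 stay very close to a training point, which is the behavior a data-copying detector is meant to flag, while radius 0.25 is larger than the noise scale σ = 0.1 of the data itself.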