Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Training-free Detection of AI-generated images via Cropping Robustness

Authors: Sungik Choi, Hankook Lee, Moontae Lee

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We conduct extensive evaluations of WaRPAD across multiple AI-generated image detection benchmarks. These benchmarks span various generative model types, including LDMs, proprietary models (e.g., Firefly [17], DALL-E [18]), and generative adversarial network (GAN) [19] architectures, as well as multiple image domains. Our method consistently outperforms other training-free baselines in all settings. Notably, we observe an improvement of 6.5-24.7% in AUROC over prior methods based on the same DINOv2 model. Furthermore, we evaluate the robustness of our method against various image corruptions and show that it maintains competitive performance under such conditions, surpassing other detection methods.
Researcher Affiliation	Collaboration	Sungik Choi (1), Hankook Lee (2), Moontae Lee (1,3). (1) LG AI Research, (2) Sungkyunkwan University, (3) University of Illinois Chicago.
Pseudocode	Yes	A.1 Pseudocode of WaRPAD. We show the PyTorch-like pseudocode of WaRPAD in Algorithm 1. Note that all operations allow batch-wise computation, hence we can process the input patches in a batch-wise manner. Algorithm 1 WaRPAD (PyTorch-like pseudocode): # f(x): normalized [cls] token output of self-supervised model # alpha: weight of perturbation # DWTForward, DWTInverse: forward and inverse discrete wavelet transform # Sim: cosine similarity function def HFwav(x): x_low, x_high = DWTForward(x) N_perturb = DWTInverse([torch.zeros_like(x_low), x_high]) feat_original = f(x) feat_perturb = f(x + alpha * N_perturb) return Sim(feat_original, feat_perturb) def WaRPAD(x): x_patch = RescaleNPatchify(x) f_patch = HFwav(x_patch) return f_patch.mean()
Open Source Code	No	Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: We will publish the code after the paper gets accepted. However, as our algorithm is deterministic and the benchmark is public, it would be easy to reproduce the results.
Open Datasets	Yes	Datasets. We first test WaRPAD on the Synthbuster [34] benchmark, based on the RAISE-1k dataset [25], where 9 generative models are applied: Firefly [17], GLIDE [38], SDXL [39], SDv2, SDv1.3, SDv1.4 [8], DALL-E 3 [40], DALL-E 2 [18], and Midjourney [9]. We also test WaRPAD on the GenImage [35] benchmark, where the real data is from the ImageNet [5] dataset and 8 generative models are applied for fake image generation: ADM [41], BigGAN [42], GLIDE, Midjourney, SDv1.4, SDv1.5, VQDM [43], and Wukong [44]. Finally, we test on the Deepfake-LSUN-Bedroom benchmark [36] containing 10 generative models: ADM, DDPM [45], Diff-ProjectedGAN, Diff-StyleGAN2 [46], IDDPM [47], LDM [8], PNDM [48], ProGAN [49], ProjectedGAN [50], and StyleGAN [51]. We summarize the main information of these benchmarks in Table 1. Synthbuster. The Synthbuster benchmark consists of 1000 real RAISE-1k images and 9000 AI-generated images consisting of scene and art images under the CC BY-NC-SA 4.0 license. We download all real and AI-generated datasets via the author's official repository. GenImage. The GenImage benchmark consists of ImageNet real data and AI-generated data from 8 different generative models under the CC BY-NC-SA 4.0 license. Each test consists of pairs of real and AI-generated images, where the size is 6000+6000 except for SDv1.5, where the size is 8000+8000. We download the datasets via the author's official repository. Deepfake-LSUN-Bedroom. The Deepfake-LSUN-Bedroom benchmark consists of 10000 real LSUN-Bedroom images and 10 × 10000 AI-generated images, where each model is trained to generate LSUN-Bedroom-like images. We download the datasets via the author's official repository under the CC BY 4.0 license.
Dataset Splits	Yes	Synthbuster. The Synthbuster benchmark consists of 1000 real RAISE-1k images and 9000 AI-generated images consisting of scene and art images under the CC BY-NC-SA 4.0 license. GenImage. ...Each test consists of pairs of real and AI-generated images, where the size is 6000+6000 except for SDv1.5, where the size is 8000+8000. Deepfake-LSUN-Bedroom. The Deepfake-LSUN-Bedroom benchmark consists of 10000 real LSUN-Bedroom images and 10 × 10000 AI-generated images, where each model is trained to generate LSUN-Bedroom-like images.
Hardware Specification	Yes	All experiments are done on a single A100 GPU.
Software Dependencies	No	We implement our code in the PyTorch [52] framework.
Experiment Setup	Yes	Implementation Details. The performance of all methods is reported by the area under the ROC curve (AUROC). Consistent with RIGID and MINDER, we use the DINO-ViT-L14 model as the base model. We use the Haar wavelet with a 2-level decomposition to extract the high-frequency information. We also set d_patch and α to 224 and 0.1 throughout all experiments. For the rescaling dimension d_rescale, we set 896 for the GenImage and Deepfake-LSUN-Bedroom benchmarks and 1344 for the Synthbuster benchmark.
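The pseudocode and setup quoted in the table can be illustrated with a small dependency-free sketch. It is not the authors' implementation, and several parts are assumptions made for illustration: `toy_features` is a hypothetical stand-in for the self-supervised model f(x), images are grayscale nested lists that are assumed to be already rescaled, and patches are taken on a non-overlapping grid. The sketch also relies on a property of the Haar wavelet: zeroing the low-frequency band and inverting a J-level transform is equivalent to subtracting the mean of each 2^J × 2^J block, so no wavelet library is used.

```python
import math

def haar_high_freq(img, levels=2):
    """High-frequency part of img under a `levels`-level Haar transform.

    For Haar, zeroing the low band and inverting equals subtracting the
    mean of every (2**levels x 2**levels) block from the image.
    """
    b = 2 ** levels
    h, w = len(img), len(img[0])
    assert h % b == 0 and w % b == 0, "sides must be divisible by 2**levels"
    out = [row[:] for row in img]
    for r0 in range(0, h, b):
        for c0 in range(0, w, b):
            cells = [(r, c) for r in range(r0, r0 + b) for c in range(c0, c0 + b)]
            mean = sum(img[r][c] for r, c in cells) / len(cells)
            for r, c in cells:
                out[r][c] = img[r][c] - mean
    return out

def toy_features(img):
    """Hypothetical stand-in for f(x); the paper uses a ViT [cls] token."""
    feats = []
    for row in img:
        feats.append(sum(row) / len(row))  # row brightness
        feats.append(sum(abs(row[i + 1] - row[i])
                         for i in range(len(row) - 1)) / (len(row) - 1))  # row roughness
    return feats

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def hf_wav(patch, alpha=0.1, levels=2, f=toy_features):
    """Feature similarity between a patch and its high-frequency-perturbed copy."""
    noise = haar_high_freq(patch, levels)
    perturbed = [[p + alpha * n for p, n in zip(prow, nrow)]
                 for prow, nrow in zip(patch, noise)]
    return cosine(f(patch), f(perturbed))

def patchify(img, d_patch):
    """Split img into non-overlapping d_patch x d_patch tiles (an assumption)."""
    h, w = len(img), len(img[0])
    return [[row[c0:c0 + d_patch] for row in img[r0:r0 + d_patch]]
            for r0 in range(0, h - d_patch + 1, d_patch)
            for c0 in range(0, w - d_patch + 1, d_patch)]

def warpad_score(img, d_patch=224, alpha=0.1, levels=2, f=toy_features):
    """Mean patch-level similarity, mirroring f_patch.mean() in the pseudocode."""
    patches = patchify(img, d_patch)
    return sum(hf_wav(p, alpha, levels, f) for p in patches) / len(patches)
```

Under the non-overlapping tiling assumed here, the reported settings would give (896 / 224)^2 = 16 patches per image for GenImage and Deepfake-LSUN-Bedroom, and (1344 / 224)^2 = 36 patches for Synthbuster. A real reproduction would replace `toy_features` with the DINO-based feature extractor and add the rescaling step, neither of which is specified in enough detail on this page to reconstruct.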