Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Synthesize Privacy-Preserving High-Resolution Images via Private Textual Intermediaries

Authors: Haoxiang Wang, Zinan Lin, Da Yu, Huishuai Zhang

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate our SPTI method on six datasets: LSUN Bedroom [58], Cat [27], European Art [41], Wave-ui-25k [61], MMCelebA-HQ [51, 52, 33, 21, 24], and our newly constructed Sprite Fright. These datasets are selected because each represents a realistic, privacy-sensitive scenario. For instance, LSUN Bedroom captures aspects of an individual's private living space, while Wave-ui-25k consists of screenshots that may reveal personal information from electronic devices.
Researcher Affiliation	Collaboration	Haoxiang Wang (Peking University); Zinan Lin (Microsoft Research); Da Yu (Google Research); Huishuai Zhang (Wangxuan Institute of Computer Technology, Peking University; State Key Laboratory of General Artificial Intelligence)
Pseudocode	Yes	Algorithm 1 SPTI: Privately Synthesize High-Resolution Images via Synthesis via Private Textual Intermediaries; Algorithm 2 Aug-PE with Image Voting (Aug_PE_Image_Voting)
Open Source Code	Yes	Our code release: https://github.com/MarkGodrick/SPTI
Open Datasets	Yes	We evaluate our SPTI method on six datasets: LSUN Bedroom [58], Cat [27], European Art [41], Wave-ui-25k [61], MMCelebA-HQ [51, 52, 33, 21, 24], and our newly constructed Sprite Fright.
Dataset Splits	No	The paper describes generating different numbers of samples (e.g., 2,000, 4,000, 6,000, 8,000, and 10,000) for training a downstream classifier, and mentions evaluating test accuracy. However, it does not provide specific percentages or counts for the train/test/validation splits used for the datasets themselves or how the generated samples were split for the classifier's training and evaluation beyond general statements like 'test accuracy is evaluated using the ground-truth labels'.
Hardware Specification	Yes	Under our current implementation, a complete data generation run takes approximately 19 hours on a single NVIDIA A800, or 15.5 hours on 2 NVIDIA A100 GPUs, with a peak memory requirement of about 70 GB.
Software Dependencies	No	The paper mentions using specific models like GPT-4o-mini, Qwen-VL-Max, Meta-Llama-3-8B-Instruct, and SDXL-Turbo, but does not provide specific version numbers for underlying software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages used for implementation.
Experiment Setup	Yes	To facilitate reproducibility, we detail the hyperparameter settings used for all experiments in Section 4. For hyperparameters not mentioned, the default values are used. For experiment settings of ablation studies between image voting and text voting, we use the same hyperparameters in Table 6 below. For comparison experiments on SPTI, PE and DP fine-tuning, we present our configuration details in Table 7. For specific configuration of DP fine-tuning, we present our settings in Table 8.