Generate Synthetic Text Approximating the Private Distribution with Differential Privacy

Authors: Wenhao Zhao, Shaoyang Song, Chunlai Zhou

IJCAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through comparisons with various baselines on different datasets, we demonstrate that our synthetic text can closely match the utility of private text, while providing privacy protection standards robust enough to resist membership inference attacks from malicious users.
Researcher Affiliation | Academia | Computer Science Dept, Renmin University of China, Beijing, CHINA {zhaowh, songshaoyang, czhou}@ruc.edu.cn
Pseudocode | Yes | Algorithm 1: Synthetic Text Generation
Open Source Code | No | The paper does not include an unambiguous statement about releasing code for the work described, nor does it provide a direct link to a source-code repository.
Open Datasets | Yes | The AGNews dataset [Zhang et al., 2015] consists of approximately 120,000 news articles categorized into four classes: World, Sports, Business, and Science/Technology. The Disaster dataset [Bansal et al., 2019] originates from news reports and Twitter, with 4342 samples describing different disasters (e.g. fire, flood) and an additional 3271 samples covering topics other than disasters. The Trec dataset [Voorhees and Tice, 2000] comprises questions from 6 different categories, such as numbers, locations, etc.
Dataset Splits | No | The distribution of the 5500 questions in the training set and the 500 questions in the test set is uneven across these 6 question labels. (Explanation: The paper mentions training and test sets but does not specify a validation set or clear proportions for all splits across all datasets.)
Hardware Specification | No | The paper mentions using 'babbage (1.3B), curie (6.7B), and davinci (175B)' models, but does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used to run the experiments or for training.
Software Dependencies | No | GTR-base [Ni et al., 2021] model to embed public texts and private texts. (Explanation: The paper mentions specific models used, but does not list software dependencies with version numbers, such as programming languages or libraries.)
Experiment Setup | Yes | In the initial population step: we select 1000 public texts as our initial population and use the GTR-base [Ni et al., 2021] model to embed public texts and private texts. In the private selection step: we follow the common practice of setting δ = 1/|D|, where |D| is the size of the private dataset. The domain of candidate parent samples has size 300, and 30 samples are selected from it. The GPT2 model [Brown et al., 2020] is used to obtain perplexity and filter out texts whose perplexity exceeds the threshold of 50. In the offspring generation step: a large smoothing parameter α would lead to a high degree of homogenization among the final synthetic texts, so we set α to 0.1 and sample 3000 samples from the updated distribution for the next iteration.
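The private selection step quoted above (a 300-candidate domain, 30 selected parents, a smoothing parameter α = 0.1, and 3000 samples drawn for the next iteration) can be sketched roughly as follows. This is a minimal illustration under assumptions: the nearest-neighbour voting, the Gaussian noise mechanism, and all function and parameter names (`dp_select_parents`, `sigma`, etc.) are hypothetical stand-ins, not the authors' exact method.

```python
import math
import random

def dp_select_parents(private_embs, public_embs, domain_size=300,
                      num_parents=30, sigma=1.0, alpha=0.1,
                      num_offspring=3000, rng=None):
    """Sketch of one private-selection round: each private embedding
    'votes' for its nearest public candidate, noise is added to the
    vote histogram for differential privacy, and the smoothed noisy
    counts define the sampling distribution for the next generation."""
    rng = rng or random.Random(0)
    candidates = public_embs[:domain_size]

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u)) or 1.0
        nv = math.sqrt(sum(b * b for b in v)) or 1.0
        return dot / (nu * nv)

    # Nearest-neighbour voting: one vote per private text, so the
    # histogram has sensitivity 1 under add/remove of one record.
    votes = [0.0] * len(candidates)
    for p in private_embs:
        best = max(range(len(candidates)),
                   key=lambda i: cosine(p, candidates[i]))
        votes[best] += 1.0

    # Gaussian-mechanism noise on the histogram (assumed mechanism).
    noisy = [v + rng.gauss(0.0, sigma) for v in votes]

    # Keep the top-scoring candidates as parent samples.
    parents = sorted(range(len(candidates)), key=lambda i: -noisy[i])[:num_parents]

    # Smooth with alpha so low-vote candidates keep some probability
    # mass, then sample candidate indices for the next iteration.
    weights = [max(h, 0.0) + alpha for h in noisy]
    total = sum(weights)
    probs = [w / total for w in weights]
    offspring = rng.choices(range(len(candidates)), weights=probs, k=num_offspring)
    return parents, offspring
```

A small α (here 0.1) keeps the sampling distribution close to the noisy vote counts; a large α would flatten it toward uniform, which matches the paper's remark that heavy smoothing homogenizes the final synthetic texts.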