Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

SeqPATE: Differentially Private Text Generation via Knowledge Distillation

Authors: Zhiliang Tian, Yingxiu Zhao, Ziyue Huang, Yu-Xiang Wang, Nevin L. Zhang, He He

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The experiments verify the effectiveness of SeqPATE in protecting both training samples and sensitive phrases.
Researcher Affiliation | Academia | National University of Defense Technology; The Hong Kong University of Science and Technology; UC Santa Barbara; New York University
Pseudocode | Yes | We show the training algorithm in App. B and a running example in App. K.
Open Source Code | Yes | Our code is publicly accessible [Footnote 7: https://github.com/tianzhiliang/SeqPATE]. (See details about hyperparameters in App. G.)
Open Datasets | Yes | We evaluate our model on two datasets. AirDialog [47] consists of 1M utterances from customer service dialog on flight booking; Europarl_v6 consists of 2M English sentences collected from European Parliament [Footnote 5: www.statmt.org/europarl]. (See details about datasets in App. E.) [Reference 47: Wei Wei, Quoc Le, Andrew Dai, and Jia Li. AirDialogue: An environment for goal-oriented dialogue research. In EMNLP, pages 3844-3854, 2018.]
Dataset Splits | Yes | The coefficient λ that balances supervision for the teacher and the pseudo-data (Eq. 4) is set to 20, where we have tuned it on the validation set of the public pseudo-data.
Hardware Specification | No | The paper mentions running SeqPATE “on a single GPU” but does not specify the make, model, or any other details about the GPU or other hardware components (e.g., CPU, memory).
Software Dependencies | No | The paper mentions fine-tuning models from the “pre-trained GPT-2 model [35]” and using the “Adam [23]” optimizer. However, it does not provide specific version numbers for GPT-2, Adam, or any underlying deep learning framework (e.g., PyTorch, TensorFlow) or programming language.
Experiment Setup | Yes | The batch size is 32 for all comparing methods except the GC [24] (GC [24] requires 2048). We use Adam [23] and adjust the initial learning rate with a range of 10^-3 to 10^-6 for all methods. The δ mentioned in Sec. 5 for all DP methods is 10^-6. The coefficient λ... is set to 20... The default number of teacher models is 2k... The threshold p is 0.95.
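The hyperparameters reported in the Experiment Setup and Dataset Splits rows can be gathered into a single configuration object, which makes a replication attempt easier to audit. What follows is a minimal Python sketch, not the authors' code. The names SeqPATESetup, log_grid and batch_size_for are hypothetical. Reading "2k" teachers as 2,000 is an assumption, and so is searching the learning rate at one value per decade between 10^-3 and 10^-6, since only the range is reported.

```python
from dataclasses import dataclass, field
from typing import List


def log_grid(high_exp: int, low_exp: int) -> List[float]:
    """Learning rates 10^-high_exp, ..., 10^-low_exp, one per decade.

    Assumption: only the range 10^-3 to 10^-6 is reported,
    not how densely it was searched.
    """
    return [10.0 ** -e for e in range(high_exp, low_exp + 1)]


@dataclass(frozen=True)
class SeqPATESetup:
    """Hypothetical container for the reported SeqPATE hyperparameters."""

    batch_size: int = 32          # all compared methods except GC [24]
    gc_batch_size: int = 2048     # GC [24] requires a batch size of 2048
    optimizer: str = "adam"       # Adam [23]
    dp_delta: float = 1e-6        # delta for all DP methods (Sec. 5)
    lam: float = 20.0             # lambda in Eq. 4, tuned on validation pseudo-data
    num_teachers: int = 2000      # assumption: "2k" read as 2,000 teachers
    threshold_p: float = 0.95     # threshold p
    lr_grid: List[float] = field(default_factory=lambda: log_grid(3, 6))


def batch_size_for(setup: SeqPATESetup, method: str) -> int:
    """GC is the only method reported with a different batch size."""
    return setup.gc_batch_size if method == "GC" else setup.batch_size


if __name__ == "__main__":
    setup = SeqPATESetup()
    print(batch_size_for(setup, "GC"))       # 2048
    print(batch_size_for(setup, "SeqPATE"))  # 32
    print(setup.lr_grid)
```

Freezing the dataclass keeps a run's settings from being mutated mid-experiment. Any value the paper leaves to its appendices (e.g., App. G) would still need to be checked against the released code.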