De-Anonymizing Text by Fingerprinting Language Generation

Authors: Zhen Sun, Roei Schuster, Vitaly Shmatikov

NeurIPS 2020

Reproducibility assessment (variable, result, and supporting LLM response):
Research Type: Experimental. "We measure pairwise Euclidean distances between the NSS of variable sequences and the NSS of other (not necessarily variable) sequences. Figure 2a shows the histogram (smoothed by averaging over a 10-bucket window) for 500 randomly chosen 2700-word sequences from the OkCupid dataset. After measuring 1566 traces from the reddit-sports dataset and removing noisy traces, we fit a normal distribution and set d(N) to 10 standard deviations above the mean."
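The thresholding rule quoted above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes each NSS is a fixed-length numeric vector, and the function name and array shapes are illustrative.

```python
import numpy as np

def nss_distance_threshold(nss_a, nss_b, k=10):
    """Sketch of the quoted rule: fit a normal distribution to pairwise
    Euclidean distances between NSS vectors and set the cutoff d(N) at
    k standard deviations above the mean (the paper uses k = 10)."""
    # Pairwise Euclidean distances between rows of nss_a and rows of nss_b.
    diffs = nss_a[:, None, :] - nss_b[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1).ravel()
    mu, sigma = dists.mean(), dists.std()
    return mu + k * sigma
```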
Researcher Affiliation: Academia. Zhen Sun (Cornell University, zs352@cornell.edu); Roei Schuster (Cornell Tech and Tel Aviv University, rs864@cornell.edu); Vitaly Shmatikov (Cornell Tech, shmat@cs.cornell.edu).
Pseudocode: Yes. "Algorithm 1: Nucleus sampling [24]" and "Algorithm 3: Top-p filtering with a fixed number of loop iterations."
Open Source Code: No. "We disclosed our findings and our proposed mitigation code by email to members of the Hugging Face engineering team responsible for the implementation of nucleus sampling (identified via a contact at Hugging Face and GitHub's commit log) and a message to Hugging Face's public Facebook contact point." (This describes responsible disclosure, not a public release of the authors' own source code.)
Open Datasets: Yes. "We downloaded 5 subreddit archives from ConvoKit [8] that have the fewest common users... We used an archive of Silk Road forum posts [39]... We selected the 200 most active users from the Ubuntu Chat corpus [41]."
Dataset Splits: No. The paper describes concatenating user posts into sequences and simulating auto-completion, but it does not specify explicit training, validation, or test splits (or percentages) for the experimental setup.
Hardware Specification: Yes. "The victim and attacker run as (isolated) processes on the same core of an 8-core Intel Xeon E5-1660 v4 CPU."
Software Dependencies: Yes. "We used Hugging Face [23] and PyTorch [34] code versions from, respectively, 7/18/2019 and 7/22/2019."
Experiment Setup: Yes. "We use nucleus size q = 0.9. To reduce computational complexity, we modified the script to save the encoder's hidden state for every prefix, so it is necessary to decode only one additional word for the next prefix."
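The caching optimization quoted above can be illustrated with a toy model. This is not the authors' script: it uses a minimal recurrent "language model" whose hidden state summarizes the prefix, purely to show why saving that state lets each new prefix be scored with one decoding step (O(n) total) instead of re-running the whole sequence (O(n^2) total).

```python
import numpy as np

def step(hidden, token_emb):
    """One decoding step of a toy recurrent model (illustrative only)."""
    return np.tanh(0.5 * hidden + token_emb)

def prefix_states_naive(embs):
    """Re-run the model from scratch for every prefix: O(n^2) steps."""
    states = []
    for i in range(1, len(embs) + 1):
        h = np.zeros_like(embs[0])
        for e in embs[:i]:
            h = step(h, e)
        states.append(h)
    return states

def prefix_states_cached(embs):
    """Save the hidden state after each prefix and extend it by one
    token, as in the quoted optimization: O(n) steps."""
    states, h = [], np.zeros_like(embs[0])
    for e in embs:
        h = step(h, e)
        states.append(h)
    return states
```

Both functions produce identical per-prefix states; the cached version simply avoids recomputing the shared prefix.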