De-Anonymizing Text by Fingerprinting Language Generation

Authors: Zhen Sun, Roei Schuster, Vitaly Shmatikov

NeurIPS 2020

Reproducibility assessment (variable, result, and supporting LLM response):
Research Type: Experimental. "We measure pairwise Euclidean distances between the NSS of variable sequences and the NSS of other (not necessarily variable) sequences. Figure 2a shows the histogram (smoothed by averaging over a 10-bucket window) for 500 randomly chosen 2700-word sequences from the OkCupid dataset. After measuring 1566 traces from the reddit-sports dataset and removing noisy traces, we fit a normal distribution and set d(N) to 10 standard deviations above the mean."
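The thresholding rule quoted above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes each NSS is a fixed-length numeric vector, and the function name and array shapes are illustrative.

```python
import numpy as np

def nss_distance_threshold(nss_a, nss_b, k=10):
    """Sketch of the quoted rule: fit a normal distribution to pairwise
    Euclidean distances between NSS vectors and set the cutoff d(N) at
    k standard deviations above the mean (the paper uses k = 10)."""
    # Pairwise Euclidean distances between rows of nss_a and rows of nss_b.
    diffs = nss_a[:, None, :] - nss_b[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1).ravel()
    mu, sigma = dists.mean(), dists.std()
    return mu + k * sigma
```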
Researcher Affiliation: Academia. Zhen Sun (Cornell University, zs352@cornell.edu); Roei Schuster (Cornell Tech and Tel Aviv University, rs864@cornell.edu); Vitaly Shmatikov (Cornell Tech, shmat@cs.cornell.edu).
Pseudocode: Yes. "Algorithm 1: Nucleus sampling [24]" and "Algorithm 3: Top-p filtering with a fixed number of loop iterations."
Open Source Code: No. "We disclosed our findings and our proposed mitigation code by email to members of the Hugging Face engineering team responsible for the implementation of nucleus sampling (identified via a contact at Hugging Face and GitHub's commit log) and a message to Hugging Face's public Facebook contact point." (This describes responsible disclosure, not a public release of the authors' own source code.)
Open Datasets: Yes. "We downloaded 5 subreddit archives from ConvoKit [8] that have the fewest common users... We used an archive of Silk Road forum posts [39]... We selected the 200 most active users from the Ubuntu Chat corpus [41]."
Dataset Splits: No. The paper describes concatenating user posts into sequences and simulating auto-completion, but it does not specify explicit training, validation, or test splits (or percentages) for the experimental setup.
Hardware Specification: Yes. "The victim and attacker run as (isolated) processes on the same core of an 8-core Intel Xeon E5-1660 v4 CPU."
Software Dependencies: Yes. "We used Hugging Face [23] and PyTorch [34] code versions from, respectively, 7/18/2019 and 7/22/2019."
Experiment Setup: Yes. "We use nucleus size q = 0.9. To reduce computational complexity, we modified the script to save the encoder's hidden state for every prefix, so it is necessary to decode only one additional word for the next prefix."
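The caching optimization quoted above can be illustrated with a toy model. This is not the authors' script: it uses a minimal recurrent "language model" whose hidden state summarizes the prefix, purely to show why saving that state lets each new prefix be scored with one decoding step (O(n) total) instead of re-running the whole sequence (O(n^2) total).

```python
import numpy as np

def step(hidden, token_emb):
    """One decoding step of a toy recurrent model (illustrative only)."""
    return np.tanh(0.5 * hidden + token_emb)

def prefix_states_naive(embs):
    """Re-run the model from scratch for every prefix: O(n^2) steps."""
    states = []
    for i in range(1, len(embs) + 1):
        h = np.zeros_like(embs[0])
        for e in embs[:i]:
            h = step(h, e)
        states.append(h)
    return states

def prefix_states_cached(embs):
    """Save the hidden state after each prefix and extend it by one
    token, as in the quoted optimization: O(n) steps."""
    states, h = [], np.zeros_like(embs[0])
    for e in embs:
        h = step(h, e)
        states.append(h)
    return states
```

Both functions produce identical per-prefix states; the cached version simply avoids recomputing the shared prefix.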