De-Anonymizing Text by Fingerprinting Language Generation
Authors: Zhen Sun, Roei Schuster, Vitaly Shmatikov
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We measure pairwise Euclidean distances between the NSS of variable sequences and the NSS of other (not necessarily variable) sequences. Figure 2a shows the histogram (smoothed by averaging over a 10-bucket window) for 500 randomly chosen 2700-word sequences from the Ok Cupid dataset. After measuring 1566 traces from the reddit-sports dataset and removing noisy traces, we fit a normal distribution and set d(N) to 10 standard deviations above the mean. |
| Researcher Affiliation | Academia | Zhen Sun Cornell University zs352@cornell.edu Roei Schuster Cornell Tech, Tel Aviv University rs864@cornell.edu Vitaly Shmatikov Cornell Tech shmat@cs.cornell.edu |
| Pseudocode | Yes | Algorithm 1 Nucleus sampling [24] and Algorithm 3 Top-p filtering with a fixed number of loop iterations. |
| Open Source Code | No | We disclosed our findings and our proposed mitigation code by email to members of the Hugging Face engineering team responsible for the implementation of nucleus sampling (identified via a contact at Hugging Face and GitHub's commit log) and a message to Hugging Face's public Facebook contact point. (This describes disclosure, not public release of their own source code.) |
| Open Datasets | Yes | We downloaded 5 subreddit archives from Convokit [8] that have the fewest common users... We used an archive of Silk Road forum posts [39]... We selected the 200 most active users from the Ubuntu Chat corpus [41]. |
| Dataset Splits | No | The paper describes concatenating user posts into sequences and simulating auto-completion, but it does not specify explicit training, validation, or test dataset splits or percentages for their experimental setup. |
| Hardware Specification | Yes | The victim and attacker run as (isolated) processes on the same core of an 8-core, Intel Xeon E5-1660 v4 CPU. |
| Software Dependencies | Yes | We used Hugging Face [23] and PyTorch [34] code versions from, respectively, 7/18/2019 and 7/22/2019. |
| Experiment Setup | Yes | We use nucleus size with q = 0.9. To reduce computational complexity, we modified the script to save the encoder's hidden state for every prefix, so it is necessary to decode only one additional word for the next prefix. |
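The detection threshold quoted in the Research Type row (fit a normal distribution to trace distances, then cut at 10 standard deviations above the mean) can be sketched as below. This is a minimal illustration of that thresholding step; the function name and interface are assumptions for this sketch, not the authors' code.

```python
import statistics

def fit_threshold(distances, k=10.0):
    """Fit a normal distribution to a list of pairwise distances and
    return a cutoff k standard deviations above the mean.
    The paper sets d(N) with k = 10 after removing noisy traces."""
    mu = statistics.mean(distances)
    sigma = statistics.stdev(distances)
    return mu + k * sigma
```

Any measured distance above this cutoff would then be treated as an outlier relative to the fitted distribution.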
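The nucleus sampling procedure cited in the Pseudocode row (Algorithm 1, top-p filtering) can be sketched roughly as follows. This is a generic illustration of top-p sampling over a probability vector, not the paper's exact Algorithm 1 or 3, and all names here are hypothetical.

```python
import random

def nucleus_sample(probs, p=0.9, rng=random):
    """Top-p (nucleus) sampling: keep the smallest set of highest-probability
    tokens whose cumulative probability reaches p, renormalize over that
    nucleus, and draw one token index from it."""
    # Sort token indices by probability, descending.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    cumulative = 0.0
    nucleus = []
    for i in order:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    # Renormalize over the nucleus and sample one index.
    total = sum(probs[i] for i in nucleus)
    r = rng.random() * total
    for i in nucleus:
        r -= probs[i]
        if r <= 0:
            return i
    return nucleus[-1]
```

Note that the nucleus size (the number of loop iterations before the cumulative cutoff) varies with the shape of the distribution, which is the data-dependent behavior the paper's fingerprinting attack exploits.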