Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Scalable Private Partition Selection via Adaptive Weighting
Authors: Justin Y. Chen, Vincent Cohen-Addad, Alessandro Epasto, Morteza Zadimoghaddam
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we conduct experiments on several publicly-available datasets with up to 800 billions of (user, item) pairs (up to three orders of magnitude larger than prior datasets used in sequential algorithms). Our algorithm outperforms scalable baselines and is competitive with the sequential baselines. |
| Researcher Affiliation | Collaboration | 1Massachusetts Institute of Technology, Cambridge, MA, USA 2Google Research, New York, NY, USA 3Google Research, Zurich, Switzerland. Correspondence to: Justin Chen <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Meta-algorithm for private partition selection. WeightAndThreshold(S, ε, δ, Δ₀, ALG, h)... (Page 4), Algorithm 2 MAD(S, τ, dmax, b, bmin, bmax)... (Page 5), Algorithm 3 Basic (Appendix A.1), Algorithm 4 UserWeights (Appendix A.2), Algorithm 5 MAD2R (Appendix A.2). |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology it describes. It refers to existing differentially private libraries (e.g., PyDP, Google's DP Libraries, OpenMined DP Library) but neither states that the authors' own implementation of MAD/MAD2R is open source nor provides a link. |
| Open Datasets | Yes | We consider 9 datasets with statistics detailed in Table 1... Higgs (Leskovec & Krevl, 2014)... IMDb (Maas et al., 2011)... Reddit (Gopi et al., 2020)... Finance (Aenlle)... Wiki (Wijkhuizen)... Twitter (Axelbrooke, 2017)... Amazon (McAuley & Leskovec, 2013; Zhang et al., 2015)... Clueweb (Boldi et al., 2011) and Common Crawl (https://www.commoncrawl.org/). The latter has approximately 2 billion distinct items and 800 billion entries. |
| Dataset Splits | No | The paper describes data processing steps like subsampling and capping user degrees, and how items are represented for text datasets, but it does not specify explicit training, validation, or test dataset splits in terms of percentages or sample counts needed to reproduce experimental evaluation. |
| Hardware Specification | Yes | We implement all parallel algorithms (MAD, MAD2R, Basic, DP-SIPS) using C++ in a modern multi-machine massively parallel computation framework in our institution. This framework allows to use a fleet of shared (x86_64) architecture machines with 2.45GHz clocks. The machines are shared by several projects and can have up to 256 cores and up to 512GB of RAM. |
| Software Dependencies | No | The paper mentions that algorithms were implemented using Python and C++, but it does not provide specific version numbers for these languages or any key libraries or dependencies used. |
| Experiment Setup | Yes | Unless otherwise specified, we use ε = 1, δ = 10⁻⁵, and Δ₀ = 100... For Policy Gaussian and Greedy Update, we set the β = 4... For DP-SIPS, we take the best result of running with a privacy split of [0.1, 0.9] and [0.05, 0.15, 0.8]. For MAD and MAD2R, we set dmax = 50 and β = 2. For MAD2R, we set the privacy split of [0.1, 0.9], bmin = 0.5, bmax = 2, Clb = 1, and Cub = 3. |
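
Since no official code is linked, the MAD2R settings quoted in the Experiment Setup row can be collected into a small configuration object for anyone attempting a reimplementation. The sketch below is illustrative only: the class name `MAD2RConfig`, its field names, and the `round_budgets` helper are hypothetical and do not come from the paper. Two readings are assumptions: `delta_0` is taken to be the per-user contribution bound set to 100 in the setup, and the privacy split is assumed to divide both ε and δ proportionally across the two rounds, which the quoted text does not state.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class MAD2RConfig:
    """Hypothetical container for the MAD2R settings quoted in the table."""

    epsilon: float = 1.0          # privacy parameter epsilon
    delta: float = 1e-5           # privacy parameter delta
    delta_0: int = 100            # assumed: per-user contribution bound
    d_max: int = 50               # dmax
    beta: float = 2.0             # beta for MAD and MAD2R
    privacy_split: Tuple[float, ...] = (0.1, 0.9)
    b_min: float = 0.5            # bmin
    b_max: float = 2.0            # bmax
    c_lb: float = 1.0             # Clb
    c_ub: float = 3.0             # Cub

    def __post_init__(self) -> None:
        # Basic sanity checks; these are not taken from the paper.
        if self.epsilon <= 0 or not 0 < self.delta < 1:
            raise ValueError("need epsilon > 0 and 0 < delta < 1")
        if not math.isclose(sum(self.privacy_split), 1.0):
            raise ValueError("privacy_split fractions must sum to 1")
        if self.b_min > self.b_max or self.c_lb > self.c_ub:
            raise ValueError("lower bounds must not exceed upper bounds")

    def round_budgets(self) -> List[Tuple[float, float]]:
        # Assumption: each round gets the same fraction of epsilon and delta.
        return [(f * self.epsilon, f * self.delta) for f in self.privacy_split]


if __name__ == "__main__":
    cfg = MAD2RConfig()
    for i, (eps, dlt) in enumerate(cfg.round_budgets(), start=1):
        print(f"round {i}: epsilon={eps:g}, delta={dlt:g}")
```

Keeping the values in one frozen object makes it easy to log the exact setup next to each result, and the validation catches a split such as `(0.1, 0.8)` before any privacy budget is spent.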