Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Scaling Up Differentially Private LASSO Regularized Logistic Regression via Faster Frank-Wolfe Iterations
Authors: Edward Raff, Amol Khanna, Fred Lu
NeurIPS 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our results demonstrate that this procedure can reduce runtime by a factor of up to 2,200, depending on the value of the privacy parameter ϵ and the sparsity of the dataset. We test our algorithm on high-dimensional problems with up to 8.4 million datapoints and 20 million features on a single machine and find speedups ranging from 10× to 2,200× that of the standard DP Frank-Wolfe method. |
| Researcher Affiliation | Industry | Edward Raff Booz Allen Hamilton EMAIL Amol Khanna Booz Allen Hamilton EMAIL Fred Lu Booz Allen Hamilton EMAIL |
| Pseudocode | Yes | Algorithm 1 Standard Sparse-Aware Frank Wolfe, Algorithm 2 Fast Sparse-Aware Frank-Wolfe for Linear Models, Algorithm 3 Fibonacci Heap Frank-Wolfe Queue Maintenance, Algorithm 4 Big-Step Little-Step Exponential Sampler |
| Open Source Code | No | Our code was written in Java due to the need for explicit looping, and implemented using the JSAT library [36]. There is no explicit statement of code release or a repository link for the specific contributions of this paper. |
| Open Datasets | Yes | Table 2: Datasets used for evaluation. We focus on cases that are high-dimensional and sparse. Dataset (N, D): RCV1 (20,242, 47,236); News20 |
| Dataset Splits | No | The paper uses the phrase "training iterations" and discusses evaluating on "test datasets" but does not explicitly mention validation splits or their proportions. |
| Hardware Specification | Yes | All experiments were run on a machine with 12 CPU cores (though only one core was used), and 128GB of RAM. |
| Software Dependencies | Yes | Our code was written in Java due to the need for explicit looping, and implemented using the JSAT library [36]. |
| Experiment Setup | Yes | Due to the computational requirements of running all tests, we fix the total number of iterations T = 4,000 and maximum L1 norm for the Lasso constraint to be λ = 50 in all tests across all datasets. This value produces highly accurate models in all non-private cases, and the goal of our work is not to perform hyper-parameter tuning but to demonstrate that we have taken an already known algorithm with established convergence rates and made each iteration more efficient. |
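For context on the algorithm family being classified above: the paper builds on the differentially private Frank-Wolfe method for L1-constrained (LASSO) logistic regression, where each iteration selects one coordinate and moves toward a vertex of the L1 ball. The sketch below is an illustrative Python reconstruction, not the paper's released code (which was written in Java using JSAT and is not public). The function names are ours, and the exponential-mechanism coordinate selection is a simplified stand-in that omits the sensitivity calibration a real DP implementation requires.

```python
import numpy as np

def logistic_loss(w, X, y):
    """Mean logistic loss for labels y in {0, 1}, computed stably."""
    z = X @ w
    return np.mean(np.logaddexp(0.0, z) - y * z)

def frank_wolfe_lasso_logreg(X, y, lam=50.0, T=100, eps=None, rng=None):
    """Frank-Wolfe for logistic regression under ||w||_1 <= lam.

    If eps is given, the coordinate is chosen with a simplified
    exponential mechanism (illustrative only; no sensitivity scaling).
    """
    n, d = X.shape
    w = np.zeros(d)
    rng = np.random.default_rng() if rng is None else rng
    for t in range(T):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))   # sigmoid predictions
        g = X.T @ (p - y) / n                # gradient of the mean loss
        scores = np.abs(g)
        if eps is None:
            j = int(np.argmax(scores))       # non-private linear oracle
        else:
            # exponential mechanism over coordinates (simplified)
            logits = eps * scores / 2.0
            logits -= logits.max()
            probs = np.exp(logits) / np.exp(logits).sum()
            j = int(rng.choice(d, p=probs))
        gamma = 2.0 / (t + 2.0)              # standard FW step size
        # convex combination with vertex s = -lam * sign(g_j) * e_j
        w *= (1.0 - gamma)
        w[j] += gamma * (-lam * np.sign(g[j]))
    return w
```

Because each update touches only one coordinate of `w`, the iterate stays inside the L1 ball of radius λ by construction; the paper's contribution is making the per-iteration selection step fast on sparse, high-dimensional data rather than changing this basic scheme.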