Almost Linear Constant-Factor Sketching for $\ell_1$ and Logistic Regression
Authors: Alexander Munteanu, Simon Omlor, David Woodruff
ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4 EXPERIMENTS We implemented our new sketching algorithm into the framework of Munteanu et al. (2021)5. Pseudocode can be found in Appendix G. The crucial difference is that at level 0 of our sketch, each element gets mapped to multiple buckets instead of only one. ... Real-world benchmark data was downloaded automatically by the Python scripts: the Covertype data consists of 581, 012 cartographic observations of different forests with 54 features. ... We created a synthetic data set that has multiple heavy hitters for the sake of showing the benefits of the new sketch. ... We see in Figure 1 (bottom left) that the increase in the number of buckets each elements is hashed to improves from approximation ratios between 7 and 8 for the old sketch to 3 or even 2 approximations for the modified sketches. |
| Researcher Affiliation | Academia | Alexander Munteanu & Simon Omlor Faculty of Statistics TU Dortmund University 44227 Dortmund, Germany alexander.munteanu@tu-dortmund.de simon.omlor@tu-dortmund.de David P. Woodruff Department of Computer Science Carnegie Mellon University Pittsburgh, PA 15213, USA dwoodruf@cs.cmu.edu |
| Pseudocode | Yes | Pseudocode can be found in Appendix G. ... Algorithm 1 implements the first step of the sketch & solve paradigm for approximating logistic or ℓ1 regression. |
| Open Source Code | Yes | We implemented our new sketching algorithm into the framework of Munteanu et al. (2021)5. Pseudocode can be found in Appendix G. The crucial difference is that at level 0 of our sketch, each element gets mapped to multiple buckets instead of only one. 5code available at https://github.com/Tim907/oblivious_sketching_varreglogreg |
| Open Datasets | Yes | Real-world benchmark data was downloaded automatically by the Python scripts: the Covertype data consists of 581, 012 cartographic observations of different forests with 54 features. The Webspam data consists of 350, 000 unigrams with 127 features from web pages. The Kddcup data consists of 494, 021 network connections with 41 features. |
| Dataset Splits | No | The paper mentions using "real-world benchmark data" (Covertype, Webspam, Kddcup) and a "synthetic data set" but does not specify any training, validation, or test dataset splits (e.g., 80/10/10 split, or exact sample counts for each split). |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments (e.g., CPU/GPU models, memory specifications, or cloud resources). |
| Software Dependencies | No | The paper mentions implementation within "the framework of Munteanu et al. (2021)" and the use of "Python scripts" for data download, but it does not specify any software names with version numbers (e.g., Python 3.x, PyTorch 1.x, or specific framework versions) necessary for replication. |
| Experiment Setup | Yes | We implemented our new sketching algorithm into the framework of Munteanu et al. (2021)5. Pseudocode can be found in Appendix G. The crucial difference is that at level 0 of our sketch, each element gets mapped to multiple buckets instead of only one. Sketch (old) denotes the sketch used in (Munteanu et al., 2021) and is highlighted in red in the plots. Sketchs is the sketch where each entry is mapped to s {2, 5, 10} buckets at level 0. Each sketch was run with 40 repetitions for various target sizes. ... We see in Figure 1... for the regularization parameter λ {0, .1}... |