Being Robust (in High Dimensions) Can Be Practical
Authors: Ilias Diakonikolas, Gautam Kamath, Daniel M. Kane, Jerry Li, Ankur Moitra, Alistair Stewart
ICML 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We performed an empirical evaluation of the above algorithms on synthetic and real data sets with and without synthetic noise. |
| Researcher Affiliation | Academia | University of Southern California, Los Angeles, California, USA; Massachusetts Institute of Technology, Cambridge, Massachusetts, USA; University of California, San Diego, La Jolla, California, USA. |
| Pseudocode | Yes (a sketch of this template appears below the table) | Procedure 1: Filter-based algorithm template for robust mean estimation |
| Open Source Code | No | The paper mentions obtaining the implementations of LRVMean and LRVCov from their respective GitHub repositories, but it does not provide access to the authors' own implementation of the methodology described in this paper. |
| Open Datasets | Yes | To demonstrate the efficacy of our method on real data, we revisit the famous study of Novembre et al. (2008). In this study, the authors investigated data collected as part of the POPRES project. ... While the original dataset is very high dimensional, we use a 20-dimensional version of the dataset as found in the authors' GitHub (Footnote 4: https://github.com/NovembreLab/Novembre_etal_2008_misc). |
| Dataset Splits | No | The paper mentions generating samples and adding noise but does not provide specific dataset split information (e.g., percentages or counts for training, validation, or test sets) needed to reproduce the data partitioning. |
| Hardware Specification | Yes | All experiments were done on a laptop computer with a 2.7 GHz Intel Core i5 CPU and 8 GB of RAM. |
| Software Dependencies | No | The paper mentions using implementations from third-party Github repositories but does not provide specific ancillary software details, such as library names with version numbers, for its own experiments. |
| Experiment Setup | Yes (see the sketch below the table) | In the synthetic mean experiment, we set ε = 0.1, and for dimension d ∈ {100, 150, …, 400}, we generate n = 10d/ε² samples, where a (1 − ε)-fraction come from N(µ, I), and an ε-fraction come from a noise distribution. ... We introduced a heuristic we call adaptive tail bounding. Our goal is to find a choice of C2 which throws away roughly an ε-fraction of points. The heuristic is fairly simple: we start with some initial guess for C2. We then run our filter with this C2. If we throw away too many data points, we increase our C2 and retry. If we throw away too few, we decrease our C2 and retry. |
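
The filter template referenced in the Pseudocode row can be summarized in a short sketch. The following Python is a minimal illustration, assuming inliers drawn from N(µ, I); the constants `c1` and `c2`, the stopping condition, and the return convention are illustrative choices, not the paper's exact Procedure 1:

```python
import numpy as np

def filter_mean(X, eps, c1=10.0, c2=4.0, max_iter=100):
    """Illustrative filter-based robust mean estimator (a sketch, not the
    paper's exact Procedure 1). Inliers are assumed ~ N(mu, I); an
    eps-fraction of rows may be arbitrary outliers.

    Returns the estimated mean and the surviving subset of X.
    """
    X = np.asarray(X, dtype=float)
    for _ in range(max_iter):
        if len(X) < 2:
            break
        mu = X.mean(axis=0)
        Sigma = np.cov(X, rowvar=False)
        # The top eigenvector is the direction where corruptions are most visible.
        eigvals, eigvecs = np.linalg.eigh(Sigma)
        lam, v = eigvals[-1], eigvecs[:, -1]
        if lam <= 1.0 + c1 * eps:
            # No direction has abnormally large variance: trust the mean.
            return mu, X
        # Otherwise, filter out points that deviate too far along v.
        dev = np.abs((X - mu) @ v)
        keep = dev <= c2 * np.sqrt(lam)
        if keep.all():
            return mu, X  # threshold removed nothing; stop to avoid looping
        X = X[keep]
    return X.mean(axis=0), X
```

The intuition behind the template is that a small fraction of outliers can only shift the mean substantially by creating a direction of abnormally large variance, so filtering along that direction removes more outliers than inliers on each pass.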
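The Experiment Setup row quotes both the synthetic data generation and the adaptive tail bounding heuristic. The sketch below wires the two together, reusing `filter_mean` from the previous block; the multiplicative update factor for C2, the slack band around ε, and the shifted-Gaussian noise model are assumptions, since the excerpt does not specify them:

```python
def adaptive_tail_bounding(X, eps, c2=4.0, slack=0.5, max_rounds=20):
    """Sketch of the adaptive tail bounding heuristic: rerun the filter,
    adjusting C2 until roughly an eps-fraction of points is discarded.
    The 1.5x update factor and the slack band are assumptions."""
    n = len(X)
    mu = X.mean(axis=0)
    for _ in range(max_rounds):
        mu, kept = filter_mean(X, eps, c2=c2)
        removed = 1.0 - len(kept) / n
        if removed > (1 + slack) * eps:
            c2 *= 1.5   # threw away too many points: loosen the tail bound
        elif removed < (1 - slack) * eps:
            c2 /= 1.5   # threw away too few points: tighten the tail bound
        else:
            break       # discarded roughly an eps-fraction: accept
    return mu

# Synthetic mean experiment in the spirit of the quoted setup: n = 10d/eps^2
# samples, a (1 - eps)-fraction from N(mu, I) and an eps-fraction from noise.
# The shifted-Gaussian noise cluster is a stand-in for the paper's noise model.
rng = np.random.default_rng(0)
d, eps = 100, 0.1
n = int(10 * d / eps**2)
n_bad = int(eps * n)
inliers = rng.standard_normal((n - n_bad, d))      # true mean is zero
outliers = rng.standard_normal((n_bad, d)) + 2.0   # shifted noise cluster
X = np.vstack([inliers, outliers])

mu_hat = adaptive_tail_bounding(X, eps)
print("L2 error of robust estimate:", np.linalg.norm(mu_hat))
print("L2 error of naive sample mean:", np.linalg.norm(X.mean(axis=0)))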