Being Robust (in High Dimensions) Can Be Practical

Authors: Ilias Diakonikolas, Gautam Kamath, Daniel M. Kane, Jerry Li, Ankur Moitra, Alistair Stewart

ICML 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We performed an empirical evaluation of the above algorithms on synthetic and real data sets with and without synthetic noise."
Researcher Affiliation | Academia | "(1) University of Southern California, Los Angeles, California, USA; (2) Massachusetts Institute of Technology, Cambridge, Massachusetts, USA; (3) University of California, San Diego, La Jolla, California, USA."
Pseudocode | Yes | "Procedure 1: Filter-based algorithm template for robust mean estimation" (a hedged sketch of such a filter appears after this table).
Open Source Code | No | The paper mentions obtaining the implementations of LRVMean and LRVCov from their respective GitHub repositories, but it does not provide access to source code for the methods developed by the authors themselves.
Open Datasets | Yes | "To demonstrate the efficacy of our method on real data, we revisit the famous study of Novembre et al. (2008). In this study, the authors investigated data collected as part of the POPRES project. ... While the original dataset is very high dimensional, we use a 20-dimensional version of the dataset as found in the authors' GitHub." Footnote 4: https://github.com/NovembreLab/Novembre_etal_2008_misc
Dataset Splits | No | The paper mentions generating samples and adding noise but does not provide specific dataset split information (e.g., percentages or counts for training, validation, or test sets) needed to reproduce the data partitioning.
Hardware Specification | Yes | "All experiments were done on a laptop computer with a 2.7 GHz Intel Core i5 CPU and 8 GB of RAM."
Software Dependencies | No | The paper mentions using implementations from third-party GitHub repositories but does not provide specific ancillary software details, such as library names with version numbers, for its own experiments.
Experiment Setup | Yes | "In the synthetic mean experiment, we set ε = 0.1, and for dimensions d = 100, 150, ..., 400, we generate n = 10d/ε² samples, where a (1 - ε)-fraction come from N(µ, I), and an ε-fraction come from a noise distribution. ... We introduced a heuristic we call adaptive tail bounding. Our goal is to find a choice of C2 which throws away roughly an ε-fraction of points. The heuristic is fairly simple: we start with some initial guess for C2. We then run our filter with this C2. If we throw away too many data points, we increase our C2 and retry. If we throw away too few, then we decrease our C2 and retry." (Illustrative sketches of this setup and of the adaptive tail bounding heuristic follow the table.)
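
The Pseudocode row above refers to the paper's Procedure 1, a filter-based template for robust mean estimation. As a rough companion to that row, here is a minimal Python sketch of one such filter; the spectral stopping rule and the quantile-based tail cut are simplifying assumptions rather than the paper's exact tail-bound test, and `threshold_const` is a hypothetical tuning constant.

```python
import numpy as np

def filter_mean(X, eps, threshold_const=9.0, max_iter=50):
    """Minimal filter-based template for robust mean estimation.

    X is an (n, d) array in which a (1 - eps)-fraction of rows are drawn
    from N(mu, I) and the rest are arbitrary. Each round inspects the top
    eigenvector of the empirical covariance; if some direction has variance
    well above 1, the extreme tail along that direction is removed.
    """
    X = np.asarray(X, dtype=float)
    for _ in range(max_iter):
        mu_hat = X.mean(axis=0)
        centered = X - mu_hat
        cov = centered.T @ centered / len(X)
        eigvals, eigvecs = np.linalg.eigh(cov)  # ascending eigenvalues
        top_val, top_vec = eigvals[-1], eigvecs[:, -1]
        # Stopping rule: no direction is much more spread out than under
        # the identity covariance, so the empirical mean is trustworthy.
        if top_val <= 1 + threshold_const * eps * np.log(1 / eps):
            return mu_hat
        # Score points by squared deviation along the worst direction and
        # drop the most extreme tail, which must contain mostly outliers.
        scores = (centered @ top_vec) ** 2
        cutoff = np.quantile(scores, 1 - eps / 2)  # heuristic tail cut
        X = X[scores <= cutoff]
    return X.mean(axis=0)
```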
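
The first part of the Experiment Setup quote describes the synthetic data. Below is a minimal sketch of that generation step, assuming (hypothetically) that the true mean is the origin and using a far-away point mass as a stand-in noise distribution; the paper's actual noise models are not reproduced here.

```python
import numpy as np

def generate_synthetic_mean_data(d, eps=0.1, seed=0):
    """Draw n = 10 * d / eps**2 samples: a (1 - eps)-fraction from N(mu, I)
    and an eps-fraction from a noise distribution, as in the quoted setup."""
    rng = np.random.default_rng(seed)
    n = int(10 * d / eps ** 2)
    n_noise = int(round(eps * n))
    mu = np.zeros(d)  # assumption: the true mean is the origin
    inliers = mu + rng.standard_normal((n - n_noise, d))
    # Placeholder noise model: a point mass far from mu. Any adversarial
    # noise distribution can be substituted here.
    outliers = np.full((n_noise, d), 5.0)
    X = rng.permutation(np.vstack([inliers, outliers]))
    return X, mu
```

For d = 100 and ε = 0.1 this yields n = 10d/ε² = 100,000 samples, 10,000 of which come from the noise distribution.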
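
The second part of the quote describes the adaptive tail bounding heuristic. Here is a sketch of that retry loop, where `run_filter(X, c2)` is a hypothetical callback performing one filtering pass with tail-bound constant C2, and the update factors and tolerance are assumptions chosen for illustration.

```python
def adaptive_tail_bounding(X, eps, run_filter, c2=2.0,
                           grow=1.5, shrink=0.75, tol=0.5, max_tries=20):
    """Retry loop matching the quoted heuristic: search for a tail-bound
    constant C2 so the filter discards roughly an eps-fraction of points.

    run_filter(X, c2) is assumed to return the subset of X that survives
    one filtering pass with constant c2; grow/shrink/tol are illustrative.
    """
    n = len(X)
    kept = X
    for _ in range(max_tries):
        kept = run_filter(X, c2)
        removed_frac = 1 - len(kept) / n
        if removed_frac > (1 + tol) * eps:
            c2 *= grow    # threw away too many points: loosen the bound
        elif removed_frac < (1 - tol) * eps:
            c2 *= shrink  # threw away too few points: tighten the bound
        else:
            break         # removed roughly an eps-fraction, as desired
    return c2, kept
```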