Being Robust (in High Dimensions) Can Be Practical

Authors: Ilias Diakonikolas, Gautam Kamath, Daniel M. Kane, Jerry Li, Ankur Moitra, Alistair Stewart

ICML 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We performed an empirical evaluation of the above algorithms on synthetic and real data sets with and without synthetic noise."
Researcher Affiliation | Academia | "(1) University of Southern California, Los Angeles, California, USA; (2) Massachusetts Institute of Technology, Cambridge, Massachusetts, USA; (3) University of California, San Diego, La Jolla, California, USA."
Pseudocode | Yes | "Procedure 1: Filter-based algorithm template for robust mean estimation" (a hedged sketch of such a filter appears after this table).
Open Source Code | No | The paper mentions obtaining the implementations of LRVMean and LRVCov from their respective GitHub repositories, but it does not provide access to source code for the methods developed by the authors themselves.
Open Datasets | Yes | "To demonstrate the efficacy of our method on real data, we revisit the famous study of Novembre et al. (2008). In this study, the authors investigated data collected as part of the POPRES project. ... While the original dataset is very high dimensional, we use a 20-dimensional version of the dataset as found in the authors' GitHub." Footnote 4: https://github.com/NovembreLab/Novembre_etal_2008_misc
Dataset Splits | No | The paper mentions generating samples and adding noise but does not provide specific dataset split information (e.g., percentages or counts for training, validation, or test sets) needed to reproduce the data partitioning.
Hardware Specification | Yes | "All experiments were done on a laptop computer with a 2.7 GHz Intel Core i5 CPU and 8 GB of RAM."
Software Dependencies | No | The paper mentions using implementations from third-party GitHub repositories but does not provide specific ancillary software details, such as library names with version numbers, for its own experiments.
Experiment Setup | Yes | "In the synthetic mean experiment, we set ε = 0.1, and for dimensions d = 100, 150, ..., 400, we generate n = 10d/ε² samples, where a (1 - ε)-fraction come from N(µ, I), and an ε-fraction come from a noise distribution. ... We introduced a heuristic we call adaptive tail bounding. Our goal is to find a choice of C2 which throws away roughly an ε-fraction of points. The heuristic is fairly simple: we start with some initial guess for C2. We then run our filter with this C2. If we throw away too many data points, we increase our C2 and retry. If we throw away too few, then we decrease our C2 and retry." (Illustrative sketches of this setup and of the adaptive tail bounding heuristic follow the table.)
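
The Pseudocode row above refers to the paper's Procedure 1, a filter-based template for robust mean estimation. As a rough companion to that row, here is a minimal Python sketch of one such filter; the spectral stopping rule and the quantile-based tail cut are simplifying assumptions rather than the paper's exact tail-bound test, and `threshold_const` is a hypothetical tuning constant.

```python
import numpy as np

def filter_mean(X, eps, threshold_const=9.0, max_iter=50):
    """Minimal filter-based template for robust mean estimation.

    X is an (n, d) array in which a (1 - eps)-fraction of rows are drawn
    from N(mu, I) and the rest are arbitrary. Each round inspects the top
    eigenvector of the empirical covariance; if some direction has variance
    well above 1, the extreme tail along that direction is removed.
    """
    X = np.asarray(X, dtype=float)
    for _ in range(max_iter):
        mu_hat = X.mean(axis=0)
        centered = X - mu_hat
        cov = centered.T @ centered / len(X)
        eigvals, eigvecs = np.linalg.eigh(cov)  # ascending eigenvalues
        top_val, top_vec = eigvals[-1], eigvecs[:, -1]
        # Stopping rule: no direction is much more spread out than under
        # the identity covariance, so the empirical mean is trustworthy.
        if top_val <= 1 + threshold_const * eps * np.log(1 / eps):
            return mu_hat
        # Score points by squared deviation along the worst direction and
        # drop the most extreme tail, which must contain mostly outliers.
        scores = (centered @ top_vec) ** 2
        cutoff = np.quantile(scores, 1 - eps / 2)  # heuristic tail cut
        X = X[scores <= cutoff]
    return X.mean(axis=0)
```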
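
The first part of the Experiment Setup quote describes the synthetic data. Below is a minimal sketch of that generation step, assuming (hypothetically) that the true mean is the origin and using a far-away point mass as a stand-in noise distribution; the paper's actual noise models are not reproduced here.

```python
import numpy as np

def generate_synthetic_mean_data(d, eps=0.1, seed=0):
    """Draw n = 10 * d / eps**2 samples: a (1 - eps)-fraction from N(mu, I)
    and an eps-fraction from a noise distribution, as in the quoted setup."""
    rng = np.random.default_rng(seed)
    n = int(10 * d / eps ** 2)
    n_noise = int(round(eps * n))
    mu = np.zeros(d)  # assumption: the true mean is the origin
    inliers = mu + rng.standard_normal((n - n_noise, d))
    # Placeholder noise model: a point mass far from mu. Any adversarial
    # noise distribution can be substituted here.
    outliers = np.full((n_noise, d), 5.0)
    X = rng.permutation(np.vstack([inliers, outliers]))
    return X, mu
```

For d = 100 and ε = 0.1 this yields n = 10d/ε² = 100,000 samples, 10,000 of which come from the noise distribution.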
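
The second part of the quote describes the adaptive tail bounding heuristic. Here is a sketch of that retry loop, where `run_filter(X, c2)` is a hypothetical callback performing one filtering pass with tail-bound constant C2, and the update factors and tolerance are assumptions chosen for illustration.

```python
def adaptive_tail_bounding(X, eps, run_filter, c2=2.0,
                           grow=1.5, shrink=0.75, tol=0.5, max_tries=20):
    """Retry loop matching the quoted heuristic: search for a tail-bound
    constant C2 so the filter discards roughly an eps-fraction of points.

    run_filter(X, c2) is assumed to return the subset of X that survives
    one filtering pass with constant c2; grow/shrink/tol are illustrative.
    """
    n = len(X)
    kept = X
    for _ in range(max_tries):
        kept = run_filter(X, c2)
        removed_frac = 1 - len(kept) / n
        if removed_frac > (1 + tol) * eps:
            c2 *= grow    # threw away too many points: loosen the bound
        elif removed_frac < (1 - tol) * eps:
            c2 *= shrink  # threw away too few points: tighten the bound
        else:
            break         # removed roughly an eps-fraction, as desired
    return c2, kept
```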