
Training Data Subset Selection for Regression with Controlled Generalization Error

Authors: Durga S, Rishabh Iyer, Ganesh Ramakrishnan, Abir De

ICML 2021

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In this section, we present experimental results and analysis on several real-world datasets to evaluate the performance of SELCON against several competitive baselines.
Researcher Affiliation	Academia	(1) CSE Department, Indian Institute of Technology, Bombay; (2) CS Department, University of Texas at Dallas.
Pseudocode	Yes	Algorithm 1 SELCON Algorithm
Open Source Code	Yes	Our code and data is available at https://github.com/abir-de/SELCON
Open Datasets	Yes	We experiment with five real world datasets, viz., Cadata (16718 instances), Law (20800 instances), NYSE-High (701348 instances), NYSE-Close (701348 instances), and Community-and-crime (1994 instances), all briefly described in Appendix D. ... Cadata (Pace & Barry, 1997): This dataset is available in scikit-learn. ... Law (Wightman, 1998)... NYSE (https://github.com/marefaand/stock_market_data): ... Community and Crime: ... is available at UCI ML repository.
Dataset Splits	Yes	In each experiment, we used (random) 89% training, 1% validation and 10% test folds.
Hardware Specification	No	The paper mentions 'GPUs, multicore processors, high storage disks' in a general context, but does not provide specific hardware details like exact GPU/CPU models or memory used for their experiments.
Software Dependencies	No	The paper mentions 'pytorch' but does not specify its version number, nor does it list other software dependencies with their versions.
Experiment Setup	Yes	Specifically, we set N = 2000 for Cadata and Law, N = 5000 for the NYSE datasets; and, b = min{|S|, 1000} across all datasets. Additionally, SELCON involves two more sets of small scale optimization problems (lines 3 and 8 respectively), where we set the number of epochs as 3.
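
The Dataset Splits and Experiment Setup rows can be illustrated with a minimal sketch of a random 89% / 1% / 10% split and the batch-size rule b = min{|S|, 1000}. This is not the authors' code: the function names `split_indices` and `batch_size`, the `seed` parameter, and the choice to round fold sizes to the nearest integer (with the test fold taking the remainder) are all assumptions made for illustration.

```python
import random


def split_indices(n, train_frac=0.89, val_frac=0.01, seed=0):
    """Return disjoint random (train, val, test) index lists over range(n).

    Hypothetical helper: fold sizes are rounded to the nearest integer,
    and the test fold receives whatever indices remain.
    """
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # seeded shuffle for repeatability
    n_train = round(train_frac * n)
    n_val = round(val_frac * n)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


def batch_size(subset_size, cap=1000):
    """Batch size b = min{|S|, cap}, with cap = 1000 as quoted in the table."""
    return min(subset_size, cap)


# Example: the Community-and-crime dataset size quoted above (1994 instances).
train, val, test = split_indices(1994)
print(len(train), len(val), len(test))
print(batch_size(len(train)))
```

Fixing the seed makes the split repeatable across runs; the summarized text does not say whether the paper fixes a seed, so treat that as a convenience of this sketch only.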