Training Data Subset Selection for Regression with Controlled Generalization Error

Authors: Durga S, Rishabh Iyer, Ganesh Ramakrishnan, Abir De

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we present experimental results and analysis on several real-world datasets to evaluate the performance of SELCON against several competitive baselines.
Researcher Affiliation | Academia | 1CSE Department, Indian Institute of Technology, Bombay 2CS Department, University of Texas at Dallas.
Pseudocode | Yes | Algorithm 1 SELCON Algorithm
Open Source Code | Yes | Our code and data is available at https://github.com/abir-de/SELCON
Open Datasets | Yes | We experiment with five real world datasets, viz., Cadata (16718 instances), Law (20800 instances), NYSE-High (701348 instances), NYSE-Close (701348 instances), and Community-and-crime (1994 instances), all briefly described in Appendix D. ... Cadata (Pace & Barry, 1997): This dataset is available in scikit-learn. ... Law (Wightman, 1998)... NYSE (https://github.com/marefaand/stock_market_data): ... Community and Crime: ... is available at UCI ML repository.
Dataset Splits | Yes | In each experiment, we used (random) 89% training, 1% validation and 10% test folds.
Hardware Specification | No | The paper mentions 'GPUs, multicore processors, high storage disks' in a general context, but does not provide specific hardware details like exact GPU/CPU models or memory used for their experiments.
Software Dependencies | No | The paper mentions 'pytorch' but does not specify its version number, nor does it list other software dependencies with their versions.
Experiment Setup | Yes | Specifically, we set N = 2000 for Cadata and Law, N = 5000 for the NYSE datasets; and, b = min{|S|, 1000} across all datasets. Additionally, SELCON involves two more sets of small scale optimization problems (lines 3 and 8 respectively), where we set the number of epochs as 3.
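
The Dataset Splits and Experiment Setup entries above describe a random 89% / 1% / 10% train/validation/test split and a batch size b = min{|S|, 1000}. A minimal sketch of both follows, for readers who want to reproduce the proportions. It is not taken from the SELCON repository: the function names `split_indices` and `batch_size`, the fixed seed, and the rounding rule are assumptions made for illustration, since the quoted text does not say how the split was implemented.

```python
import random


def split_indices(n, train_frac=0.89, val_frac=0.01, seed=0):
    """Shuffle range(n) and cut it into train / validation / test index lists.

    Hypothetical helper: the seed and the rounding of fold sizes are
    assumptions, not details reported in the paper.
    """
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # seeded shuffle, reproducible
    n_train = round(train_frac * n)
    n_val = round(val_frac * n)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever remains, roughly 10%
    return train, val, test


def batch_size(subset_size, cap=1000):
    """b = min{|S|, 1000}, as quoted in the Experiment Setup entry."""
    return min(subset_size, cap)


# 1994 is the Community-and-crime instance count quoted in the table.
train, val, test = split_indices(1994)
```

Giving the test fold the remainder, rather than rounding it separately, guarantees that the three folds cover every instance exactly once.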