Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Training Data Subset Selection for Regression with Controlled Generalization Error
Authors: Durga S, Rishabh Iyer, Ganesh Ramakrishnan, Abir De
ICML 2021 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we present experimental results and analysis on several real-world datasets to evaluate the performance of SELCON against several competitive baselines. |
| Researcher Affiliation | Academia | ¹CSE Department, Indian Institute of Technology, Bombay; ²CS Department, University of Texas at Dallas. |
| Pseudocode | Yes | Algorithm 1 SELCON Algorithm |
| Open Source Code | Yes | Our code and data is available at https://github.com/abir-de/SELCON |
| Open Datasets | Yes | We experiment with five real world datasets, viz., Cadata (16718 instances), Law (20800 instances), NYSE-High (701348 instances), NYSE-Close (701348 instances), and Community-and-crime (1994 instances), all briefly described in Appendix D. ... Cadata (Pace & Barry, 1997): This dataset is available in scikit-learn. ... Law (Wightman, 1998)... NYSE (https://github.com/marefaand/stock_market_data): ... Community and Crime: ... is available at UCI ML repository. |
| Dataset Splits | Yes | In each experiment, we used (random) 89% training, 1% validation and 10% test folds. |
| Hardware Specification | No | The paper mentions 'GPUs, multicore processors, high storage disks' in a general context, but does not provide specific hardware details like exact GPU/CPU models or memory used for their experiments. |
| Software Dependencies | No | The paper mentions 'pytorch' but does not specify its version number, nor does it list other software dependencies with their versions. |
| Experiment Setup | Yes | Specifically, we set N = 2000 for Cadata and Law, N = 5000 for the NYSE datasets; and, b = min{\|S\|, 1000} across all datasets. Additionally, SELCON involves two more sets of small scale optimization problems (lines 3 and 8 respectively), where we set the number of epochs as 3. |
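
The split and batch-size settings quoted in the table can be illustrated with a minimal sketch. It assumes a plain shuffle-and-slice split over row indices; the function names `split_indices` and `batch_size`, the rounding rule, and the `seed` parameter are hypothetical and are not taken from the paper or the SELCON repository.

```python
import random


def split_indices(n, train_frac=0.89, val_frac=0.01, seed=0):
    """Randomly partition range(n) into train/validation/test index lists.

    The remainder after the train and validation slices becomes the test
    fold (about 10% with the default fractions). The rounding rule is an
    assumption; the paper only states the 89/1/10 proportions.
    """
    rng = random.Random(seed)  # hypothetical seed, fixed for repeatability
    indices = list(range(n))
    rng.shuffle(indices)
    n_train = int(round(train_frac * n))
    n_val = int(round(val_frac * n))
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


def batch_size(subset_size, cap=1000):
    """Batch size rule b = min{|S|, 1000} quoted in the Experiment Setup row."""
    return min(subset_size, cap)


# Example with the Community-and-crime dataset size (1994 instances).
train, val, test = split_indices(1994)
```

Because the test fold is whatever remains after slicing, the three folds always cover every index exactly once, whatever rounding is applied to the first two.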