Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Approximate Submodularity and its Applications: Subset Selection, Sparse Approximation and Dictionary Selection
Authors: Abhimanyu Das, David Kempe
JMLR 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our theoretical analysis is complemented by experiments on real-world and synthetic data sets; in particular, we focus on an analysis of how tight various spectral parameters and the submodularity ratio are in terms of predicting the performance of the greedy algorithms. [...] Our theoretical analysis is complemented by experiments comparing the performance of the greedy algorithms and a baseline convex-relaxation algorithm for subset selection on two real-world data sets and a synthetic data set. |
| Researcher Affiliation | Collaboration | Abhimanyu Das (Google); David Kempe (Department of Computer Science, University of Southern California) |
| Pseudocode | Yes | Algorithm 1 The Nemhauser Greedy Algorithm for a non-negative, monotone, and submodular set function f on a universe X. Algorithm 2 The Forward Regression (also called Forward Selection) algorithm. Algorithm 3 The Orthogonal Matching Pursuit algorithm. Algorithm 4 The oblivious algorithm. Algorithm 5 The SDSMA algorithm for dictionary selection. Algorithm 6 The SDSOMP algorithm for dictionary selection. |
| Open Source Code | No | The paper mentions 'l1_ls: Simple Matlab Solver for l1-regularized Least Squares Problems' (Koh et al., 2008) as a tool used in the experiments, but does not provide source code specifically developed by the authors for the methodology described in this paper. |
| Open Datasets | Yes | Our data sets are the Boston Housing Data, a data set of World Bank Development Indicators, and a synthetic data set generated from a distribution similar to the one used by Zhang (2008). The Boston Housing Data (available from the UCI Machine Learning Repository) is a small data set frequently used to evaluate ML algorithms. The World Bank Data (available from http://databank.worldbank.org) contains an extensive list of socio-economic and health indicators of development, for many countries and over several years. |
| Dataset Splits | No | Each data set contains m > n samples, from which we compute the empirical covariance matrix (analogous to the Gram matrix in sparse approximation) between all observation variables and the predictor variable; we then normalize it to obtain C and b. We evaluate the performance of all algorithms in terms of their R2 fit; thus, we implicitly treat C and b as the ground truth, and also do not separate the data sets into training and test cases. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory configurations used for running the experiments. It only mentions running 'experiments' or 'algorithms' in a general sense. |
| Software Dependencies | No | The paper mentions using the 'Lasso (L1) algorithm (in the implementation of Koh et al. (2008))' and references 'l1_ls: Simple Matlab Solver for l1-regularized Least Squares Problems'. While it identifies a specific solver, it does not provide a version number for 'l1_ls' or 'Matlab', which is required for reproducibility. |
| Experiment Setup | No | The paper describes the algorithms used (Forward Regression, OMP, Lasso, Oblivious) and the range of 'k' values tested (from 2 through 8 for subset selection). However, it does not provide specific hyperparameters or system-level training settings for these algorithms, such as learning rates, regularization strengths for Lasso, or detailed convergence criteria. |
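The paper's Forward Regression procedure (Algorithm 2 above) greedily adds the variable that most improves the R² fit, computed directly from the normalized covariance matrix C and covariance vector b that the experiments treat as ground truth. A minimal sketch of that loop follows; it is an illustrative reconstruction, not the authors' code, and the 3-variable C and b are toy values invented for demonstration.

```python
# Forward Regression (greedy subset selection) driven by R^2 fit:
# R^2(S) = b_S^T C_S^{-1} b_S, where C is the normalized covariance
# matrix among observation variables and b the covariances with the
# predictor. Pure Python; C and b below are toy demonstration values.

def solve(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def r_squared(C, b, S):
    """R^2 of regressing the predictor on the variables indexed by S."""
    if not S:
        return 0.0
    C_S = [[C[i][j] for j in S] for i in S]
    b_S = [b[i] for i in S]
    coeffs = solve(C_S, b_S)
    return sum(bi * ci for bi, ci in zip(b_S, coeffs))

def forward_regression(C, b, k):
    """Greedily pick k variables, each step maximizing the R^2 gain."""
    S = []
    for _ in range(k):
        rest = [i for i in range(len(b)) if i not in S]
        S.append(max(rest, key=lambda i: r_squared(C, b, S + [i])))
    return S

# Toy normalized covariance data (3 observation variables).
C = [[1.0, 0.5, 0.0],
     [0.5, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
b = [0.8, 0.5, 0.3]

S = forward_regression(C, b, k=2)
print(S, r_squared(C, b, S))  # picks variables 0 and 2, R^2 close to 0.73
```

Note how this matches the report's observation that no train/test split is needed: the algorithm and its R² evaluation both operate on the same C and b.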
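The Orthogonal Matching Pursuit routine (Algorithm 3 in the paper) instead selects, at each step, the variable whose covariance with the current residual is largest in absolute value, then refits least-squares coefficients on the enlarged set. The sketch below uses the same C, b formulation; it is a hedged illustration with toy data, not the authors' implementation, and the tiny solver is hard-coded for systems of size at most 2.

```python
# Orthogonal Matching Pursuit (OMP) sketch for subset selection:
# pick the variable most correlated with the residual, then refit.
# Illustrative only; solve2 handles the n <= 2 systems this demo needs.

def solve2(A, rhs):
    """Solve an n x n linear system for n <= 2 via the explicit inverse."""
    if len(A) == 1:
        return [rhs[0] / A[0][0]]
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(rhs[0] * A[1][1] - A[0][1] * rhs[1]) / det,
            (A[0][0] * rhs[1] - rhs[0] * A[1][0]) / det]

def omp(C, b, k):
    S, coeffs = [], []
    for _ in range(k):
        # Covariance of each candidate variable with the current residual.
        def residual_corr(i):
            return b[i] - sum(C[i][j] * a for j, a in zip(S, coeffs))
        rest = [i for i in range(len(b)) if i not in S]
        S.append(max(rest, key=lambda i: abs(residual_corr(i))))
        # Refit least-squares coefficients on the selected set.
        C_S = [[C[i][j] for j in S] for i in S]
        coeffs = solve2(C_S, [b[i] for i in S])
    return S, coeffs

# Same toy data as would drive the greedy comparison: 3 variables.
C = [[1.0, 0.5, 0.0],
     [0.5, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
b = [0.8, 0.5, 0.3]

S, coeffs = omp(C, b, k=2)
print(S)  # on this toy instance OMP also selects variables 0 and 2
```

OMP is cheaper per step than Forward Regression (one residual-correlation pass rather than a full refit per candidate), which is why the paper analyzes both under the submodularity-ratio framework.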