Fast Multi-stage Submodular Maximization

Authors: Kai Wei, Rishabh Iyer, Jeff Bilmes

ICML 2014

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We complement our theory by empirically evaluating on several real-world problems, including data subset selection on millions of speech samples where MULTGREED yields at least a thousand times speedup and superior results over the state-of-the-art selection methods.
Researcher Affiliation | Academia | University of Washington, Seattle, WA 98195, USA
Pseudocode | Yes | Algorithm 1 MULTGREED
Open Source Code | No | The paper does not contain an explicit statement about releasing source code or a link to a code repository.
Open Datasets | Yes | All simulations are performed on the same data with size |V| = 20,000, formed by randomly sampling from a large speech recognition training corpus (the Fisher corpus). Each sample pair has a similarity score, and the graph-based submodular functions f_fac and f_sat are instantiated using the corresponding similarity matrix. A set of features F sized |F| ≈ 75,000 is derived from the same data to instantiate f_fea. In all runs of MULTGREED, we set {β_i}_{i=1}^ℓ using the schedule β_i = c + (i − 1)(1 − c)/ℓ with c = 0.5. Performance of MULTGREED and LAZYGREED is measured by the function valuations and the wall-clock running time. ... We subselect 1300 hours of conversational English telephone data from the Switchboard, Switchboard Cellular, and Fisher corpora, which, in total, comprises 1,322,108 segments of speech (i.e., |V| = n = 1,322,018). ... (King et al., 2005; Lin and Bilmes, 2009; Wei et al., 2013)
Dataset Splits | No | The paper does not explicitly state the use of a validation set or provide specific details on train/validation/test splits.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running its experiments.
Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers.
Experiment Setup | Yes | In all runs of MULTGREED, we set {β_i}_{i=1}^ℓ using the schedule β_i = c + (i − 1)(1 − c)/ℓ with c = 0.5. ... For f_sat, the saturation ratio ξ is set as 0.25. Two stages using surrogate functions ˆf_mod and ˆf_sub are applied, under size constraints ℓ_1 = nξ/(5(1 − ξ) + ξ) = 0.05n and ℓ_2 = ℓ − ℓ_1. ... Next, for f_fea, we set g to be the square root function. Two stages of surrogates ˆf_mod and ˆf_sub are applied. ˆf_sub is defined on a randomly selected feature subset of size 37,500. We test with different combinations of size constraints ℓ_1 and ℓ_2 by setting ℓ_1 ∈ {0, 0.25ℓ, 0.5ℓ, 0.75ℓ} with ℓ_2 = ℓ − ℓ_1. ... We here test MULTGREED using the k-NNG-sparsified surrogate ˆf_fac, with the sparsity of the k-NNG set as k = 1,000. (The β schedule and budget splits quoted here are illustrated in the sketches below the table.)
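
To make the quoted hyperparameter settings concrete, here is a minimal Python sketch, not code from the paper: the helper names beta_schedule and two_stage_budgets, and the reading of ℓ_1 as a simple fraction of a total size, are our assumptions based only on the Experiment Setup row above.

# Illustrative sketch (not the authors' code): the quoted schedule
# beta_i = c + (i - 1)(1 - c)/num_stages with c = 0.5, plus a generic
# two-stage split of a size constraint into l1 and l2 = total - l1.
# In the quoted setup, l1 = 0.05n for f_sat and l1 in {0, 0.25, 0.5, 0.75} * l for f_fea.

def beta_schedule(num_stages: int, c: float = 0.5) -> list[float]:
    """beta_i = c + (i - 1)(1 - c)/num_stages for i = 1, ..., num_stages."""
    return [c + (i - 1) * (1 - c) / num_stages for i in range(1, num_stages + 1)]

def two_stage_budgets(total: int, first_fraction: float) -> tuple[int, int]:
    """Split a size constraint into a first-stage budget l1 and remainder l2."""
    l1 = int(round(first_fraction * total))
    return l1, total - l1

if __name__ == "__main__":
    print(beta_schedule(4))                 # [0.5, 0.625, 0.75, 0.875]
    print(two_stage_budgets(20_000, 0.05))  # l1 = 0.05 * n, as quoted for the f_sat setup

For example, beta_schedule(2) returns [0.5, 0.75], which would be the schedule for a two-stage run under this reading of the quoted formula.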
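
The quoted setup also references graph-based functions f_fac and f_sat instantiated from a similarity matrix, and a k-NNG-sparsified surrogate for f_fac with k = 1,000. As a rough illustration only, the sketch below assumes the standard facility-location form f_fac(S) = Σ_v max_{s∈S} w(v, s) and assumes the k-NNG surrogate simply drops all but each element's k most similar neighbors; both are our reading, not verified against the paper's exact definitions.

# Hedged sketch of a facility-location objective and a k-NN-sparsified variant.
# sim[v][s] is the similarity between ground-set element v and candidate s;
# knn_sim[v] is a dict keeping only v's k most similar elements (an assumption).

def facility_location(selected, sim):
    """f_fac(S): each element v is covered by its most similar selected item."""
    return sum(max((sim[v][s] for s in selected), default=0.0)
               for v in range(len(sim)))

def knn_facility_location(selected, knn_sim):
    """Sparsified variant: candidates outside v's k-NN list contribute 0 for v."""
    return sum(max((knn_sim[v].get(s, 0.0) for s in selected), default=0.0)
               for v in range(len(knn_sim)))

if __name__ == "__main__":
    sim = [[1.0, 0.2, 0.5],
           [0.2, 1.0, 0.3],
           [0.5, 0.3, 1.0]]
    print(facility_location({0, 2}, sim))  # 1.0 + 0.3 + 1.0 = 2.3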