Stratified Sampling Meets Machine Learning

Authors: Edo Liberty, Kevin Lang, Konstantin Shmakov

ICML 2016

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our extensive experimental results significantly improve over both uniform sampling and standard stratified sampling, which are the de facto industry standards. In this section we present an array of experimental results using our algorithm. We compare it to uniform sampling and stratified sampling. We also study the effects of varying the number of training examples and the strength of the regularization. This is done for both synthetic and real datasets.
Researcher Affiliation | Industry | Kevin Lang (LANGK@YAHOO-INC.COM), Yahoo Research; Edo Liberty (EDO@YAHOO-INC.COM), Yahoo Research; Konstantin Shmakov (KSHMAKOV@YAHOO-INC.COM), Yahoo Research
Pseudocode | Yes | Algorithm 1 (Train): regularized ERM algorithm. Algorithm 2 (Test): measure expected test error.
Open Source Code | No | The paper does not contain any statement about making its source code openly available, nor does it provide a link to a code repository.
Open Datasets | Yes | DBLP Dataset: this dataset uses a real database from DBLP and synthetic queries. Records correspond to 2,101,151 academic papers from the public DBLP database. From the publicly available DBLP XML file (http://dblp.uni-trier.de/xml/), the authors selected all papers from the 1000 most populous venues.
Dataset Splits | No | The paper specifies training and testing splits (e.g., "The 50,000 random queries were split into 40,000 for training and 10,000 for testing.") but does not mention a separate validation split.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory size, or cluster specifications) used for running the experiments.
Software Dependencies | No | The paper does not list any specific software dependencies or libraries with version numbers required to reproduce the experiments.
Experiment Setup | Yes | Our experiments focus exclusively on the relative error defined by L(ŷ, y) = (ŷ/y − 1)^2. As a practical shortcut, this is achievable without modifying Algorithm 1 at all: the only change needed is normalizing all training queries such that y = 1 before executing Algorithm 1. We also study the effects of varying the number of training examples and the strength of the regularization. The x-axis in Figure 2 varies with the value of the parameter η, which controls the strength of regularization.
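The normalization shortcut quoted above can be illustrated with a minimal sketch. The helper names and example values below are hypothetical, and the paper's Algorithm 1 (a regularized ERM solver) is not reproduced; the sketch only checks the identity that makes the shortcut work: the ordinary squared error on a query rescaled so that its target equals 1 coincides with the relative error L(ŷ, y) = (ŷ/y − 1)^2 on the original scale.

```python
# Hypothetical illustration of the y = 1 normalization shortcut.
# Squared loss on a query whose target and prediction are both divided
# by the true answer y equals the relative error on the original scale.

def relative_error(y_hat, y):
    """Relative error L(y_hat, y) = (y_hat / y - 1)^2 from the paper."""
    return (y_hat / y - 1.0) ** 2

def squared_error(y_hat, y):
    """Ordinary squared loss, the loss an unmodified ERM solver minimizes."""
    return (y_hat - y) ** 2

# Example query: true answer y, predicted answer y_hat (made-up numbers).
y, y_hat = 250.0, 300.0

# Normalize the query: divide target and prediction by y, so the target is 1.
y_norm, y_hat_norm = y / y, y_hat / y

# Squared loss on the normalized query equals relative error on the original.
assert abs(squared_error(y_hat_norm, y_norm) - relative_error(y_hat, y)) < 1e-12
```

This is why no change to the training algorithm itself is required: rescaling each training query up front turns a squared-loss minimizer into a relative-error minimizer.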