Optimal Subsampling with Influence Functions

Authors: Daniel Ting, Eric Brochu

NeurIPS 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We compared our sampling procedures with existing methods on three real datasets for least squares and quantile regression models. We did not include median L1 regression because it yields sampling with probability proportional to leverage, which we did not consider to be an interesting comparison. Quantile regression, on the other hand, has a number of unique applications beyond what can be found with a robust version of least squares: for instance, quality of service guarantees, understanding high-value donors for a nonprofit, or other cases where large values are of most interest. The datasets we used are the CASP [18] (n = 45730, d = 9) and Online News Popularity (n = 38644, d = 59) datasets from the UCI repository, which are the same as those used by [29], and the EPA Risk-screening Environmental Indicators dataset [12] (n ≈ 9.6M, d = 10), a much larger regression dataset.
Researcher Affiliation | Industry | Daniel Ting, Tableau Software, Seattle, WA, USA, dting@tableau.com; Eric Brochu, Tableau Software, Vancouver, BC, Canada, ebrochu@tableau.com
Pseudocode | Yes | Algorithm 1: Basic Influence Based Sampling(X, Y). Given a model and objective, derive the influence function ψ_θ or an approximation of it. Compute a pilot estimate of the parameters θ̂. Draw an importance sample, taking (X_i, Y_i) with probability π_i ∝ ‖ψ_θ̂(X_i, Y_i)‖ under the constraint π_i ≥ α. Fit a model with the sample using importance weights 1/π_i.
Open Source Code | No | The paper does not contain any explicit statements about making the source code available or provide a link to a code repository.
Open Datasets | Yes | The datasets we used are the CASP [18] (n = 45730, d = 9) and Online News Popularity (n = 38644, d = 59) datasets from the UCI repository, which are the same as those used by [29], and the EPA Risk-screening Environmental Indicators dataset [12] (n ≈ 9.6M, d = 10), a much larger regression dataset.
Dataset Splits | No | The paper describes using a uniform random sample for a pilot estimate and then drawing a weighted subsample, but it does not specify explicit training, validation, or test dataset splits (e.g., percentages, absolute counts, or references to predefined splits).
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory specifications) used for running the experiments. It only notes that run times were not compared.
Software Dependencies | No | The paper does not specify any software dependencies with version numbers (e.g., Python, specific libraries like PyTorch or TensorFlow, or numerical solvers with their versions).
Experiment Setup | No | The paper describes data preprocessing steps and how pilot estimates are derived, but it does not provide specific hyperparameters (e.g., learning rate, batch size, number of epochs, optimizer settings) for the models trained during the experiments.
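
Since no source code is released, the four steps of Algorithm 1 in the Pseudocode entry can be illustrated with a minimal sketch for ordinary least squares. This is not the authors' implementation. It rests on several assumptions that are not taken from the summary above: the function name `influence_subsample_ols` and its parameters are hypothetical; the influence function is the standard least squares form ψ_θ(x, y) = Σ⁻¹ x (y − xᵀθ) with Σ estimated from the pilot sample; `alpha` is treated as a floor on the sampling probabilities; sampling is done by independent coin flips (Poisson sampling); and probabilities are simply clipped to [alpha, 1], so the expected subsample size only approximately equals `n_sample`.

```python
import numpy as np


def influence_subsample_ols(X, y, n_pilot, n_sample, alpha, rng):
    """Hypothetical sketch of influence-based subsampling for least squares."""
    n, _ = X.shape

    # Pilot estimate of the parameters from a uniform random subsample.
    pilot_idx = rng.choice(n, size=n_pilot, replace=False)
    X_pilot, y_pilot = X[pilot_idx], y[pilot_idx]
    theta_pilot, *_ = np.linalg.lstsq(X_pilot, y_pilot, rcond=None)

    # Standard OLS influence function: psi_i = Sigma^{-1} x_i (y_i - x_i^T theta),
    # with Sigma approximated by the pilot sample's second-moment matrix.
    sigma = X_pilot.T @ X_pilot / n_pilot
    residuals = y - X @ theta_pilot
    psi = np.linalg.solve(sigma, X.T).T * residuals[:, None]
    norms = np.linalg.norm(psi, axis=1)

    # Probabilities proportional to the influence norm, scaled toward an
    # expected subsample size of n_sample, then clipped to [alpha, 1].
    # The floor alpha is an assumption of this sketch.
    pi = np.clip(n_sample * norms / norms.sum(), alpha, 1.0)

    # Poisson sampling: keep each row independently with probability pi_i.
    keep = rng.random(n) < pi

    # Weighted least squares with importance weights 1 / pi_i, implemented by
    # scaling rows with the square root of the weights.
    sqrt_w = np.sqrt(1.0 / pi[keep])
    theta_hat, *_ = np.linalg.lstsq(
        X[keep] * sqrt_w[:, None], y[keep] * sqrt_w, rcond=None
    )
    return theta_hat, pi, keep


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(5000, 3))
    theta_true = np.array([1.0, -2.0, 0.5])
    y = X @ theta_true + 0.5 * rng.normal(size=5000)
    theta_hat, pi, keep = influence_subsample_ols(
        X, y, n_pilot=200, n_sample=500, alpha=0.01, rng=rng
    )
    print("subsample size:", int(keep.sum()))
    print("estimate:", theta_hat)
```

The floor on the probabilities keeps the importance weights 1/π_i bounded, which limits the damage when the pilot estimate assigns a near-zero influence to a point that actually matters. For other models, such as quantile regression, only the influence-function step would change.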