Less Is Better: Unweighted Data Subsampling via Influence Function

Authors: Zifeng Wang, Hong Zhu, Zhenhua Dong, Xiuqiang He, Shao-Lun Huang (pp. 6340-6347)

AAAI 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The experiment results demonstrate the method's superiority over existing subsampling methods in diverse tasks, such as text classification, image classification, click-through prediction, etc.
Researcher Affiliation | Collaboration | Zifeng Wang (1), Hong Zhu (2), Zhenhua Dong (2), Xiuqiang He (2), Shao-Lun Huang (1); (1) Tsinghua-Berkeley Shenzhen Institute, Tsinghua University; (2) Noah's Ark Lab, Huawei
Pseudocode | No | The paper includes Fig. 2, titled "Our unweighted subsampling framework", which is a flowchart diagram, not structured pseudocode or an algorithm block.
Open Source Code | Yes | The code can be found at https://github.com/RyanWangZf/InfluenceSubsampling
Open Datasets | Yes | The authors perform extensive experiments on public data sets covering many domains, including computer vision, natural language processing, and click-through rate prediction, and additionally test the methods on the Company data set. Data set statistics and preprocessing details for some data sets are described in appendix E. (Specific datasets mentioned include UCI breast-cancer, diabetes, News20, UCI Adult, CIFAR-10, MNIST, real-sim, SVHN, skin-nonskin, Criteo 1%, Covertype, Avazu-app, and Avazu-site.)
Dataset Splits | Yes | The experiments use a Tr-Va-Te setting, which differs from the Tr-Va setting used in much previous work (see Fig. 4). Both settings proceed in three steps and share the same first two: 1) train the model θ̂ on the full Tr, predict on Va, then compute the influence function (IF); 2) derive sampling probabilities from the IF, sample a subset from Tr, then train the subset model θ.
Hardware Specification | Yes | (a) Run on an Intel i7-6600U CPU @ 2.60 GHz. (b) Run on an Intel Xeon E5-2670 v3 CPU @ 2.30 GHz.
Software Dependencies | No | The paper mentions using logistic regression and methods such as Preconditioned Conjugate Gradient (PCG), but does not provide version numbers for any software, libraries, or programming languages used.
Experiment Setup | No | The paper states "More details about experimental settings can be found in appendix F.", indicating that specific hyperparameter values and training configurations are not present in the main text.
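The two shared steps described under Dataset Splits (train on the full training set, compute influence scores on validation data, map scores to sampling probabilities, then draw a subset) can be sketched as follows. This is a minimal illustration for logistic regression, not the authors' implementation: the score `IF_i = -g_va^T H^{-1} g_i` is the standard first-order influence approximation, and the probability mapping in `subsample` (a sigmoid of the negated score, normalized) is a hypothetical placeholder for the paper's actual sampling rule.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_grad(w, X, y):
    # Per-example gradient of the logistic loss; labels y are in {0, 1}.
    p = sigmoid(X @ w)
    return (p - y)[:, None] * X  # shape (n, d)

def influence_scores(w, X_tr, y_tr, X_va, y_va, damping=1e-3):
    """First-order influence of each training point on the validation loss:
    IF_i = -g_va^T H^{-1} g_i, with a damped Hessian for stability."""
    n, d = X_tr.shape
    p = sigmoid(X_tr @ w)
    # Hessian of the mean training loss for logistic regression.
    H = (X_tr * (p * (1 - p))[:, None]).T @ X_tr / n + damping * np.eye(d)
    g_va = logistic_grad(w, X_va, y_va).mean(axis=0)  # mean validation gradient
    s = np.linalg.solve(H, g_va)                      # H^{-1} g_va
    g_tr = logistic_grad(w, X_tr, y_tr)               # per-example gradients
    return -g_tr @ s                                  # shape (n,)

def subsample(X_tr, y_tr, scores, m, rng=None):
    # Hypothetical mapping from influence scores to sampling probabilities;
    # the paper defines its own rule, this is only an illustrative choice.
    rng = rng or np.random.default_rng(0)
    probs = sigmoid(-scores)
    probs /= probs.sum()
    idx = rng.choice(len(scores), size=m, replace=False, p=probs)
    return X_tr[idx], y_tr[idx]
```

The subset returned here would then be used to retrain an unweighted model, matching step 2 of the Tr-Va-Te pipeline; the damping term is a common practical addition so the Hessian solve stays well-conditioned.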