RDPD: Rich Data Helps Poor Data via Imitation

Authors: Shenda Hong, Cao Xiao, Trong Nghia Hoang, Tengfei Ma, Hongyan Li, Jimeng Sun

IJCAI 2019

Reproducibility Variable Result LLM Response
Research Type: Experimental. "We evaluated RDPD on three real-world datasets and showed that its distilled model consistently outperformed all baselines across all datasets, achieving the greatest performance improvement over a model trained only on low-quality data (24.56% on PR-AUC and 12.21% on ROC-AUC) and over a state-of-the-art KD model (5.91% on PR-AUC and 4.44% on ROC-AUC)."
Researcher Affiliation: Collaboration. "Shenda Hong (1,2,5), Cao Xiao (3), Trong Nghia Hoang (4), Tengfei Ma (4), Hongyan Li (1,2) and Jimeng Sun (5). 1: School of Electronics Engineering and Computer Science, Peking University, China; 2: Key Laboratory of Machine Perception (Ministry of Education), Peking University, China; 3: Analytics Center of Excellence, IQVIA, USA; 4: IBM Research, USA; 5: Department of Computational Science and Engineering, Georgia Institute of Technology, USA."
Pseudocode: Yes. "Algorithm 1 RDPD(Xr, Xp, Y, T)"
Open Source Code: Yes. "Our code is publicly available at https://github.com/hsd1503/RDPD."
Open Datasets: Yes. "PAMAP2 Physical Activity Monitoring Data Set (PAMAP2) [Reiss and Stricker, 2012]. The PTB Diagnostic ECG Database (PTBDB) includes 15 channels of ECG signals collected from controls and patients with heart disease [Bousseljot et al., 1995]. The Medical Information Mart for Intensive Care (MIMIC-III) was collected on over 58,000 ICU patients at the Beth Israel Deaconess Medical Center (BIDMC) from June 2001 to October 2012 [Johnson et al., 2016]."
Dataset Splits: Yes. "In our experiment, we chose data of subject 105 for validation, subject 101 for testing, and the others for training. [For PTBDB:] We randomly divided the data into training (80%), validation (10%) and test (10%) sets by subjects. [For MIMIC-III:] We randomly divided the data into training (80%), validation (10%) and test (10%) sets by patients."
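The 80/10/10 split described above is performed at the subject (or patient) level rather than the sample level, which prevents data from one person leaking across sets. A minimal sketch of such a split, assuming a hypothetical `split_by_subject` helper and seed (this is illustrative, not the authors' code):

```python
import random

def split_by_subject(subject_ids, seed=0):
    """Randomly partition unique subjects into 80/10/10
    train/validation/test groups. Splitting at the subject
    level keeps all samples from one subject in a single set."""
    subjects = sorted(set(subject_ids))
    rng = random.Random(seed)
    rng.shuffle(subjects)
    n = len(subjects)
    n_train = int(0.8 * n)
    n_val = int(0.1 * n)
    train = set(subjects[:n_train])
    val = set(subjects[n_train:n_train + n_val])
    test = set(subjects[n_train + n_val:])
    return train, val, test
```

Samples are then assigned to a set by looking up which group their subject ID fell into.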
Hardware Specification: Yes. "All models were implemented in PyTorch version 0.5.0 and trained on a system equipped with 64GB RAM, 12 Intel Core i7-6850K 3.60GHz CPUs and an Nvidia GeForce GTX 1080."
Software Dependencies: Yes. "All models were implemented in PyTorch version 0.5.0."
Experiment Setup: Yes. "Models were trained with a mini-batch size of 128 samples for 200 iterations, which was sufficient to achieve the best performance on the classification task. All models were optimized using Adam [Kingma and Ba, 2014] with the learning rate set to 0.001. T is set to 5 for PAMAP2 and PTBDB, and to 2.5 for MIMIC-III."
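The T above is the standard knowledge-distillation temperature: logits are divided by T before the softmax, and a larger T flattens the teacher's distribution so the student can learn from the relative scores of non-target classes. A minimal NumPy sketch of a temperature-softened distillation loss (an illustration of the general KD technique, not the authors' exact objective, which also imitates the teacher's attention):

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-softened softmax: larger T flattens the
    output distribution."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=5.0):
    """Cross-entropy between teacher and student distributions,
    both softened by T (T=5 for PAMAP2/PTBDB, T=2.5 for MIMIC-III
    in the setup above). The T**2 factor is the usual scaling that
    keeps soft-target gradients comparable to the hard-label term."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    ce = -(p_teacher * np.log(p_student + 1e-12)).sum(axis=-1)
    return (T ** 2) * ce.mean()
```

In a full training loop this term would be combined with the ordinary cross-entropy against the ground-truth labels and minimized with Adam at learning rate 0.001, as stated in the setup.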