RDPD: Rich Data Helps Poor Data via Imitation
Authors: Shenda Hong, Cao Xiao, Trong Nghia Hoang, Tengfei Ma, Hongyan Li, Jimeng Sun
IJCAI 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluated RDPD on three real-world datasets and showed that its distilled model consistently outperformed all baselines across all datasets, especially achieving the greatest performance improvement over a model trained only on low-quality data by 24.56% on PR-AUC and 12.21% on ROC-AUC, and over that of a state-of-the-art KD model by 5.91% on PR-AUC and 4.44% on ROC-AUC. |
| Researcher Affiliation | Collaboration | Shenda Hong (1,2,5), Cao Xiao (3), Trong Nghia Hoang (4), Tengfei Ma (4), Hongyan Li (1,2) and Jimeng Sun (5). 1: School of Electronics Engineering and Computer Science, Peking University, China; 2: Key Laboratory of Machine Perception (Ministry of Education), Peking University, China; 3: Analytics Center of Excellence, IQVIA, USA; 4: IBM Research, USA; 5: Department of Computational Science and Engineering, Georgia Institute of Technology, USA |
| Pseudocode | Yes | Algorithm 1 RDPD (Xr, Xp, Y , T) |
| Open Source Code | Yes | Our code is publicly available at https://github.com/hsd1503/RDPD. |
| Open Datasets | Yes | PAMAP2 Physical Activity Monitoring Data Set (PAMAP2) [Reiss and Stricker, 2012]; The PTB Diagnostic ECG Database (PTBDB) includes 15 channels of ECG signals collected from controls and patients of heart diseases [Bousseljot et al., 1995]; The Medical Information Mart for Intensive Care (MIMIC-III) is collected on over 58,000 ICU patients at the Beth Israel Deaconess Medical Center (BIDMC) from June 2001 to October 2012 [Johnson et al., 2016]. |
| Dataset Splits | Yes | In our experiment, we choose data of subject 105 for validation, subject 101 for testing, and others for training. In our experiment, we randomly divided the data into training (80%), validation (10%) and test (10%) sets by subjects. In our experiment, we randomly divided the data into training (80%), validation (10%) and test (10%) sets by patients. |
| Hardware Specification | Yes | All models were implemented in PyTorch version 0.5.0, and trained with a system equipped with 64GB RAM, 12 Intel Core i7-6850K 3.60GHz CPUs and an Nvidia GeForce GTX 1080. |
| Software Dependencies | Yes | All models were implemented in PyTorch version 0.5.0. |
| Experiment Setup | Yes | Models are trained with mini-batches of 128 samples for 200 iterations, which was a sufficient number of iterations to achieve the best performance on the classification task. All models were optimized using Adam [Kingma and Ba, 2014], with the learning rate set to 0.001. T is set to 5 for PAMAP2 and PTBDB, and set to 2.5 for MIMIC-III. |
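The table reports a distillation temperature T (5 for PAMAP2 and PTBDB, 2.5 for MIMIC-III) passed to Algorithm 1. As a minimal, hypothetical sketch of how such a temperature typically enters a knowledge-distillation objective (softened teacher and student distributions, cross-entropy scaled by T^2 as in standard KD), not the exact RDPD loss, the function names and numpy-based form below are illustrative assumptions:

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax: larger T yields a softer (flatter) distribution.
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=5.0):
    # Cross-entropy between the teacher's softened outputs and the student's
    # softened outputs; the T**2 factor keeps gradient magnitudes comparable
    # across temperatures (Hinton et al.-style KD). Illustrative only.
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    return -(T ** 2) * float(np.sum(p_teacher * np.log(p_student + 1e-12)))
```

With T = 5 the teacher's class probabilities are much flatter than at T = 1, which is what lets the student learn from the relative similarities between classes rather than only the hard label.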
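The splits row specifies random 80/10/10 divisions made *by subject* (PTBDB) or *by patient* (MIMIC-III), so no individual's records leak across splits. A minimal sketch of that grouping logic, with an assumed helper name and seed parameter not taken from the paper:

```python
import random

def split_by_subject(subject_ids, seed=0, ratios=(0.8, 0.1, 0.1)):
    # Shuffle the unique subjects, then cut the shuffled list into
    # train/val/test so that all samples from one subject land in
    # exactly one split (no subject-level leakage).
    subjects = sorted(set(subject_ids))
    rng = random.Random(seed)
    rng.shuffle(subjects)
    n = len(subjects)
    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    train = set(subjects[:n_train])
    val = set(subjects[n_train:n_train + n_val])
    test = set(subjects[n_train + n_val:])
    return train, val, test
```

Sample-level rows are then routed to the split that contains their subject ID; scikit-learn's `GroupShuffleSplit` offers the same guarantee off the shelf.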