Long Short-Term Sample Distillation

Authors: Liang Jiang, Zujie Wen, Zhongping Liang, Yafang Wang, Gerard de Melo, Zhe Li, Liangzhuang Ma, Jiaxing Zhang, Xiaolong Li, Yuan Qi
Pages: 4345-4352

AAAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Comprehensive experimental results across a range of vision and NLP tasks demonstrate the effectiveness of this new training method."
Researcher Affiliation | Collaboration | Liang Jiang (1), Zujie Wen (1), Zhongping Liang (1), Yafang Wang (1), Gerard de Melo (2), Zhe Li (1), Liangzhuang Ma (1), Jiaxing Zhang (1), Xiaolong Li (1), Yuan Qi (1). (1) AI Department, Ant Financial Services Group; (2) Rutgers University. {tianxuan.jl, zujie.wzj, zhongping.lzp, yafang.wyf}@antfin.com, gdm@demelo.org
Pseudocode | Yes | Algorithm 1: Long Short-Term Sample Distillation
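The pseudocode itself is not reproduced on this page. As a rough illustration of the scheme the paper describes, the PyTorch-style sketch below combines the supervised loss with distillation losses against a short-term teacher (refreshed at each mini-generation boundary) and a long-term teacher (here simply the previous short-term teacher), weighted by λS and λL. The function name, the teacher-promotion strategy, and the default hyperparameters (taken from the quoted experiment setup) are assumptions, not the authors' code; in the paper the long-term signal aggregates supervision over a longer horizon than this sketch shows.

```python
import copy
import torch
import torch.nn.functional as F

def lstsd_train(model, loader, optimizer, epochs,
                lambda_s=4.0, lambda_l=2.4, mini_gen=6):
    """Hedged sketch of Long Short-Term Sample Distillation (LSTSD).

    A frozen copy of the model taken at the most recent mini-generation
    boundary acts as the short-term teacher; the copy from the boundary
    before that acts as the long-term teacher. Both supply per-sample
    soft targets that regularize the usual cross-entropy loss.
    """
    short_teacher, long_teacher = None, None
    for epoch in range(epochs):
        if epoch > 0 and epoch % mini_gen == 0:   # mini-generation boundary
            long_teacher = short_teacher          # promote the older teacher
            short_teacher = copy.deepcopy(model).eval()
        for x, y in loader:
            logits = model(x)
            loss = F.cross_entropy(logits, y)
            with torch.no_grad():                 # teachers are frozen
                if short_teacher is not None:
                    s_t = F.softmax(short_teacher(x), dim=1)
                if long_teacher is not None:
                    l_t = F.softmax(long_teacher(x), dim=1)
            if short_teacher is not None:
                loss = loss + lambda_s * F.kl_div(
                    F.log_softmax(logits, dim=1), s_t, reduction="batchmean")
            if long_teacher is not None:
                loss = loss + lambda_l * F.kl_div(
                    F.log_softmax(logits, dim=1), l_t, reduction="batchmean")
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```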
Open Source Code | No | The paper does not provide a statement about releasing source code or a link to a code repository.
Open Datasets | Yes | "For vision, we evaluate LSTSD on the CIFAR100 dataset, which contains 60,000 RGB images of 32×32 size, split into a training set of 50,000 images and a testing set of 10,000 images. ... For NLP, we used the well-known GLUE benchmark data (Wang et al. 2019)..."
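For reference, the CIFAR-100 split quoted above matches the standard torchvision download exactly (50,000 training and 10,000 test images). A minimal loading sketch, with the root path and batch size as placeholders:

```python
import torchvision
import torchvision.transforms as T
from torch.utils.data import DataLoader

# Standard CIFAR-100: 50,000 training / 10,000 test RGB images of 32x32.
transform = T.Compose([T.ToTensor()])
train_set = torchvision.datasets.CIFAR100(root="./data", train=True,
                                          download=True, transform=transform)
test_set = torchvision.datasets.CIFAR100(root="./data", train=False,
                                         download=True, transform=transform)
train_loader = DataLoader(train_set, batch_size=128, shuffle=True)
test_loader = DataLoader(test_set, batch_size=128, shuffle=False)
```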
Dataset Splits | No | The paper specifies a training set and a testing set for CIFAR100 ("split into a training set of 50,000 images and a testing set of 10,000 images"), but it does not explicitly detail a validation split or its size for any of the datasets used.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments.
Software Dependencies | No | The paper mentions optimizers such as SGD and Adam but does not specify version numbers for any software dependencies or libraries (e.g., Python, PyTorch, TensorFlow, CUDA).
Experiment Setup | Yes | "ResNets are trained for 164 epochs with a batch size of 128, while DenseNets are trained for 300 epochs with a batch size of 64. We trained both ResNets and DenseNets using SGD with a weight decay of 0.0001, a Nesterov momentum of 0.9 and a base learning rate of 0.1, which was divided by 10 at 25%, 50% and 75% of the training process. ... We found the best λS = 4.0, λL = 2.4 and a mini-generation length of 6 epochs for LSTSD..."
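The quoted schedule (base learning rate 0.1, divided by 10 at 25%, 50% and 75% of training) maps onto a standard SGD-plus-step-decay setup; for the 164-epoch ResNet run the milestones fall at epochs 41, 82 and 123. A minimal sketch, assuming PyTorch (the paper does not name its framework) and a placeholder model:

```python
import torch

model = torch.nn.Linear(10, 2)  # placeholder; a ResNet/DenseNet in the paper

# ResNet configuration from the quoted setup: 164 epochs, batch size 128,
# SGD with weight decay 1e-4, Nesterov momentum 0.9, base LR 0.1.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
                            weight_decay=1e-4, nesterov=True)

# LR divided by 10 at 25%, 50%, 75% of 164 epochs -> epochs 41, 82, 123.
epochs = 164
milestones = [int(epochs * f) for f in (0.25, 0.50, 0.75)]
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=milestones, gamma=0.1)
```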