Learning to Teach

Authors: Yang Fan, Fei Tian, Tao Qin, Xiang-Yang Li, Tie-Yan Liu

ICLR 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To demonstrate the practical value of our proposed approach, we take the training of deep neural networks (DNN) as an example, and show that by using the learning to teach techniques, we are able to use much less training data and fewer iterations to achieve almost the same accuracy for different kinds of DNN models (e.g., multi-layer perceptron, convolutional neural networks and recurrent neural networks) under various machine learning tasks (e.g., image classification and text understanding).
Researcher Affiliation | Collaboration | School of Computer Science and Technology, University of Science and Technology of China (fyabc@mail.ustc.edu.cn, xiangyangli@ustc.edu.cn); Microsoft Research ({fetia,taoqin,tyliu}@microsoft.com)
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks.
Open Source Code | No | The paper states 'The code is based on a public Lasagne implementation' (Section 8.1.2), but it provides no statement or link indicating that the authors' own implementation of the described methodology is publicly available.
Open Datasets | Yes | We conduct comprehensive experiments to test the effectiveness of the L2T framework: we consider three most widely used neural network architectures as the student models: multi-layer perceptron (MLP), convolutional neural networks (CNNs) and recurrent neural networks (RNNs), and adopt three popular deep learning tasks: image classification for MNIST, for CIFAR-10 (Krizhevsky, 2009), and sentiment classification for IMDB movie review dataset (Maas et al., 2011).
Dataset Splits | Yes | We evenly split the training data D_train in each task into two folds: D_train^teacher and D_train^student. We conduct experiments as follows. Step 1: The first fold D_train^teacher is used to train the teacher model, with 5% of D_train^teacher acting as a held-out set D_dev used to compute reward for the teacher model during training. (A split sketch follows the table.)
Hardware Specification | Yes | The model is implemented with Theano and run on one NVIDIA Tesla K40 GPU for each training/testing process.
Software Dependencies | No | The paper mentions 'The model is implemented with Theano', 'Adam (Kingma & Ba, 2014) is used to train the MLP and RNN student models', and 'Momentum-SGD (Sutskever et al., 2013) is used for the CNN student model', but it does not specify version numbers for any of these software components.
Experiment Setup | Yes | The student model obeys mini-batch stochastic gradient descent (SGD) as its learning rule (i.e., the arg min part in Eqn. 1). [...] The base model is a three-layer feedforward neural network with 784/500/10 neurons in its input/hidden/output layers. tanh acts as the activation function for the hidden layer. Cross-entropy loss is used for training. [...] The mini-batch size is set as M = 128 and Momentum-SGD (Sutskever et al., 2013) is used as the optimization algorithm. Following the learning rate scheduling strategy in the original paper (He et al., 2015), we set the initial learning rate as 0.1 and multiply it by a factor of 0.1 after the 32k-th and 48k-th model update. [...] The size of word embedding in RNN is 256, the size of hidden state of RNN is 512, and the mini-batch size is set as M = 16. Adam (Kingma & Ba, 2014) is used to perform LSTM model training with early stopping based on validation set accuracy.
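
The Dataset Splits row above describes an even split of the training data into a teacher fold and a student fold, with 5% of the teacher fold held out as D_dev for computing the teacher's reward. Below is a minimal NumPy sketch of that protocol; the function name, the fixed seed, and the use of index arrays are illustrative assumptions rather than details from the paper.

import numpy as np

def split_for_l2t(num_examples, dev_fraction=0.05, seed=0):
    # Shuffle indices once, then split evenly: the first half trains the
    # teacher model, the second half trains the student model.
    rng = np.random.RandomState(seed)
    indices = rng.permutation(num_examples)
    half = num_examples // 2
    teacher_train, student_train = indices[:half], indices[half:]
    # Hold out 5% of the teacher fold as D_dev, used to compute the
    # teacher's reward during training.
    num_dev = int(len(teacher_train) * dev_fraction)
    dev, teacher_train = teacher_train[:num_dev], teacher_train[num_dev:]
    return teacher_train, dev, student_train

# Example: the MNIST training set has 60,000 images.
teacher_idx, dev_idx, student_idx = split_for_l2t(60000)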
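
The Experiment Setup row quotes the student-model hyperparameters. The sketch below restates them as plain Python for reference; it is not the authors' Theano/Lasagne implementation, and the dictionary and function names are assumptions.

# MLP student on MNIST, as quoted: 784/500/10 feedforward network, tanh hidden
# activation, cross-entropy loss; trained with Adam per the Software Dependencies row.
mlp_student = {
    "layer_sizes": [784, 500, 10],
    "hidden_activation": "tanh",
    "loss": "cross_entropy",
    "optimizer": "adam",
}

# CNN student on CIFAR-10, as quoted: mini-batch size M = 128, Momentum-SGD,
# with the step learning-rate schedule of He et al. (2015).
cnn_student = {
    "batch_size": 128,
    "optimizer": "momentum_sgd",
}

def cnn_learning_rate(update_step, base_lr=0.1):
    # As quoted: start at 0.1 and multiply by 0.1 after the 32k-th
    # and 48k-th model updates.
    if update_step < 32000:
        return base_lr
    if update_step < 48000:
        return base_lr * 0.1
    return base_lr * 0.01

# LSTM student on IMDB, as quoted: word embedding size 256, hidden state size 512,
# mini-batch size M = 16, Adam, early stopping on validation accuracy.
rnn_student = {
    "embedding_size": 256,
    "hidden_size": 512,
    "batch_size": 16,
    "optimizer": "adam",
    "early_stopping": "validation_accuracy",
}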