Structured Sparsification of Gated Recurrent Neural Networks

Authors: Ekaterina Lobacheva, Nadezhda Chirkova, Alexander Markovich, Dmitry Vetrov

AAAI 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We test our approach on the text classification and language modeling tasks. Our method improves the neuron-wise compression of the model in most of the tasks. We perform experiments with LSTM architecture in both sparsification frameworks.
Researcher Affiliation | Collaboration | Ekaterina Lobacheva (1), Nadezhda Chirkova (1), Alexander Markovich (2), Dmitry Vetrov (1,3). 1: Samsung-HSE Laboratory, National Research University Higher School of Economics; 2: National Research University Higher School of Economics; 3: Samsung AI Center Moscow. Moscow, Russia. {elobacheva, nchirkova, dvetrov}@hse.ru, amarkovich@edu.hse.ru
Pseudocode | Yes | Algorithm 1: Forward pass through Bayesian LSTM for one sequence (a Python sketch of this forward pass follows the table). Require: [x_1, ..., x_T], c_0, h_0 and parameters μ, σ, b. 1: Sample ε^x_i, ε^h_i, ..., ε_i, ε_f, ... ~ N(0, I). 2: W^x_i = μ^x_i + ε^x_i ⊙ σ^x_i, ...; z_i = μ_i + ε_i ⊙ σ_i, ... 3: // sampling reparametrized weights. 4: for t = 1, ..., T do: 5: f_t = sigm((W^x_f x_t + W^h_f h_{t-1}) ⊙ z_f + b_f), similarly for i_t, o_t, g_t; 6: c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t, h_t = (o_t ⊙ tanh(c_t)) ⊙ z_h. Return [c_1, ..., c_T], [h_1, ..., h_T].
Open Source Code | No | No explicit statement providing concrete access (e.g., a link or explicit release statement) to the source code for the methodology described in this paper was found.
Open Datasets | Yes | Penn Treebank (PTB) dataset (Marcus, Marcinkiewicz, and Santorini 1993); Internet Movie Database (IMDb) dataset (Maas et al. 2011) for binary classification; and AG's Corpus of News Articles (AGNews) dataset (Zhang, Zhao, and LeCun 2015) for four-class classification.
Dataset Splits | Yes | For language modeling, we evaluate quality on the validation and test sets. The strengths of the individual and group Lasso regularizations are selected using grid search so that the validation perplexities of ISS and our model are approximately equal. We compute the gradients of each hidden neuron of the second LSTM layer w.r.t. the input of this layer at different lags t and average the norm of this gradient over the validation set (a sketch of this gradient measurement follows the table).
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory amounts, or cloud instance types) used for running experiments were mentioned in the paper.
Software Dependencies | No | The paper mentions a 'standard TensorFlow implementation' and uses optimization algorithms such as Adam, but it does not provide specific version numbers for TensorFlow or any other software libraries or dependencies used in the experiments.
Experiment Setup | Yes | All the small models, including the baseline model, are trained without dropout... for 20 epochs with Stochastic Gradient Descent (SGD) and a decaying learning rate schedule: the initial learning rate is equal to 1, the learning rate starts to decay after the 4th epoch, and the learning rate decay is equal to 0.6. For the two-level sparsification (W+N), we use group Lasso regularization with λ1 = 0.002 and Lasso regularization with λ2 = 1e-5. For the three-level sparsification (W+G+N), we use group Lasso regularization with λ1 = 0.0017 and Lasso regularization with λ2 = 1e-5. We use the threshold 1e-4 to prune the weights in both models during training. We train our networks using Adam (Kingma and Ba 2015). For the text classification tasks, we use a learning rate of 0.0005 and train the Bayesian models for 800 / 150 epochs on IMDb / AGNews. For the language modeling tasks, we train the Bayesian models for 250 / 50 epochs on character-level / word-level tasks using a learning rate of 0.002. For all the weights that we sparsify, we initialize log σ with -3. We eliminate the weights with a signal-to-noise ratio less than τ = 0.05 (a sketch of the schedule and pruning rule follows the table).
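
The following is a minimal NumPy sketch of the forward pass in Algorithm 1 (Pseudocode row above). It is not the authors' code: the parameter names (mu, sigma, b), dictionary keys, and toy dimensions are assumptions chosen for illustration; only the reparametrization W = μ + ε ⊙ σ, the gate masks z, and the hidden mask z_h follow the algorithm.

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def bayesian_lstm_forward(xs, c0, h0, mu, sigma, b, rng=None):
    """One stochastic forward pass of Algorithm 1: weights W and group variables z
    are sampled via the reparametrization W = mu + eps * sigma, eps ~ N(0, I)."""
    rng = rng or np.random.default_rng(0)
    sample = lambda k: mu[k] + rng.standard_normal(mu[k].shape) * sigma[k]
    W = {k: sample(k) for k in mu if k.startswith("W")}   # Wx_i, Wh_i, Wx_f, ...
    z = {k: sample(k) for k in mu if k.startswith("z")}   # z_i, z_f, z_o, z_g, z_h
    c, h, cs, hs = c0, h0, [], []
    for x in xs:
        gate = {}
        for g in "ifog":
            pre = (W[f"Wx_{g}"] @ x + W[f"Wh_{g}"] @ h) * z[f"z_{g}"] + b[g]
            gate[g] = np.tanh(pre) if g == "g" else sigm(pre)   # candidate gate uses tanh
        c = gate["f"] * c + gate["i"] * gate["g"]
        h = gate["o"] * np.tanh(c) * z["z_h"]
        cs.append(c); hs.append(h)
    return cs, hs

# Toy usage with arbitrary parameters (4 inputs, 3 hidden units).
n_in, n_h = 4, 3
mu = {"z_h": np.ones(n_h)}
for g in "ifog":
    mu[f"Wx_{g}"], mu[f"Wh_{g}"], mu[f"z_{g}"] = np.zeros((n_h, n_in)), np.zeros((n_h, n_h)), np.ones(n_h)
sigma = {k: 0.05 * np.ones_like(v) for k, v in mu.items()}
b = {g: np.zeros(n_h) for g in "ifog"}
cs, hs = bayesian_lstm_forward([np.random.randn(n_in) for _ in range(5)],
                               np.zeros(n_h), np.zeros(n_h), mu, sigma, b)
print(hs[-1])
```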
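
The Dataset Splits row mentions averaging, over the validation set, the norm of the gradient of each hidden neuron of the second LSTM layer w.r.t. the input of that layer at different lags. Below is a minimal PyTorch sketch of that measurement; the two-layer split, tensor sizes, and function name are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

emb_dim, hid_dim = 64, 64                       # assumed sizes, for illustration only
layer1 = nn.LSTM(emb_dim, hid_dim, batch_first=True)
layer2 = nn.LSTM(hid_dim, hid_dim, batch_first=True)

def mean_grad_norm(x_batch, t, lag, neuron):
    """Average over a (validation) batch of || d h2_t[neuron] / d x2_{t-lag} ||_2,
    where x2 (the output of layer 1) is the input of the second LSTM layer."""
    h1, _ = layer1(x_batch)                     # (batch, T, hid_dim)
    x2 = h1.detach().requires_grad_(True)       # differentiate w.r.t. layer-2 input
    h2, _ = layer2(x2)
    # Examples in the batch are independent, so summing over the batch keeps
    # per-example gradients separated in x2.grad.
    h2[:, t, neuron].sum().backward()
    return x2.grad[:, t - lag, :].norm(dim=1).mean().item()

# Toy batch of 8 sequences of length 35 standing in for validation data.
x = torch.randn(8, 35, emb_dim)
print(mean_grad_norm(x, t=30, lag=5, neuron=0))
```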
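
The Experiment Setup row fixes two simple rules that are easy to state in code: the decaying SGD learning-rate schedule (initial rate 1, decay 0.6 starting after the 4th epoch) and signal-to-noise-ratio pruning with threshold τ = 0.05. The sketch below assumes a per-epoch multiplicative decay and 1-indexed epochs, which is one plausible reading of the description rather than the authors' exact code.

```python
import numpy as np

def sgd_learning_rate(epoch, lr0=1.0, decay=0.6, decay_start=4):
    """Learning rate lr0 up to and including epoch `decay_start` (1-indexed),
    then multiplied by `decay` once per subsequent epoch."""
    return lr0 * decay ** max(0, epoch - decay_start)

def prune_by_snr(mu, log_sigma, tau=0.05):
    """Zero out weights whose signal-to-noise ratio |mu| / sigma is below tau."""
    snr = np.abs(mu) / np.exp(log_sigma)
    return np.where(snr < tau, 0.0, mu)

print([round(sgd_learning_rate(e), 3) for e in range(1, 9)])   # 1, 1, 1, 1, 0.6, 0.36, ...
weights = np.random.randn(5) * 0.01
print(prune_by_snr(weights, log_sigma=np.full(5, -3.0)))       # log sigma initialized at -3
```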