h-detach: Modifying the LSTM Gradient Towards Better Optimization

Authors: Bhargav Kanuparthi, Devansh Arpit, Giancarlo Kerg, Nan Rosemary Ke, Ioannis Mitliagkas, Yoshua Bengio

ICLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show significant improvements over vanilla LSTM gradient based training in terms of convergence speed, robustness to seed and learning rate, and generalization using our modification of LSTM gradient on various benchmark datasets.
Researcher Affiliation | Academia | 1) Montreal Institute for Learning Algorithms (MILA), Canada; 2) CIFAR Senior Fellow
Pseudocode | Yes | Algorithm 1: Forward Pass of h-detach Algorithm. (A sketch of this algorithm appears below the table.)
Open Source Code | Yes | Our code is available at https://github.com/bhargav104/h-detach.
Open Datasets | Yes | Using their data generation process, we sample 100,000 training input-target sequence pairs and 5,000 validation pairs. We use 50000 images for training, 10000 for validation and 10000 for testing. We use the Microsoft COCO dataset (Lin et al., 2014) which contains 82,783 training images and 40,504 validation images. ... MNIST handwritten digit database. 2010. URL http://yann.lecun.com/exdb/mnist/.
Dataset Splits | Yes | We sample 100,000 training input-target sequence pairs and 5,000 validation pairs. We use 50000 images for training, 10000 for validation and 10000 for testing. ... we follow the setting in Karpathy & Fei-Fei (2015) which suggests a split of 80,000 training images and 5,000 images each for validation and test set. (A split sketch appears below the table.)
Hardware Specification | No | The paper does not specify any particular hardware (e.g., GPU model, CPU type, memory) used for running the experiments. It only mentions general setups like 'training an LSTM'.
Software Dependencies | No | The paper mentions the 'ADAM optimizer' but does not provide specific version numbers for any software, libraries, or frameworks used (e.g., Python version, TensorFlow/PyTorch version).
Experiment Setup | Yes | We use the ADAM optimizer with batch size 100, learning rate 0.001 and clip the gradient norms to 1. We use the ADAM optimizer with different learning rates 0.001, 0.0005 and 0.0001, and a fixed batch size of 100. We train for 200 epochs and pick our final model based on the best validation score. We use an LSTM with 100 hidden units. We train both the Resnet and LSTM models using the ADAM optimizer (Kingma & Ba, 2014) with a learning rate of 10^-4. (A training-setup sketch appears below the table.)
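The Pseudocode row refers to Algorithm 1, the forward pass of h-detach, which (as described in the paper) stochastically blocks gradients from flowing through the previous hidden state h_{t-1} so that backpropagation is carried mainly by the cell-state path. The PyTorch sketch below illustrates this idea only; the use of `nn.LSTMCell`, the per-time-step Bernoulli sampling, the class name `HDetachLSTM`, and the default `p_detach=0.25` are assumptions for illustration rather than the authors' exact implementation (see their repository linked above).

```python
import torch
import torch.nn as nn


class HDetachLSTM(nn.Module):
    """Minimal sketch of h-detach (assumed structure, not the authors' code):
    with probability p_detach, stop the gradient through the previous hidden
    state h_{t-1} at each time step, leaving the cell-state path c_{t-1} intact."""

    def __init__(self, input_size, hidden_size, p_detach=0.25):
        super().__init__()
        self.cell = nn.LSTMCell(input_size, hidden_size)
        self.hidden_size = hidden_size
        self.p_detach = p_detach  # detach probability (assumed parameter name)

    def forward(self, x):
        # x: (seq_len, batch, input_size)
        batch = x.size(1)
        h = x.new_zeros(batch, self.hidden_size)
        c = x.new_zeros(batch, self.hidden_size)
        outputs = []
        for t in range(x.size(0)):
            h_in = h
            # Stochastically block the gradient through h_{t-1} during training.
            if self.training and torch.rand(1).item() < self.p_detach:
                h_in = h.detach()
            h, c = self.cell(x[t], (h_in, c))
            outputs.append(h)
        return torch.stack(outputs), (h, c)
```

Detaching changes only the backward pass; the forward values are identical with or without it, so gating on `self.training` is merely a tidiness choice in this sketch.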
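The MNIST split quoted in the Dataset Splits row (50,000 train / 10,000 validation / 10,000 test) can be reproduced with standard tooling. The sketch below uses torchvision and a fixed-seed `random_split`; the seed, the flattening transform (reflecting the paper's sequential, pixel-by-pixel MNIST task), and the data directory are assumptions, since the paper does not state how the validation subset was drawn.

```python
import torch
from torch.utils.data import random_split
from torchvision import datasets, transforms

# Flatten each 28x28 image into a 784-step sequence (pixel-by-pixel MNIST).
to_sequence = transforms.Compose([
    transforms.ToTensor(),
    transforms.Lambda(lambda img: img.view(-1, 1)),  # shape (784, 1)
])

full_train = datasets.MNIST("data/", train=True, download=True, transform=to_sequence)
test_set = datasets.MNIST("data/", train=False, download=True, transform=to_sequence)

# 50,000 / 10,000 train-validation split; the fixed seed is an arbitrary choice.
train_set, val_set = random_split(
    full_train, [50_000, 10_000], generator=torch.Generator().manual_seed(0)
)
```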
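The Experiment Setup row quotes the optimizer settings (ADAM, batch size 100, learning rate 0.001, gradient norms clipped to 1, an LSTM with 100 hidden units). A minimal training-loop sketch of that configuration follows; it reuses `HDetachLSTM` and `train_set` from the sketches above, and the linear readout head and the standard PyTorch clipping utility are assumptions about tooling rather than the authors' exact script.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader

model = HDetachLSTM(input_size=1, hidden_size=100)   # LSTM with 100 hidden units
readout = nn.Linear(100, 10)                         # hypothetical classification head
params = list(model.parameters()) + list(readout.parameters())

optimizer = torch.optim.Adam(params, lr=1e-3)        # ADAM, learning rate 0.001
criterion = nn.CrossEntropyLoss()
loader = DataLoader(train_set, batch_size=100, shuffle=True)  # batch size 100

model.train()
for x, y in loader:
    x = x.transpose(0, 1)                 # (batch, 784, 1) -> (seq, batch, feature)
    optimizer.zero_grad()
    _, (h, _) = model(x)                  # use the final hidden state
    loss = criterion(readout(h), y)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(params, 1.0)  # clip gradient norms to 1
    optimizer.step()
```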