h-detach: Modifying the LSTM Gradient Towards Better Optimization

Authors: Bhargav Kanuparthi, Devansh Arpit, Giancarlo Kerg, Nan Rosemary Ke, Ioannis Mitliagkas, Yoshua Bengio

ICLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show significant improvements over vanilla LSTM gradient based training in terms of convergence speed, robustness to seed and learning rate, and generalization using our modification of LSTM gradient on various benchmark datasets.
Researcher Affiliation | Academia | 1) Montreal Institute for Learning Algorithms (MILA), Canada; 2) CIFAR Senior Fellow
Pseudocode | Yes | Algorithm 1: Forward Pass of h-detach Algorithm. (A sketch of this algorithm appears below the table.)
Open Source Code | Yes | Our code is available at https://github.com/bhargav104/h-detach.
Open Datasets | Yes | Using their data generation process, we sample 100,000 training input-target sequence pairs and 5,000 validation pairs. We use 50000 images for training, 10000 for validation and 10000 for testing. We use the Microsoft COCO dataset (Lin et al., 2014) which contains 82,783 training images and 40,504 validation images. ... MNIST handwritten digit database. 2010. URL http://yann.lecun.com/exdb/mnist/.
Dataset Splits | Yes | We sample 100,000 training input-target sequence pairs and 5,000 validation pairs. We use 50000 images for training, 10000 for validation and 10000 for testing. ... we follow the setting in Karpathy & Fei-Fei (2015) which suggests a split of 80,000 training images and 5,000 images each for validation and test set. (A split sketch appears below the table.)
Hardware Specification | No | The paper does not specify any particular hardware (e.g., GPU model, CPU type, memory) used for running the experiments. It only mentions general setups like 'training an LSTM'.
Software Dependencies | No | The paper mentions the 'ADAM optimizer' but does not provide specific version numbers for any software, libraries, or frameworks used (e.g., Python version, TensorFlow/PyTorch version).
Experiment Setup | Yes | We use the ADAM optimizer with batch size 100, learning rate 0.001 and clip the gradient norms to 1. We use the ADAM optimizer with different learning rates 0.001, 0.0005 and 0.0001, and a fixed batch size of 100. We train for 200 epochs and pick our final model based on the best validation score. We use an LSTM with 100 hidden units. We train both the Resnet and LSTM models using the ADAM optimizer (Kingma & Ba, 2014) with a learning rate of 10^-4. (A training-setup sketch appears below the table.)
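The Pseudocode row refers to Algorithm 1, the forward pass of h-detach, which (as described in the paper) stochastically blocks gradients from flowing through the previous hidden state h_{t-1} so that backpropagation is carried mainly by the cell-state path. The PyTorch sketch below illustrates this idea only; the use of `nn.LSTMCell`, the per-time-step Bernoulli sampling, the class name `HDetachLSTM`, and the default `p_detach=0.25` are assumptions for illustration rather than the authors' exact implementation (see their repository linked above).

```python
import torch
import torch.nn as nn


class HDetachLSTM(nn.Module):
    """Minimal sketch of h-detach (assumed structure, not the authors' code):
    with probability p_detach, stop the gradient through the previous hidden
    state h_{t-1} at each time step, leaving the cell-state path c_{t-1} intact."""

    def __init__(self, input_size, hidden_size, p_detach=0.25):
        super().__init__()
        self.cell = nn.LSTMCell(input_size, hidden_size)
        self.hidden_size = hidden_size
        self.p_detach = p_detach  # detach probability (assumed parameter name)

    def forward(self, x):
        # x: (seq_len, batch, input_size)
        batch = x.size(1)
        h = x.new_zeros(batch, self.hidden_size)
        c = x.new_zeros(batch, self.hidden_size)
        outputs = []
        for t in range(x.size(0)):
            h_in = h
            # Stochastically block the gradient through h_{t-1} during training.
            if self.training and torch.rand(1).item() < self.p_detach:
                h_in = h.detach()
            h, c = self.cell(x[t], (h_in, c))
            outputs.append(h)
        return torch.stack(outputs), (h, c)
```

Detaching changes only the backward pass; the forward values are identical with or without it, so gating on `self.training` is merely a tidiness choice in this sketch.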
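The MNIST split quoted in the Dataset Splits row (50,000 train / 10,000 validation / 10,000 test) can be reproduced with standard tooling. The sketch below uses torchvision and a fixed-seed `random_split`; the seed, the flattening transform (reflecting the paper's sequential, pixel-by-pixel MNIST task), and the data directory are assumptions, since the paper does not state how the validation subset was drawn.

```python
import torch
from torch.utils.data import random_split
from torchvision import datasets, transforms

# Flatten each 28x28 image into a 784-step sequence (pixel-by-pixel MNIST).
to_sequence = transforms.Compose([
    transforms.ToTensor(),
    transforms.Lambda(lambda img: img.view(-1, 1)),  # shape (784, 1)
])

full_train = datasets.MNIST("data/", train=True, download=True, transform=to_sequence)
test_set = datasets.MNIST("data/", train=False, download=True, transform=to_sequence)

# 50,000 / 10,000 train-validation split; the fixed seed is an arbitrary choice.
train_set, val_set = random_split(
    full_train, [50_000, 10_000], generator=torch.Generator().manual_seed(0)
)
```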
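The Experiment Setup row quotes the optimizer settings (ADAM, batch size 100, learning rate 0.001, gradient norms clipped to 1, an LSTM with 100 hidden units). A minimal training-loop sketch of that configuration follows; it reuses `HDetachLSTM` and `train_set` from the sketches above, and the linear readout head and the standard PyTorch clipping utility are assumptions about tooling rather than the authors' exact script.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader

model = HDetachLSTM(input_size=1, hidden_size=100)   # LSTM with 100 hidden units
readout = nn.Linear(100, 10)                         # hypothetical classification head
params = list(model.parameters()) + list(readout.parameters())

optimizer = torch.optim.Adam(params, lr=1e-3)        # ADAM, learning rate 0.001
criterion = nn.CrossEntropyLoss()
loader = DataLoader(train_set, batch_size=100, shuffle=True)  # batch size 100

model.train()
for x, y in loader:
    x = x.transpose(0, 1)                 # (batch, 784, 1) -> (seq, batch, feature)
    optimizer.zero_grad()
    _, (h, _) = model(x)                  # use the final hidden state
    loss = criterion(readout(h), y)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(params, 1.0)  # clip gradient norms to 1
    optimizer.step()
```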