h-detach: Modifying the LSTM Gradient Towards Better Optimization
Authors: Bhargav Kanuparthi, Devansh Arpit, Giancarlo Kerg, Nan Rosemary Ke, Ioannis Mitliagkas, Yoshua Bengio
ICLR 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show significant improvements over vanilla LSTM gradient based training in terms of convergence speed, robustness to seed and learning rate, and generalization using our modification of LSTM gradient on various benchmark datasets. |
| Researcher Affiliation | Academia | 1Montreal Institute for Learning Algorithms (MILA), Canada 2CIFAR Senior Fellow |
| Pseudocode | Yes | Algorithm 1 Forward Pass of h-detach Algorithm (a hedged sketch of this forward pass is given after the table). |
| Open Source Code | Yes | Our code is available at https://github.com/bhargav104/h-detach. |
| Open Datasets | Yes | Using their data generation process, we sample 100,000 training input-target sequence pairs and 5,000 validation pairs. We use 50000 images for training, 10000 for validation and 10000 for testing. We use the Microsoft COCO dataset (Lin et al., 2014) which contains 82,783 training images and 40,504 validation images. ... MNIST handwritten digit database. 2010. URL http://yann.lecun.com/exdb/mnist/. |
| Dataset Splits | Yes | We sample 100,000 training input-target sequence pairs and 5,000 validation pairs. We use 50000 images for training, 10000 for validation and 10000 for testing. ... we follow the setting in Karpathy & Fei-Fei (2015) which suggests a split of 80,000 training images and 5,000 images each for validation and test set. |
| Hardware Specification | No | The paper does not specify any particular hardware (e.g., GPU model, CPU type, memory) used for running the experiments. It only mentions general setups like 'training an LSTM'. |
| Software Dependencies | No | The paper mentions 'ADAM optimizer' but does not provide specific version numbers for any software, libraries, or frameworks used (e.g., Python version, TensorFlow/PyTorch version). |
| Experiment Setup | Yes | We use the ADAM optimizer with batch-size 100, learning rate 0.001 and clip the gradient norms to 1. We use the ADAM optimizer with different learning rates 0.001, 0.0005, and 0.0001, and a fixed batch size of 100. We train for 200 epochs and pick our final model based on the best validation score. We use an LSTM with 100 hidden units. We train both the Resnet and LSTM models using the ADAM optimizer (Kingma & Ba, 2014) with a learning rate of 10^-4. (A hedged training-step sketch using these settings follows the table.) |
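
The paper's Algorithm 1 describes the forward pass of h-detach: at each time step, the gradient path through the hidden state is stochastically blocked while the cell-state path is left intact. Below is a minimal PyTorch sketch of that idea, not the authors' implementation (see their repository for the real code); the class name, `detach_prob` parameter, and the choice to share one Bernoulli draw per time step across the batch are assumptions made here for illustration.

```python
import torch
import torch.nn as nn


class HDetachLSTMCell(nn.Module):
    """Sketch of an LSTM cell with stochastic h-detach (illustrative only)."""

    def __init__(self, input_size, hidden_size, detach_prob=0.25):
        super().__init__()
        self.hidden_size = hidden_size
        self.detach_prob = detach_prob  # assumed name; probability of blocking the h-path gradient
        # One linear map per input producing pre-activations for the 4 gates.
        self.x2g = nn.Linear(input_size, 4 * hidden_size)
        self.h2g = nn.Linear(hidden_size, 4 * hidden_size)

    def forward(self, x_t, state):
        h_prev, c_prev = state
        # h-detach: with some probability, stop gradients flowing back through
        # the hidden-state path. The cell-state path through c_prev is untouched.
        if self.training and torch.rand(1).item() < self.detach_prob:
            h_prev = h_prev.detach()
        gates = self.x2g(x_t) + self.h2g(h_prev)
        i, f, g, o = gates.chunk(4, dim=-1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        g = torch.tanh(g)
        c_t = f * c_prev + i * g          # linear temporal path, kept intact
        h_t = o * torch.tanh(c_t)
        return h_t, c_t
```

Note the design point the sketch is meant to surface: only the gradient through h is blocked, the additive cell-state recurrence still carries gradients across all time steps, and the detaching is disabled outside training (`self.training`).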
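
The experiment-setup row quotes ADAM with batch size 100, learning rate 0.001, and gradient norms clipped to 1. The following sketch shows one plausible way to wire those settings into a training step, reusing the `HDetachLSTMCell` sketch above; the `training_step` helper, the dummy zero-initialized state, and the use of the final hidden state directly in the loss (no readout layer) are simplifications assumed here, not details from the paper.

```python
import torch

model = HDetachLSTMCell(input_size=28, hidden_size=100)  # 100 hidden units, as quoted
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # learning rate 0.001


def training_step(x_seq, target, loss_fn):
    # x_seq: (seq_len, batch, input_size); batch size 100 in the quoted setup.
    h = c = torch.zeros(x_seq.size(1), model.hidden_size)
    for x_t in x_seq.unbind(0):
        h, c = model(x_t, (h, c))
    loss = loss_fn(h, target)  # a readout layer is omitted for brevity
    optimizer.zero_grad()
    loss.backward()
    # Clip the gradient norms to 1, as stated in the table.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    return loss.item()
```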