Twin Networks: Matching the Future for Sequence Generation
Authors: Dmitriy Serdyuk, Nan Rosemary Ke, Alessandro Sordoni, Adam Trischler, Chris Pal, Yoshua Bengio
ICLR 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show empirically that our approach achieves 9% relative improvement for a speech recognition task, and achieves significant improvement on a COCO caption generation task. |
| Researcher Affiliation | Collaboration | Montreal Institute for Learning Algorithms (MILA), Canada; Microsoft Research, Canada; École Polytechnique, Canada; CIFAR Senior Fellow |
| Pseudocode | No | The paper describes the model with equations and diagrams, but does not provide structured pseudocode or algorithm blocks (a minimal sketch of the twin matching penalty is given after this table). |
| Open Source Code | Yes | The source code is available at https://github.com/dmitriy-serdyuk/twin-net/. |
| Open Datasets | Yes | We evaluate our model on the Wall Street Journal (WSJ) dataset closely following the setting described in Bahdanau et al. (2016); the Microsoft COCO dataset (Lin et al., 2014); sequential MNIST; and the Penn Treebank and WikiText-2 datasets (Merity et al., 2017). |
| Dataset Splits | Yes | These are 80,000 training images and 5,000 images for validation and test. We do early stopping based on the validation CIDEr scores and we report BLEU-1 to BLEU-4, CIDEr, and Meteor scores. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory used for running the experiments. |
| Software Dependencies | No | The paper mentions frameworks such as Theano, Blocks and Fuel, and PyTorch, but does not provide specific version numbers for these or other software libraries used in the experiments. |
| Experiment Setup | Yes | We use 40 mel-filter bank features with delta and delta-deltas with their energies as the acoustic inputs to the model... The resulting input feature dimension is 123... pretrain the model for 1 epoch... 10 epochs of training... annealing on the models with 2 different learning rates and 3 epochs for each annealing stage. We use the AdaDelta optimizer for training. We perform a small hyper-parameter search on the weight α of our twin loss, α ∈ {2.0, 1.5, 1.0, 0.5, 0.25, 0.1}... We use an LSTM with 512 hidden units... Both models are trained with the Adam (Kingma & Ba, 2014) optimizer with a learning rate of 10⁻⁴... we use an LSTM with 3 layers of 512 hidden units for both forward and backward LSTMs, batch size 20, learning rate 0.001, and clip the gradient norms to 5. We use Adam (Kingma & Ba, 2014) as our optimization algorithm and we decay the learning rate by half after 5, 10, and 15 epochs. (A hedged sketch of this language-modelling configuration also appears after this table.) |
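
Since the paper provides no pseudocode (see the Pseudocode row), the following is a minimal sketch of the twin matching penalty as we read it from the paper's description: a forward RNN and a backward RNN each predict the same target token, and an affine map g(·) of the forward hidden state is pulled toward the corresponding backward hidden state, weighted by α. Names such as `hidden_size`, `affine`, and `twin_penalty` are illustrative and are not taken from the authors' released code.

```python
import torch
import torch.nn as nn

# Minimal sketch of the twin matching penalty, under the reading that a
# forward RNN and a backward RNN predict the same target token and an affine
# map of the forward state is matched to the corresponding backward state.

hidden_size = 512                              # "an LSTM with 512 hidden units"
alpha = 1.0                                    # twin-loss weight, searched over {2.0, ..., 0.1}
affine = nn.Linear(hidden_size, hidden_size)   # g(.) in the paper's notation


def twin_penalty(h_forward, h_backward):
    """L2 distance between g(h_t^f) and h_t^b, averaged over time and batch.

    Both tensors are (seq_len, batch, hidden_size), aligned so that the two
    states predict the same target token. The backward states are detached so
    the penalty only shapes the forward network (our reading of the paper;
    treat it as an assumption).
    """
    target = h_backward.detach()
    return torch.norm(affine(h_forward) - target, dim=-1).mean()


# Training objective (sketch): nll_forward + nll_backward + alpha * twin_penalty(...)
# Only the forward network is used for decoding at test time.
```
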
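For concreteness, here is a hedged sketch of the language-modelling configuration quoted in the Experiment Setup row: 3-layer, 512-unit LSTMs for both directions, batch size 20, Adam at learning rate 0.001, gradient norms clipped to 5, and the learning rate halved after epochs 5, 10, and 15. The input/embedding size, variable names, and total number of epochs are assumptions for illustration, not details from the paper.

```python
import torch
import torch.nn as nn

# Sketch of the quoted language-modelling setup (Penn Treebank / WikiText-2):
# 3-layer LSTMs with 512 hidden units, batch size 20, Adam with lr 0.001,
# gradient norms clipped to 5, lr halved after epochs 5, 10, and 15.

forward_lm = nn.LSTM(input_size=512, hidden_size=512, num_layers=3)   # input size assumed
backward_lm = nn.LSTM(input_size=512, hidden_size=512, num_layers=3)
params = list(forward_lm.parameters()) + list(backward_lm.parameters())

optimizer = torch.optim.Adam(params, lr=1e-3)
# "decay the learning rate by half after 5, 10, and 15 epochs"
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[5, 10, 15], gamma=0.5)

batch_size = 20
max_grad_norm = 5.0


def apply_update(loss):
    """One optimization step with the gradient clipping described above."""
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(params, max_grad_norm)
    optimizer.step()

# After each training epoch: scheduler.step()
```
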