Twin Networks: Matching the Future for Sequence Generation
Authors: Dmitriy Serdyuk, Nan Rosemary Ke, Alessandro Sordoni, Adam Trischler, Chris Pal, Yoshua Bengio
ICLR 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show empirically that our approach achieves 9% relative improvement for a speech recognition task, and achieves significant improvement on a COCO caption generation task. |
| Researcher Affiliation | Collaboration | Montreal Institute for Learning Algorithms (MILA), Canada; Microsoft Research, Canada; École Polytechnique, Canada; CIFAR Senior Fellow |
| Pseudocode | No | The paper describes the model with equations and diagrams, but does not provide structured pseudocode or algorithm blocks (a minimal sketch of the twin matching penalty is given after this table). |
| Open Source Code | Yes | The source code is available at https://github.com/dmitriy-serdyuk/twin-net/. |
| Open Datasets | Yes | We evaluate our model on the Wall Street Journal (WSJ) dataset closely following the setting described in Bahdanau et al. (2016); the Microsoft COCO dataset (Lin et al., 2014); sequential MNIST; and the Penn Treebank and WikiText-2 datasets (Merity et al., 2017). |
| Dataset Splits | Yes | These are 80,000 training images and 5,000 images for validation and test. We do early stopping based on the validation CIDEr scores and we report BLEU-1 to BLEU-4, CIDEr, and Meteor scores. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory used for running the experiments. |
| Software Dependencies | No | The paper mentions frameworks such as Theano, Blocks and Fuel, and PyTorch, but does not provide specific version numbers for these or other software libraries used in the experiments. |
| Experiment Setup | Yes | We use 40 mel-filter bank features with delta and delta-deltas with their energies as the acoustic inputs to the model... The resulting input feature dimension is 123... pretrain the model for 1 epoch... 10 epochs of training... annealing on the models with 2 different learning rates and 3 epochs for each annealing stage. We use the AdaDelta optimizer for training. We perform a small hyper-parameter search on the weight α of our twin loss, α ∈ {2.0, 1.5, 1.0, 0.5, 0.25, 0.1}... We use an LSTM with 512 hidden units... Both models are trained with the Adam (Kingma & Ba, 2014) optimizer with a learning rate of 10⁻⁴... we use an LSTM with 3 layers of 512 hidden units for both forward and backward LSTMs, batch size 20, learning rate 0.001, and clip the gradient norms to 5. We use Adam (Kingma & Ba, 2014) as our optimization algorithm and we decay the learning rate by half after 5, 10, and 15 epochs. (A hedged sketch of this language-modelling configuration also appears after this table.) |
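
Since the paper provides no pseudocode (see the Pseudocode row), the following is a minimal sketch of the twin matching penalty as we read it from the paper's description: a forward RNN and a backward RNN each predict the same target token, and an affine map g(·) of the forward hidden state is pulled toward the corresponding backward hidden state, weighted by α. Names such as `hidden_size`, `affine`, and `twin_penalty` are illustrative and are not taken from the authors' released code.

```python
import torch
import torch.nn as nn

# Minimal sketch of the twin matching penalty, under the reading that a
# forward RNN and a backward RNN predict the same target token and an affine
# map of the forward state is matched to the corresponding backward state.

hidden_size = 512                              # "an LSTM with 512 hidden units"
alpha = 1.0                                    # twin-loss weight, searched over {2.0, ..., 0.1}
affine = nn.Linear(hidden_size, hidden_size)   # g(.) in the paper's notation


def twin_penalty(h_forward, h_backward):
    """L2 distance between g(h_t^f) and h_t^b, averaged over time and batch.

    Both tensors are (seq_len, batch, hidden_size), aligned so that the two
    states predict the same target token. The backward states are detached so
    the penalty only shapes the forward network (our reading of the paper;
    treat it as an assumption).
    """
    target = h_backward.detach()
    return torch.norm(affine(h_forward) - target, dim=-1).mean()


# Training objective (sketch): nll_forward + nll_backward + alpha * twin_penalty(...)
# Only the forward network is used for decoding at test time.
```
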
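For concreteness, here is a hedged sketch of the language-modelling configuration quoted in the Experiment Setup row: 3-layer, 512-unit LSTMs for both directions, batch size 20, Adam at learning rate 0.001, gradient norms clipped to 5, and the learning rate halved after epochs 5, 10, and 15. The input/embedding size, variable names, and total number of epochs are assumptions for illustration, not details from the paper.

```python
import torch
import torch.nn as nn

# Sketch of the quoted language-modelling setup (Penn Treebank / WikiText-2):
# 3-layer LSTMs with 512 hidden units, batch size 20, Adam with lr 0.001,
# gradient norms clipped to 5, lr halved after epochs 5, 10, and 15.

forward_lm = nn.LSTM(input_size=512, hidden_size=512, num_layers=3)   # input size assumed
backward_lm = nn.LSTM(input_size=512, hidden_size=512, num_layers=3)
params = list(forward_lm.parameters()) + list(backward_lm.parameters())

optimizer = torch.optim.Adam(params, lr=1e-3)
# "decay the learning rate by half after 5, 10, and 15 epochs"
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[5, 10, 15], gamma=0.5)

batch_size = 20
max_grad_norm = 5.0


def apply_update(loss):
    """One optimization step with the gradient clipping described above."""
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(params, max_grad_norm)
    optimizer.step()

# After each training epoch: scheduler.step()
```
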