Monotonic Chunkwise Attention

Authors: Chung-Cheng Chiu*, Colin Raffel*

ICLR 2018

Reproducibility Variable Result LLM Response
Research Type Experimental When applied to online speech recognition, we obtain state-of-the-art results and match the performance of a model using an offline soft attention mechanism. In document summarization experiments where we do not expect monotonic alignments, we show significantly improved performance compared to a baseline monotonic attention-based model.
Researcher Affiliation Industry Chung-Cheng Chiu & Colin Raffel Google Brain Mountain View, CA, 94043, USA {chungchengc,craffel}@google.com
Pseudocode Yes Algorithm 1 MoChA decoding process (test time).
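The paper's Algorithm 1 describes hard monotonic selection at test time followed by soft attention over a fixed-width chunk. A minimal NumPy sketch of one decoding step, paraphrased from that description (the function signature and precomputed energy arrays are illustrative assumptions, not the paper's interface):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mocha_decode_step(monotonic_energy, chunk_energy, memory, prev_t, w):
    """One test-time MoChA decoding step (sketch of Algorithm 1).

    Scan memory entries from the previously attended index, stop at the
    first one whose monotonic selection probability >= 0.5, then take a
    softmax over the length-w chunk ending at that position.
    """
    T = len(memory)
    for j in range(prev_t, T):
        if sigmoid(monotonic_energy[j]) >= 0.5:
            start = max(0, j - w + 1)
            u = chunk_energy[start:j + 1]
            alpha = np.exp(u - u.max())        # numerically stable softmax
            alpha /= alpha.sum()
            context = alpha @ memory[start:j + 1]
            return context, j
    # No entry selected: produce an all-zero context, following the
    # monotonic attention convention of Raffel et al. (2017).
    return np.zeros(memory.shape[1]), prev_t
```

In the real model the energies would be computed from the decoder state and memory via the energy functions of eq. (13); here they are passed in as precomputed arrays to keep the control flow of the algorithm visible.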
Open Source Code Yes To facilitate building on our work, we provide an example implementation of MoChA online: https://github.com/craffel/mocha
Open Datasets Yes Online speech recognition on the Wall Street Journal (WSJ) corpus (Paul & Baker, 1992). ... Document summarization on the CNN/Daily Mail corpus (Nallapati et al., 2016).
Dataset Splits Yes In all experiments, we report metrics on the test set at the training step of best performance on a validation set.
Hardware Specification No The paper mentions that experiments were done using TensorFlow but does not specify any hardware details like GPU models, CPU types, or cloud instance specifications.
Software Dependencies No All experiments were done using TensorFlow (Abadi et al., 2016). ... Further, we coded the benchmark in C++ using the Eigen library (Guennebaud et al., 2010). No specific version numbers for TensorFlow or other libraries used in the main experiments are provided.
Experiment Setup Yes Specifically, for MoChA we used eq. (13) for both the Monotonic Energy and the Chunk Energy functions. Following (Raffel et al., 2017), we initialized g = 1/√d (d being the attention energy function hidden dimension) and tuned initial values for r based on validation set performance, using r = −4 for MoChA on speech recognition, r = 0 for MoChA on summarization, and r = −1 for our monotonic attention baseline on summarization. We similarly tuned the chunk size w: For speech recognition, we were surprised to find that all of w ∈ {2, 3, 4, 6, 8} performed comparably and thus chose the smallest value of w = 2. For summarization, we found w = 8 to work best. ... The network was trained using the Adam optimizer (Kingma & Ba, 2014) with β1 = 0.9, β2 = 0.999, and ϵ = 10⁻⁶. The initial learning rate of 0.001 was dropped by a factor of 10 after 600,000, 800,000, and 1,000,000 steps. ... Inputs were fed into the network in batches of 8 utterances...
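The energy function referenced above (eq. (13) of Raffel et al., 2017) is a weight-normalized variant of additive attention: a(s, h) = g · (v/‖v‖)ᵀ tanh(Wₛs + Wₕh + b) + r, with g initialized to 1/√d and r tuned per task. A minimal NumPy sketch, assuming illustrative dimensions and random weights (none of these specific values come from the paper):

```python
import numpy as np

def monotonic_energy(s, h, W_s, W_h, b, v, g, r):
    """Normalized energy function, eq. (13) of Raffel et al. (2017):
    a(s, h) = g * (v / ||v||)^T tanh(W_s @ s + W_h @ h + b) + r
    Used for both the Monotonic Energy and the Chunk Energy in MoChA.
    """
    v_unit = v / np.linalg.norm(v)
    return g * (v_unit @ np.tanh(W_s @ s + W_h @ h + b)) + r

# Hypothetical setup for illustration only.
d = 128                                  # attention hidden dimension
rng = np.random.default_rng(0)
W_s = rng.standard_normal((d, d)) * 0.01
W_h = rng.standard_normal((d, d)) * 0.01
b = np.zeros(d)
v = rng.standard_normal(d)
g = 1.0 / np.sqrt(d)                     # initialization from the paper
r = -4.0                                 # MoChA on speech recognition
```

The scalar offset r shifts the pre-sigmoid energies, which controls how eagerly the monotonic mechanism attends early in training; this is why the paper tunes its initial value per task.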