Monotonic Chunkwise Attention

Authors: Chung-Cheng Chiu*, Colin Raffel*

ICLR 2018

Reproducibility Variable Result LLM Response
Research Type Experimental When applied to online speech recognition, we obtain state-of-the-art results and match the performance of a model using an offline soft attention mechanism. In document summarization experiments where we do not expect monotonic alignments, we show significantly improved performance compared to a baseline monotonic attention-based model.
Researcher Affiliation Industry Chung-Cheng Chiu & Colin Raffel Google Brain Mountain View, CA, 94043, USA {chungchengc,craffel}@google.com
Pseudocode Yes Algorithm 1 MoChA decoding process (test time).
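The paper's Algorithm 1 describes hard monotonic selection at test time followed by soft attention over a fixed-width chunk. A minimal NumPy sketch of one decoding step, paraphrased from that description (the function signature and precomputed energy arrays are illustrative assumptions, not the paper's interface):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mocha_decode_step(monotonic_energy, chunk_energy, memory, prev_t, w):
    """One test-time MoChA decoding step (sketch of Algorithm 1).

    Scan memory entries from the previously attended index, stop at the
    first one whose monotonic selection probability >= 0.5, then take a
    softmax over the length-w chunk ending at that position.
    """
    T = len(memory)
    for j in range(prev_t, T):
        if sigmoid(monotonic_energy[j]) >= 0.5:
            start = max(0, j - w + 1)
            u = chunk_energy[start:j + 1]
            alpha = np.exp(u - u.max())        # numerically stable softmax
            alpha /= alpha.sum()
            context = alpha @ memory[start:j + 1]
            return context, j
    # No entry selected: produce an all-zero context, following the
    # monotonic attention convention of Raffel et al. (2017).
    return np.zeros(memory.shape[1]), prev_t
```

In the real model the energies would be computed from the decoder state and memory via the energy functions of eq. (13); here they are passed in as precomputed arrays to keep the control flow of the algorithm visible.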
Open Source Code Yes To facilitate building on our work, we provide an example implementation of MoChA online: https://github.com/craffel/mocha
Open Datasets Yes Online speech recognition on the Wall Street Journal (WSJ) corpus (Paul & Baker, 1992). ... Document summarization on the CNN/Daily Mail corpus (Nallapati et al., 2016).
Dataset Splits Yes In all experiments, we report metrics on the test set at the training step of best performance on a validation set.
Hardware Specification No The paper mentions that experiments were done using TensorFlow but does not specify any hardware details like GPU models, CPU types, or cloud instance specifications.
Software Dependencies No All experiments were done using TensorFlow (Abadi et al., 2016). ... Further, we coded the benchmark in C++ using the Eigen library (Guennebaud et al., 2010). No specific version numbers for TensorFlow or other libraries used in the main experiments are provided.
Experiment Setup Yes Specifically, for MoChA we used eq. (13) for both the Monotonic Energy and the Chunk Energy functions. Following (Raffel et al., 2017), we initialized g = 1/√d (d being the attention energy function hidden dimension) and tuned initial values for r based on validation set performance, using r = −4 for MoChA on speech recognition, r = 0 for MoChA on summarization, and r = −1 for our monotonic attention baseline on summarization. We similarly tuned the chunk size w: For speech recognition, we were surprised to find that all of w ∈ {2, 3, 4, 6, 8} performed comparably and thus chose the smallest value of w = 2. For summarization, we found w = 8 to work best. ... The network was trained using the Adam optimizer (Kingma & Ba, 2014) with β1 = 0.9, β2 = 0.999, and ϵ = 10⁻⁶. The initial learning rate of 0.001 was dropped by a factor of 10 after 600,000, 800,000, and 1,000,000 steps. ... Inputs were fed into the network in batches of 8 utterances...
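The energy function referenced above (eq. (13) of Raffel et al., 2017) is a weight-normalized variant of additive attention: a(s, h) = g · (v/‖v‖)ᵀ tanh(Wₛs + Wₕh + b) + r, with g initialized to 1/√d and r tuned per task. A minimal NumPy sketch, assuming illustrative dimensions and random weights (none of these specific values come from the paper):

```python
import numpy as np

def monotonic_energy(s, h, W_s, W_h, b, v, g, r):
    """Normalized energy function, eq. (13) of Raffel et al. (2017):
    a(s, h) = g * (v / ||v||)^T tanh(W_s @ s + W_h @ h + b) + r
    Used for both the Monotonic Energy and the Chunk Energy in MoChA.
    """
    v_unit = v / np.linalg.norm(v)
    return g * (v_unit @ np.tanh(W_s @ s + W_h @ h + b)) + r

# Hypothetical setup for illustration only.
d = 128                                  # attention hidden dimension
rng = np.random.default_rng(0)
W_s = rng.standard_normal((d, d)) * 0.01
W_h = rng.standard_normal((d, d)) * 0.01
b = np.zeros(d)
v = rng.standard_normal(d)
g = 1.0 / np.sqrt(d)                     # initialization from the paper
r = -4.0                                 # MoChA on speech recognition
```

The scalar offset r shifts the pre-sigmoid energies, which controls how eagerly the monotonic mechanism attends early in training; this is why the paper tunes its initial value per task.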