Monotonic Chunkwise Attention
Authors: Chung-Cheng Chiu*, Colin Raffel*
ICLR 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | When applied to online speech recognition, we obtain state-of-the-art results and match the performance of a model using an offline soft attention mechanism. In document summarization experiments where we do not expect monotonic alignments, we show significantly improved performance compared to a baseline monotonic attention-based model. |
| Researcher Affiliation | Industry | Chung-Cheng Chiu & Colin Raffel, Google Brain, Mountain View, CA 94043, USA. {chungchengc,craffel}@google.com |
| Pseudocode | Yes | Algorithm 1: MoChA decoding process (test time). (A minimal sketch of this decoding loop appears after the table.) |
| Open Source Code | Yes | To facilitate building on our work, we provide an example implementation of MoChA online: https://github.com/craffel/mocha |
| Open Datasets | Yes | Online speech recognition on the Wall Street Journal (WSJ) corpus (Paul & Baker, 1992). ... Document summarization on the CNN/Daily Mail corpus (Nallapati et al., 2016). |
| Dataset Splits | Yes | In all experiments, we report metrics on the test set at the training step of best performance on a validation set. |
| Hardware Specification | No | The paper mentions that experiments were done using TensorFlow but does not specify any hardware details like GPU models, CPU types, or cloud instance specifications. |
| Software Dependencies | No | All experiments were done using TensorFlow (Abadi et al., 2016). ... Further, we coded the benchmark in C++ using the Eigen library (Guennebaud et al., 2010). No specific version numbers for TensorFlow or other libraries used in the main experiments are provided. |
| Experiment Setup | Yes | Specifically, for MoChA we used eq. (13) for both the Monotonic Energy and the Chunk Energy functions. Following Raffel et al. (2017), we initialized g = 1/√d (d being the attention energy function hidden dimension) and tuned initial values for r based on validation set performance, using r = 4 for MoChA on speech recognition, r = 0 for MoChA on summarization, and r = 1 for our monotonic attention baseline on summarization. We similarly tuned the chunk size w: For speech recognition, we were surprised to find that all of w ∈ {2, 3, 4, 6, 8} performed comparably and thus chose the smallest value of w = 2. For summarization, we found w = 8 to work best. ... The network was trained using the Adam optimizer (Kingma & Ba, 2014) with β₁ = 0.9, β₂ = 0.999, and ϵ = 10⁻⁶. The initial learning rate of 0.001 was dropped by a factor of 10 after 600,000, 800,000, and 1,000,000 steps. ... Inputs were fed into the network in batches of 8 utterances... (Sketches of eq. (13) and this learning-rate schedule appear after the table.) |
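
The decoding loop referenced in the Pseudocode row can be paraphrased compactly. Below is a minimal NumPy sketch of MoChA's test-time behavior as described in the paper: hard monotonic selection of a memory index, followed by soft attention over the width-w chunk ending there. The callables `monotonic_energy` and `chunk_energy` stand in for the paper's learned energy functions; this is an illustrative sketch, not the reference implementation at the linked repository.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def mocha_decode_step(monotonic_energy, chunk_energy, memory, s_prev, t_prev, w):
    """One test-time MoChA decoding step (cf. Algorithm 1 in the paper).

    memory : (T, d) array of encoder states h_1..h_T
    s_prev : previous decoder state, fed to both energy functions
    t_prev : memory index attended to at the previous output step
    w      : chunk width (the paper uses w = 2 for speech recognition)
    """
    T = memory.shape[0]
    for j in range(t_prev, T):
        # Hard monotonic selection: at test time the selection
        # probability is thresholded rather than sampled.
        if sigmoid(monotonic_energy(s_prev, memory[j])) >= 0.5:
            # Soft attention over the length-w chunk ending at index j.
            start = max(0, j - w + 1)
            u = np.array([chunk_energy(s_prev, memory[k])
                          for k in range(start, j + 1)])
            return softmax(u) @ memory[start:j + 1], j
    # If no entry is selected, fall back to an all-zeros context vector
    # (the convention inherited from Raffel et al., 2017).
    return np.zeros(memory.shape[1]), T
```

Because attention only ever moves forward from t_prev and then inspects a fixed-width chunk, each decode step does a bounded amount of work per inspected frame, which is what makes the mechanism usable for online decoding.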
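The eq. (13) quoted in the Experiment Setup row is the normalized energy function of Raffel et al. (2017), Energy(s, h) = g · (vᵀ/‖v‖) tanh(W_s s + W_h h + b) + r, with g initialized to 1/√d and r tuned per task. A hedged sketch follows; the Gaussian weight initializations are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

class NormalizedEnergy:
    """Energy function of eq. (13):
        g * (v / ||v||) . tanh(W_s s + W_h h + b) + r
    g is initialized to 1/sqrt(d) per the quoted setup; r is tuned per
    task (e.g. r = 4 for MoChA on speech recognition). The weight
    initializations below are illustrative placeholders.
    """
    def __init__(self, d_dec, d_enc, d_att, r_init, seed=0):
        rng = np.random.default_rng(seed)
        self.W_s = rng.normal(scale=0.1, size=(d_att, d_dec))
        self.W_h = rng.normal(scale=0.1, size=(d_att, d_enc))
        self.b = np.zeros(d_att)
        self.v = rng.normal(scale=0.1, size=d_att)
        self.g = 1.0 / np.sqrt(d_att)  # g = 1/sqrt(d), from the quoted setup
        self.r = r_init                # tuned offset, e.g. 4 for speech

    def __call__(self, s, h):
        v_hat = self.v / np.linalg.norm(self.v)
        return self.g * v_hat @ np.tanh(self.W_s @ s + self.W_h @ h + self.b) + self.r
```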
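The optimizer details in the same row translate directly into a stepwise learning-rate schedule. The helper name below is hypothetical, but the constants are exactly those quoted above.

```python
def learning_rate(step):
    """Quoted schedule: start at 0.001 and drop by a factor of 10
    after 600,000, 800,000, and 1,000,000 training steps."""
    lr = 0.001
    for boundary in (600_000, 800_000, 1_000_000):
        if step >= boundary:
            lr /= 10.0
    return lr
```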