Shiftable Context: Addressing Training-Inference Context Mismatch in Simultaneous Speech Translation
Authors: Matthew Raffel, Drew Penney, Lizhong Chen
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments on the English-German, English-French, and English-Spanish language pairs from the MuST-C dataset demonstrate that when applied to the Augmented Memory Transformer, a state-of-the-art model for simultaneous speech translation, the proposed scheme achieves an average increase of 2.09, 1.83, and 1.95 BLEU scores across each wait-k value for the three language pairs, respectively, with a minimal impact on computation-aware Average Lagging. |
| Researcher Affiliation | Academia | College of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR, United States. Correspondence to: Matthew Raffel <raffelm@oregonstate.edu>, Drew Penney <penneyd@oregonstate.edu>, Lizhong Chen <chenliz@oregonstate.edu>. |
| Pseudocode | No | The paper describes its methods using prose and mathematical equations but does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our publicly available implementation of the shiftable context for the Augmented Memory Transformer is provided in the following GitHub repository: https://github.com/OSU-STARLAB/Shiftable_Context. |
| Open Datasets | Yes | We conducted experiments on the English-German (en-de), English-French (en-fr), and English-Spanish (en-es) language pairs from the MuST-C dataset (Cattoni et al., 2021). |
| Dataset Splits | Yes | The training was conducted on the train set. After each epoch, each model was validated against the dev set. ... The two evaluation sets used to determine the performance of the model were tst-COMMON and tst-HE. |
| Hardware Specification | Yes | All training was performed on a single V100-32GB GPU. ... The evaluations were all performed on a single V100-32GB GPU. |
| Software Dependencies | No | The paper mentions software like Fairseq, Kaldi, SacreBLEU, and the SimulEval toolkit, but does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | Our Augmented Memory Transformer has 33.1M parameters. Its encoder begins with 2 convolution layers with a combined subsampling factor of 4, followed by a feedforward neural network. The encoder of each model consisted of 12 layers, and its decoder consisted of 6 layers. Each of these layers had a hidden size of 256 with 4 attention heads. Layer normalization was performed prior to each layer. Additionally, we trained each Augmented Memory Transformer with a wait-1, wait-3, wait-5, and wait-7 policy using a pre-decision ratio of 8 (Ma et al., 2018; 2020b). Such an approach allowed us to analyze how each of our proposed schemes scaled with latency while also providing more certainty about our results. The segment of each Augmented Memory Transformer was composed of a left context of 32 tokens, a center context of 64 tokens, and a right context of 32 tokens. The encoder self-attention calculation used 3 memory banks. The clipping distance of the relative positional encodings was 16 tokens (Shaw et al., 2018). ... For the ASR pretraining, the model was trained with label-smoothed cross-entropy loss, the Adam optimizer (Kingma & Ba, 2014), and an inverse square root scheduler. Each model was trained with a warmup period of 4000 updates, where the learning rate was 0.0001, followed by a learning rate of 0.0007. The only regularization for the ASR pretraining was a dropout of 0.1. Each ASR-pretrained model used early stopping with a patience of 5. ... For the SimulST training, the model was also trained with label-smoothed cross-entropy loss, the Adam optimizer, and an inverse square root scheduler. There was a warmup period of 7500 updates, where the learning rate was 0.0001, followed by a learning rate of 0.00035. To regularize the model weights, we used a weight decay value of 0.0001, a dropout of 0.1, an activation dropout of 0.2, and an attention dropout of 0.2. All models were trained with early stopping using a patience of 10. After the training was complete, the final 10 checkpoints were averaged. (Illustrative sketches of the wait-k policy, the learning-rate schedule, and checkpoint averaging appear after this table.) |
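
The setup trains each model with a wait-k policy (k ∈ {1, 3, 5, 7}) and a pre-decision ratio of 8, meaning a read/write decision is made once per group of 8 encoder states rather than per state. The sketch below illustrates the standard wait-k decision rule under these settings; the function name and arguments are ours, and this is not the authors' implementation.

```python
def waitk_action(num_encoder_states, num_target_tokens, k=3,
                 pre_decision_ratio=8, source_finished=False):
    """Return "READ" or "WRITE" under a standard wait-k policy.

    Illustrative sketch only: decisions are made per pre-decision segment
    (groups of `pre_decision_ratio` encoder states), as in Ma et al. (2020).
    """
    if source_finished:
        # No more speech to read; finish writing the translation.
        return "WRITE"
    # Number of complete pre-decision segments received so far.
    num_segments = num_encoder_states // pre_decision_ratio
    # Wait for k segments before the first token, then alternate
    # one write per additional segment read.
    if num_segments < k + num_target_tokens:
        return "READ"
    return "WRITE"
```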
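
Both the ASR pretraining and the SimulST training use an inverse square root scheduler with a warmup phase (4000 updates from 0.0001 to 0.0007 for ASR; 7500 updates from 0.0001 to 0.00035 for SimulST). The sketch below shows the usual form of that schedule, as in fairseq's `inverse_sqrt` scheduler; reading 0.0001 as the warmup-initial rate and the second value as the peak rate is our assumption.

```python
import math

def inverse_sqrt_lr(step, warmup_updates=4000,
                    warmup_init_lr=1e-4, peak_lr=7e-4):
    """Learning rate at a given update: linear warmup + inverse-sqrt decay.

    Sketch under assumed init/peak values; not taken verbatim from the paper.
    """
    if step <= warmup_updates:
        # Linear warmup from warmup_init_lr to peak_lr.
        return warmup_init_lr + (peak_lr - warmup_init_lr) * step / warmup_updates
    # Afterwards, decay proportionally to the inverse square root of the step.
    return peak_lr * math.sqrt(warmup_updates / step)
```

For the SimulST stage, the same function would be called with `warmup_updates=7500` and `peak_lr=3.5e-4`.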
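
The final model is obtained by averaging the last 10 checkpoints. A minimal sketch of parameter averaging is shown below, assuming fairseq-style checkpoints that store model weights under a `"model"` key (fairseq also ships an equivalent `average_checkpoints.py` script); the paths and file layout are hypothetical.

```python
import torch

def average_checkpoints(checkpoint_paths):
    """Average model parameters across several saved checkpoints.

    Assumes each checkpoint is a dict with its parameters under "model",
    as in fairseq; this is an illustrative sketch, not the authors' script.
    """
    avg_state = None
    for path in checkpoint_paths:
        state = torch.load(path, map_location="cpu")["model"]
        if avg_state is None:
            avg_state = {name: param.clone().float() for name, param in state.items()}
        else:
            for name, param in state.items():
                avg_state[name] += param.float()
    # Divide the accumulated parameters by the number of checkpoints.
    return {name: param / len(checkpoint_paths) for name, param in avg_state.items()}
```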