Learning Multiscale Transformer Models for Sequence Generation
Authors: Bei Li, Tong Zheng, Yi Jing, Chengbo Jiao, Tong Xiao, Jingbo Zhu
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The proposed Universal MultiScale Transformer (UMST) was evaluated on two sequence generation tasks. Notably, it yielded consistent performance gains over the strong baseline on several test sets without sacrificing efficiency. |
| Researcher Affiliation | Collaboration | ¹School of Computer Science and Engineering, Northeastern University, Shenyang, China; ²NiuTrans Research, Shenyang, China. |
| Pseudocode | No | The paper does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/libeineu/UMST. |
| Open Datasets | Yes | We report results on two machine translation datasets, including a large-scale WMT'14 English-German (En-De) dataset and a WMT'16 English-Romanian (En-Ro) dataset. For the En-De dataset, the training data consisted of approximately 4.5M tokenized sentence pairs, as in (Vaswani et al., 2017). ... We also test the model's ability to process long sequences on the CNN-Daily Mail summarization task (Nallapati et al., 2016; Hermann et al., 2015). |
| Dataset Splits | Yes | We selected newstest2013 as the validation data and newstest2014 as the test data. ... We use newsdev-2016 and newstest-2016 as the validation and test sets, respectively. |
| Hardware Specification | No | The paper mentions simulating a "128-gpu batching schema" and GPU allocation, but does not specify any particular GPU models (e.g., NVIDIA V100, A100) or other hardware components like CPUs or memory details. |
| Software Dependencies | No | The paper mentions using "Adam optimizer (Kingma & Ba, 2015)", a codebase built on "Fairseq (Ott et al., 2019)", and an "open-source parsing tool proposed by Stanford". However, it does not provide specific version numbers for these software components (e.g., Fairseq 0.10.0, Stanford CoreNLP 4.2.0), which are required for a reproducible description. |
| Experiment Setup | Yes | All systems were trained with the Adam optimizer (Kingma & Ba, 2015), with β1 and β2 set to 0.9 and 0.997. The learning rate and warmup steps were 2e-3/16,000 for the machine translation task and 2e-3/8,000 for the abstractive summarization task, respectively. ... For the deep model, the hidden size is 512 and the FFN filter size is 2048. We split the hidden space into 8 pieces for the multi-head attention mechanism. Dropout and label smoothing are both set to 0.1. For the big model, the hidden size and filter size are twice those of the deep model. Note that the residual dropout is 0.3 for big models. ... A 128-GPU batching schema is simulated via gradient accumulation, with a max-token size of 9600 and a parameter update every 8 steps. |
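
As a rough illustration of the Experiment Setup row above, the sketch below maps the quoted hyperparameters onto standard `fairseq-train` flags (the paper's codebase is built on Fairseq). The architecture name, data path, save directory, and the choice of the inverse_sqrt scheduler are assumptions for illustration only; the actual values are defined in the authors' UMST repository, and the paper does not pin a Fairseq version.

```python
# Minimal sketch (not the authors' script): the reported deep-model settings
# expressed as standard fairseq-train flags. ARCH, DATA_BIN, and --save-dir
# are hypothetical placeholders, not taken from the paper or the UMST repo.
import subprocess

DATA_BIN = "data-bin/wmt14_en_de"   # hypothetical path to preprocessed WMT'14 En-De data
ARCH = "umst_transformer_deep"      # placeholder; the real arch name comes from the repo

cmd = [
    "fairseq-train", DATA_BIN,
    "--arch", ARCH,
    # Adam with beta1 = 0.9, beta2 = 0.997, as reported
    "--optimizer", "adam",
    "--adam-betas", "(0.9, 0.997)",
    # MT recipe: lr 2e-3 with 16,000 warmup steps (scheduler assumed to be inverse_sqrt)
    "--lr", "2e-3",
    "--lr-scheduler", "inverse_sqrt",
    "--warmup-updates", "16000",
    # Deep-model dimensions: hidden size 512, FFN filter size 2048, 8 attention heads
    "--encoder-embed-dim", "512",
    "--encoder-ffn-embed-dim", "2048",
    "--encoder-attention-heads", "8",
    # Dropout and label smoothing both 0.1 (the big model doubles the hidden and
    # filter sizes and raises residual dropout to 0.3)
    "--dropout", "0.1",
    "--criterion", "label_smoothed_cross_entropy",
    "--label-smoothing", "0.1",
    # 9600 max tokens per step, gradients accumulated over 8 steps to approximate
    # the 128-GPU batching schema described in the paper
    "--max-tokens", "9600",
    "--update-freq", "8",
    "--save-dir", "checkpoints/umst_wmt14_en_de",  # hypothetical output directory
]

subprocess.run(cmd, check=True)
```

Under this reading, the effective batch size is max-tokens × update-freq × the number of physical GPUs, which is how a 128-GPU schema can be approximated without 128 physical devices; since the paper does not state the physical GPU count, that factor is left implicit here.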