Binarized Neural Machine Translation
Authors: Yichi Zhang, Ankush Garg, Yuan Cao, Lukasz Lew, Behrooz Ghorbani, Zhiru Zhang, Orhan Firat
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on the WMT dataset show that a one-bit weight-only Transformer can achieve the same quality as a float one, while being 16× smaller in size. One-bit activations incur varying degrees of quality drop, but these are mitigated by the proposed architectural changes. We further conduct a scaling law study using production-scale translation datasets, which shows that one-bit weight Transformers scale and generalize well in both in-domain and out-of-domain settings. In this section, we empirically evaluate our proposed binarized Transformer on MT tasks at different scales. (A minimal weight-binarization sketch follows the table.) |
| Researcher Affiliation | Collaboration | Yichi Zhang (Cornell University, yz2499@cornell.edu); Ankush Garg* (Google DeepMind, ankugarg@google.com); Yuan Cao (Google DeepMind, yuancao@google.com); Łukasz Lew (Google Research, lew@google.com); Behrooz Ghorbani (OpenAI, ghorbani@openai.com); Zhiru Zhang (Cornell University, zhiruz@cornell.edu); Orhan Firat (Google DeepMind, orhanf@google.com) |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Source code is available in the init2winit library: https://github.com/google/init2winit/blob/master/init2winit/model_lib/xformer_translate_binary.py |
| Open Datasets | Yes | We first train a standard 6-layer encoder-decoder (6L6L) Transformer on the WMT2017 De-En translation dataset [9] and evaluate it on the WMT2014 De-En dataset. [9] is 'Ondřej Bojar, Yvette Graham, and Amir Kamran. Results of the WMT17 metrics shared task. In Proceedings of the Second Conference on Machine Translation, 2017.' |
| Dataset Splits | No | The paper mentions training on WMT2017 De-En and evaluating on WMT2014 De-En and refers to 'VAL LOSS' in Table 1, but it does not provide specific details on how the training, validation, and test splits were derived (e.g., percentages, exact sample counts, or explicit mention of standard validation splits for WMT2017). |
| Hardware Specification | Yes | We train the model with a 4×8 TPU topology. |
| Software Dependencies | No | The paper mentions using specific software components like the 'Adam optimizer', the 'sacreBLEU library', and the 'BLEURT model', but does not provide version numbers for these or other relevant software dependencies. |
| Experiment Setup | Yes | Model. We use a 6L6L Transformer as the base model. Embedding dimension is 1024. Each multi-head attention layer has 16 heads, with a combined QKV dimension of 1024 across all heads. The hidden projection dimension in FFNs is 4096. Dropout layers have a dropout rate of 0.1. The Adam optimizer [22] is used with β1 = 0.9 and β2 = 0.98. No weight decay is applied. Batch size is 1024. Base learning rate is 0.001. The first LR cycle has 50000 steps; the others have 88339 steps. (See the hyperparameter config sketch after this table.) |
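
The paper's central technique is replacing float weights with one-bit weights in the Transformer's dense layers; the released implementation is the init2winit file linked in the Open Source Code row. The snippet below is a minimal JAX sketch of generic weight binarization with a straight-through estimator, assuming a per-column mean-absolute-value scale. It illustrates the general idea only and is not the paper's exact quantization scheme or the code in `xformer_translate_binary.py`.

```python
# Hedged sketch: generic one-bit weight quantization with a straight-through
# estimator (STE) in JAX. Not the paper's exact scheme; for illustration only.
import jax
import jax.numpy as jnp


def binarize_weights(w: jnp.ndarray) -> jnp.ndarray:
    """Map float weights to {-alpha, +alpha}; gradients pass straight through."""
    # Per-output-column scale (an assumption; the paper's scaling may differ).
    alpha = jnp.mean(jnp.abs(w), axis=0, keepdims=True)
    w_bin = jnp.sign(w) * alpha
    # STE: forward pass uses w_bin, backward pass sees the identity w -> w.
    return w + jax.lax.stop_gradient(w_bin - w)


def binary_dense(x: jnp.ndarray, w: jnp.ndarray) -> jnp.ndarray:
    """Dense layer whose weights are binarized on the fly during training."""
    return x @ binarize_weights(w)


# Example: a (1024 -> 4096) FFN projection, matching the reported dimensions.
key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (8, 1024))            # batch of activations
w = jax.random.normal(key, (1024, 4096)) * 0.02  # float master weights
y = binary_dense(x, w)
print(y.shape)  # (8, 4096)
```

Because the forward pass only ever sees sign-valued weights (times a scale), each weight can be stored in one bit at inference time, which is the source of the 16× size reduction relative to 16-bit floats quoted above.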
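For reference, the hyperparameters quoted in the Experiment Setup row can be collected into a single config. Only the numerical values (dimensions, dropout, Adam betas, batch size, learning rate, cycle lengths) come from the paper; the `optax` call is an illustrative assumption, since the paper trains with the init2winit library, whose optimizer plumbing differs.

```python
# Hedged sketch: reported hyperparameters gathered into a config dict,
# with an Adam optimizer built via optax for illustration.
import optax

config = {
    # Model (6L6L encoder-decoder Transformer)
    "emb_dim": 1024,
    "num_heads": 16,
    "qkv_dim": 1024,       # combined across all heads
    "mlp_dim": 4096,
    "dropout_rate": 0.1,
    # Optimization
    "batch_size": 1024,
    "base_learning_rate": 1e-3,
    "adam_b1": 0.9,
    "adam_b2": 0.98,
    "weight_decay": 0.0,
    # LR schedule: first cycle 50,000 steps, later cycles 88,339 steps.
    "first_cycle_steps": 50_000,
    "later_cycle_steps": 88_339,
}

optimizer = optax.adam(
    learning_rate=config["base_learning_rate"],
    b1=config["adam_b1"],
    b2=config["adam_b2"],
)
```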