Binarized Neural Machine Translation

Authors: Yichi Zhang, Ankush Garg, Yuan Cao, Lukasz Lew, Behrooz Ghorbani, Zhiru Zhang, Orhan Firat

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on the WMT dataset show that a one-bit weight-only Transformer can achieve the same quality as a float one, while being 16× smaller in size. One-bit activations incur varying degrees of quality drop, but these can be mitigated by the proposed architectural changes. We further conduct a scaling law study using production-scale translation datasets, which shows that one-bit weight Transformers scale and generalize well in both in-domain and out-of-domain settings. In this section, we empirically evaluate our proposed binarized Transformer on MT tasks at different scales. (A minimal weight-binarization sketch is given after the table.)
Researcher Affiliation | Collaboration | Yichi Zhang, Cornell University, yz2499@cornell.edu; Ankush Garg*, Google DeepMind, ankugarg@google.com; Yuan Cao, Google DeepMind, yuancao@google.com; Łukasz Lew, Google Research, lew@google.com; Behrooz Ghorbani, OpenAI, ghorbani@openai.com; Zhiru Zhang, Cornell University, zhiruz@cornell.edu; Orhan Firat, Google DeepMind, orhanf@google.com
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Source code is available in the init2winit library: https://github.com/google/init2winit/blob/master/init2winit/model_lib/xformer_translate_binary.py
Open Datasets | Yes | We first train a standard 6-layer encoder-decoder (6L6L) Transformer on the WMT2017 De-En translation dataset [9] and evaluate it on the WMT2014 De-En dataset. Reference [9] is 'Ondřej Bojar, Yvette Graham, and Amir Kamran. Results of the WMT17 metrics shared task. In Proceedings of the Second Conference on Machine Translation, 2017.' (See the data-loading sketch after the table.)
Dataset Splits | No | The paper mentions training on WMT2017 De-En and evaluating on WMT2014 De-En and refers to 'VAL LOSS' in Table 1, but it does not provide specific details on how the training, validation, and test splits were derived (e.g., percentages, exact sample counts, or explicit mention of standard validation splits for WMT2017).
Hardware Specification | Yes | We train the model with a 4×8 TPU topology.
Software Dependencies | No | The paper mentions specific software components such as the Adam optimizer, the sacreBLEU library, and the BLEURT model, but does not provide version numbers for these or other relevant software dependencies. (See the sacreBLEU scoring sketch after the table.)
Experiment Setup | Yes | Model: we use a 6L6L Transformer as the base model. The embedding dimension is 1024. Each multi-head attention layer has 16 heads, with a combined QKV dimension of 1024 across all heads. The hidden projection dimension in the FFNs is 4096. Dropout layers have a dropout rate of 0.1. The Adam optimizer [22] is used with β1 = 0.9 and β2 = 0.98. No weight decay is applied. The batch size is 1024 and the base learning rate is 0.001. The first LR cycle has 50000 steps; subsequent cycles have 88339 steps. (An optimizer-configuration sketch follows the table.)
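
The core technique assessed above is one-bit (sign) binarization of Transformer weights. Below is a minimal sketch of weight binarization with a straight-through estimator in JAX; the function names, the per-channel float scaling, and the layer structure are illustrative assumptions and are not taken from the paper or the init2winit implementation.

```python
# Minimal sketch of one-bit weight binarization with a straight-through
# estimator (STE) in JAX. All names are illustrative assumptions and not
# taken from the paper or the init2winit implementation.
import jax
import jax.numpy as jnp


def binarize_ste(w):
    """Binarize weights to {-1, +1}; gradients pass straight through to w."""
    w_bin = jnp.sign(w)
    # Forward pass uses w_bin; backward pass sees the identity w -> w.
    return w + jax.lax.stop_gradient(w_bin - w)


def binary_dense(params, x):
    """Dense layer whose weight matrix is binarized at matmul time.

    A per-output-channel float scale is kept, a common choice in binarized
    networks (assumption: the paper's exact scaling scheme may differ).
    """
    w, b = params["w"], params["b"]
    scale = jnp.mean(jnp.abs(w), axis=0, keepdims=True)
    return x @ (binarize_ste(w) * scale) + b


# Usage: the latent float weights still receive gradients despite sign().
key = jax.random.PRNGKey(0)
params = {"w": jax.random.normal(key, (16, 8)), "b": jnp.zeros((8,))}
x = jax.random.normal(key, (4, 16))
loss_fn = lambda p: jnp.sum(binary_dense(p, x) ** 2)
grads = jax.grad(loss_fn)(params)
print(grads["w"].shape)  # (16, 8)
```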
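Both datasets cited in the Open Datasets row are publicly distributed, for example through TensorFlow Datasets. The sketch below only shows raw data access; the paper's tokenization and preprocessing pipeline is not reproduced here, and using the WMT14 "test" split for evaluation is an assumption.

```python
# Hedged sketch: accessing the WMT De-En data referenced above via
# TensorFlow Datasets. The paper's preprocessing/tokenization is not shown.
import tensorflow_datasets as tfds

# WMT17 German-English training data.
train_ds = tfds.load("wmt17_translate/de-en", split="train")
# WMT14 German-English for evaluation (assumption: standard "test" split).
eval_ds = tfds.load("wmt14_translate/de-en", split="test")

for example in train_ds.take(1):
    print(example["de"].numpy().decode("utf-8"))
    print(example["en"].numpy().decode("utf-8"))
```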
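The evaluation tooling named in the Software Dependencies row includes the sacreBLEU library. A minimal scoring sketch follows; the hypothesis and reference strings are placeholders, and no particular sacreBLEU version is implied.

```python
# Hedged sketch of corpus-level BLEU scoring with sacreBLEU (version not
# reported in the paper). Strings below are placeholder examples.
import sacrebleu

hypotheses = ["The cat sits on the mat."]
references = [["The cat is sitting on the mat."]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```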
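The Experiment Setup row specifies Adam with β1 = 0.9, β2 = 0.98, no weight decay, a base learning rate of 0.001, and cyclic learning-rate schedule lengths. A hedged Optax configuration sketch is shown below; the warmup length and the cosine shape of each cycle are assumptions, since only the cycle lengths are stated.

```python
# Hedged sketch of the optimizer configuration described above, using Optax.
# The warmup length and cosine cycle shape are assumptions; only the cycle
# lengths (50000 steps, then 88339 steps) are stated in the summary.
import optax

base_lr = 1e-3
first_cycle = optax.warmup_cosine_decay_schedule(
    init_value=0.0,
    peak_value=base_lr,
    warmup_steps=1_000,   # assumption: warmup length is not reported
    decay_steps=50_000,   # first LR cycle
)
later_cycle = optax.cosine_decay_schedule(
    init_value=base_lr,
    decay_steps=88_339,   # subsequent LR cycles
)
# Chain the first cycle with one later cycle (repeat as needed).
schedule = optax.join_schedules(
    schedules=[first_cycle, later_cycle],
    boundaries=[50_000],
)
optimizer = optax.adam(learning_rate=schedule, b1=0.9, b2=0.98)  # no weight decay
```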