Federated Nearest Neighbor Machine Translation

Authors: Yichao Du, Zhirui Zhang, Bingzhe Wu, Lemao Liu, Tong Xu, Enhong Chen

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments show that FedNN significantly reduces computational and communication costs compared with FedAvg, while maintaining promising translation performance in different FL settings.
Researcher Affiliation | Collaboration | University of Science and Technology of China; State Key Laboratory of Cognitive Intelligence; Tencent AI Lab. Contacts: duyichao@mail.ustc.edu.cn; {tongxu, cheneh}@ustc.edu.cn; zrustc11@gmail.com; {bingzhewu, redmondliu}@tencent.com
Pseudocode | No | The paper includes a workflow diagram (Figure 1) but does not contain any structured pseudocode or algorithm blocks (a generic kNN-MT decoding sketch is given below the table for orientation).
Open Source Code | Yes | Our code is open-sourced on https://github.com/duyichao/FedNN-MT.
Open Datasets | Yes | We adopt WMT14 En-De data (Bojar et al., 2014) and the multi-domain En-De dataset (Koehn & Knowles, 2017) to simulate two typical FL scenarios for model evaluation: 1) the non-independently and identically distributed (Non-IID) setting, where each client holds data from a different domain; 2) the independently and identically distributed (IID) setting, where each client has the same data distribution drawn from all domains. (A client-partition sketch follows the table.)
Dataset Splits | Yes | Table 3 ("The statistics of datasets for server and clients"): Server WMT14 ... Dev 45,206 ...; Client IT ... Dev 2,000.
Hardware Specification | Yes | We train all models with 4 Tesla-V100 GPUs and set patience to 5 to select the best checkpoint on the validation set.
Software Dependencies | No | The paper mentions software such as FAIRSEQ, the Adam optimizer, FAISS, the Moses toolkit, and sacreBLEU, but it does not specify version numbers for these components, which reproducibility requires (see the version-recording sketch below the table).
Experiment Setup | Yes | The input embedding size of the transformer layer is 512, the FFN layer dimension is 2048, and the number of self-attention heads is 8. During training, we deploy the Adam optimizer (Kingma & Ba, 2015) with a learning rate of 5e-4 and 4K warm-up updates to optimize model parameters. Both the label smoothing coefficient and the dropout rate are set to 0.1. The batch size is set to 16K tokens. (A configuration sketch follows the table.)
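
Regarding the Pseudocode row: since the paper ships only a workflow diagram, the following is a minimal sketch of the generic kNN-MT decoding rule (Khandelwal et al., 2021) that nearest-neighbor MT systems such as FedNN build on. It is not the paper's own algorithm; the function name and the values of lam and temperature are illustrative assumptions.

```python
import numpy as np

def knn_mt_interpolate(nmt_probs, neighbor_tokens, neighbor_dists,
                       vocab_size, lam=0.4, temperature=10.0):
    """Mix the NMT distribution with a retrieval distribution built from the
    k nearest datastore entries, in the spirit of kNN-MT.

    nmt_probs:        (vocab_size,) softmax output of the NMT model at this step
    neighbor_tokens:  (k,) target tokens stored with the retrieved datastore keys
    neighbor_dists:   (k,) distances between the decoder query and the retrieved keys
    lam, temperature: illustrative values, not taken from the paper
    """
    # Turn distances into normalized retrieval weights.
    weights = np.exp(-np.asarray(neighbor_dists, dtype=np.float64) / temperature)
    weights /= weights.sum()

    # Scatter-add the weights onto the vocabulary positions of the retrieved tokens.
    knn_probs = np.zeros(vocab_size)
    np.add.at(knn_probs, np.asarray(neighbor_tokens, dtype=np.int64), weights)

    # Interpolate the two distributions.
    return lam * knn_probs + (1.0 - lam) * np.asarray(nmt_probs)
```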
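Regarding the Open Datasets row: a minimal sketch of how a multi-domain corpus could be split across clients to mimic the Non-IID and IID settings described above. The function and variable names are hypothetical, and the Non-IID branch assumes whole domains are assigned to clients (one domain per client when their counts match).

```python
import random
from collections import defaultdict

def partition_clients(examples_by_domain, num_clients, iid, seed=0):
    """Split a multi-domain corpus across FL clients.

    examples_by_domain: dict mapping a domain name (e.g. "IT", "Medical") to
                        a list of sentence pairs; names here are illustrative.
    Non-IID: whole domains are assigned to clients, so domains are never mixed.
    IID:     all domains are pooled, shuffled, and split evenly across clients.
    """
    rng = random.Random(seed)
    clients = defaultdict(list)

    if iid:
        pool = [ex for exs in examples_by_domain.values() for ex in exs]
        rng.shuffle(pool)
        for i, ex in enumerate(pool):
            clients[i % num_clients].append(ex)
    else:
        for i, (_, exs) in enumerate(sorted(examples_by_domain.items())):
            clients[i % num_clients].extend(exs)

    return dict(clients)
```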
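Regarding the Software Dependencies row: a small sketch for recording the package versions of a reproduction environment, since the paper does not pin them. The package names (fairseq, torch, faiss-cpu, sacrebleu) are assumptions about how the mentioned tools are typically installed from PyPI; adjust them to the actual environment (e.g. faiss-gpu), and note that the Moses toolkit is a standalone tool not covered by this snippet.

```python
# Print installed versions of the Python packages behind the tools the paper mentions.
from importlib.metadata import version, PackageNotFoundError

for pkg in ["fairseq", "torch", "faiss-cpu", "sacrebleu"]:
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```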
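Regarding the Experiment Setup row: a minimal PyTorch sketch wiring the quoted hyperparameters together. The use of torch.nn.Transformer (the paper uses FAIRSEQ), the layer count defaults, the Adam betas (0.9, 0.98), and the inverse-square-root warm-up schedule are assumptions of this sketch, and the 16K-token batching is omitted.

```python
import torch
from torch import nn

# Hyperparameters quoted in the Experiment Setup row.
D_MODEL, FFN_DIM, HEADS = 512, 2048, 8
DROPOUT, LABEL_SMOOTHING = 0.1, 0.1
LR, WARMUP_UPDATES = 5e-4, 4000

# Core encoder-decoder stack with the stated dimensions (embeddings/vocab omitted).
model = nn.Transformer(d_model=D_MODEL, nhead=HEADS,
                       dim_feedforward=FFN_DIM, dropout=DROPOUT)

criterion = nn.CrossEntropyLoss(label_smoothing=LABEL_SMOOTHING)
optimizer = torch.optim.Adam(model.parameters(), lr=LR, betas=(0.9, 0.98))

def inverse_sqrt(step):
    # Linear warm-up for WARMUP_UPDATES steps, then decay proportional to step^-0.5;
    # the schedule itself is an assumption commonly paired with 4K warm-up updates.
    step = max(step, 1)
    return min(step / WARMUP_UPDATES, (WARMUP_UPDATES / step) ** 0.5)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=inverse_sqrt)
```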