Nearest Neighbor Machine Translation
Authors: Urvashi Khandelwal, Angela Fan, Dan Jurafsky, Luke Zettlemoyer, Mike Lewis
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show that kNN-MT scales to datastores containing billions of tokens, improving results across a range of settings. For example, it improves a state-of-the-art German-English translation model by 1.5 BLEU. (Section 3, Experimental Setup) We experiment with kNN-MT in three settings: (1) single language-pair translation, (2) multilingual MT, and (3) domain adaptation. |
| Researcher Affiliation | Collaboration | Stanford University Facebook AI Research {urvashik,jurafsky}@stanford.edu {angelafan,lsz,mikelewis}@fb.com |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. Figure 1 illustrates the kNN distribution computation, but it is not formal pseudocode. |
| Open Source Code | No | Code for kNN-MT will be available at https://github.com/urvashik/knnlm. |
| Open Datasets | Yes | Data We use the following datasets for training and evaluation. WMT 19: For the single language-pair experiments, we use WMT 19 data for German-English. CCMATRIX: We train our multilingual model on CCMatrix (Schwenk et al., 2019), containing parallel data for 79 languages and 1,546 language pairs. MULTI-DOMAINS: We use the multi-domains dataset (Koehn & Knowles, 2017), re-split by Aharoni & Goldberg (2020) for the domain adaptation experiments. |
| Dataset Splits | Yes | NEWSTEST: The newstest2018 and newstest2019 test sets from WMT (Bojar et al., 2018; Barrault et al., 2019) are used as validation and test sets for the multilingual experiments. The same German-English validation and test sets are also used for evaluation in the single language-pair and domain adaptation experiments. The interpolation and softmax temperature parameters are tuned on the validation sets. We provide validation set BLEU scores as well as hyperparameter choices for our experiments in Appendix A. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory specifications) used for running the experiments. It mentions FAISS for similarity search but no details on the hardware it was run on. |
| Software Dependencies | No | The paper mentions software such as the FAIRSEQ library (Ott et al., 2019) and sentencepiece (Kudo & Richardson, 2018), and uses SACREBLEU (Post, 2018) for metrics, but it does not provide specific version numbers for FAIRSEQ or sentencepiece; only SACREBLEU's signature is given. |
| Experiment Setup | Yes | For the single language-pair and domain adaptation experiments, we use the WMT 19 German-English news translation task winner (Ng et al., 2019), available via the FAIRSEQ library (Ott et al., 2019). It is a Transformer encoder-decoder model (Vaswani et al., 2017) with 6 layers, 1,024 dimensional representations, 8,192 dimensional feedforward layers and 8 attention heads. For multilingual MT, we trained a 418M parameter Transformer-based encoder-decoder model on the CCMatrix data for 100K updates. The model has embedding dimension 1,024, hidden dimension 4,096, 12 layers in both the encoder and decoder, with 16 attention heads. During inference, we query the datastore for 64 neighbors while searching 32 clusters. The interpolation and softmax temperature parameters are tuned on the validation sets. |
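The setup row above describes the core kNN-MT inference step: query a datastore of (decoder-state, target-token) pairs for nearest neighbors, convert their distances into a softmax-with-temperature distribution over target tokens, and interpolate with the base model's distribution. A minimal sketch of that computation, using brute-force NumPy search in place of the FAISS index the paper uses at scale (function name, `k`, `temperature`, and `lam` values here are illustrative, not the paper's tuned hyperparameters):

```python
import numpy as np

def knn_mt_distribution(query, keys, values, p_model,
                        k=8, temperature=10.0, lam=0.5):
    """Interpolate the base MT distribution with a kNN distribution.

    query:   decoder hidden state for the current timestep, shape (d,)
    keys:    datastore keys (decoder states), shape (n, d)
    values:  datastore values (target token ids), shape (n,)
    p_model: base model's distribution over the vocabulary, shape (V,)
    """
    # L2 distances to all datastore keys (FAISS handles this at scale;
    # the paper searches 32 clusters of an IVF index for 64 neighbors).
    dists = np.linalg.norm(keys - query, axis=1)
    nn = np.argsort(dists)[:k]

    # Softmax over negative distances, scaled by the temperature.
    logits = -dists[nn] / temperature
    w = np.exp(logits - logits.max())
    w /= w.sum()

    # Aggregate neighbor weights onto their target tokens.
    p_knn = np.zeros_like(p_model)
    np.add.at(p_knn, values[nn], w)

    # Interpolate with the base model distribution.
    return lam * p_knn + (1.0 - lam) * p_model
```

Both the interpolation weight `lam` and the softmax `temperature` correspond to the hyperparameters the paper reports tuning on the validation sets.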