Depthwise Separable Convolutions for Neural Machine Translation

Authors: Łukasz Kaiser, Aidan N. Gomez, François Chollet

ICLR 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We design our experiments with the goal to answer two key questions: What is the performance impact of replacing convolutions in a ByteNet-like model with depthwise separable convolutions? What is the performance trade-off of reducing dilation while correspondingly increasing convolution window size? In addition, we make two auxiliary experiments: […]"
Researcher Affiliation | Collaboration | Łukasz Kaiser (Google Brain, lukaszkaiser@google.com); Aidan N. Gomez (University of Toronto, aidan@cs.toronto.edu); François Chollet (Google Brain, fchollet@google.com)
Pseudocode | No | The paper describes the model architecture and its components using mathematical equations and function definitions (e.g., ConvStep_{k=K,d=D}(x) = LN(SepConv(W_p, W_d, ReLU(x)))), but it does not present structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at https://github.com/tensorflow/tensor2tensor.
Open Datasets | Yes | "We evaluate all models on the WMT English to German translation task and use newstest2013 evaluation set for this purpose. For two best large models, we also provide results on the standard test set, newstest2014, to compare with other works. For tokenization, we use subword units, and follow the same tokenization process as Sennrich et al. (2015)."
Dataset Splits | No | The paper names the standard WMT evaluation set (newstest2013) and test set (newstest2014), which are established benchmarks. However, it does not explicitly state train/validation/test splits (e.g., percentages or counts), nor the methodology for creating such splits from a larger dataset, beyond naming the evaluation and test sets.
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., CPU or GPU models, or cloud instances) used to run the experiments.
Software Dependencies | No | "All of our experiments are implemented using the TensorFlow framework (Abadi et al., 2015)." While TensorFlow is named, no version number for it or any other software dependency is provided.
Experiment Setup | Yes | "Performance on WMT EN-DE after 250k gradient descent steps" (Table 2 caption); "For getting the BLEU, we used a beam-search decoder with a beam size of 4 and a length penalty tuned on the evaluation set (newstest2013)."
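The ConvStep building block quoted under "Pseudocode" above, ConvStep_{k=K,d=D}(x) = LN(SepConv(W_p, W_d, ReLU(x))), can be illustrated with a minimal NumPy sketch. This is not the paper's TensorFlow implementation: the SAME padding choice, the layer normalization over the channel axis, and all function and variable names (`sep_conv1d`, `conv_step`, `w_depth`, `w_point`) are assumptions made for illustration.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def layer_norm(x, eps=1e-6):
    # Normalize over the channel dimension (last axis); no learned scale/bias here.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def sep_conv1d(x, w_depth, w_point, dilation=1):
    """Depthwise separable 1D convolution (illustrative).

    x:       (length, channels)
    w_depth: (kernel, channels)       -- one filter per input channel (depthwise)
    w_point: (channels, out_channels) -- 1x1 pointwise channel mixing
    """
    length, channels = x.shape
    kernel = w_depth.shape[0]
    # SAME padding so the output length matches the input length (assumption).
    pad = (kernel - 1) * dilation // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    # Depthwise step: each channel is convolved independently with its own filter.
    depth_out = np.zeros((length, channels))
    for t in range(length):
        for k in range(kernel):
            depth_out[t] += w_depth[k] * xp[t + k * dilation]
    # Pointwise step: mix channels with a 1x1 convolution.
    return depth_out @ w_point

def conv_step(x, w_depth, w_point, dilation=1):
    # ConvStep(x) = LN(SepConv(W_p, W_d, ReLU(x)))
    return layer_norm(sep_conv1d(relu(x), w_depth, w_point, dilation))

rng = np.random.default_rng(0)
x = rng.standard_normal((10, 8))       # sequence length 10, 8 channels
w_depth = rng.standard_normal((3, 8))  # convolution window size k=3
w_point = rng.standard_normal((8, 8))
y = conv_step(x, w_depth, w_point, dilation=1)
print(y.shape)  # (10, 8)
```

The separable factorization is the source of the parameter savings the paper studies: the depthwise filters cost kernel × channels weights and the pointwise mixing costs channels × out_channels, versus kernel × channels × out_channels for a regular convolution.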