Depthwise Separable Convolutions for Neural Machine Translation
Authors: Łukasz Kaiser, Aidan N. Gomez, François Chollet
ICLR 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We design our experiments with the goal to answer two key questions: What is the performance impact of replacing convolutions in a ByteNet-like model with depthwise separable convolutions? What is the performance trade-off of reducing dilation while correspondingly increasing convolution window size? In addition, we make two auxiliary experiments. |
| Researcher Affiliation | Collaboration | Łukasz Kaiser Google Brain lukaszkaiser@google.com Aidan N. Gomez University of Toronto aidan@cs.toronto.edu François Chollet Google Brain fchollet@google.com |
| Pseudocode | No | The paper describes the model architecture and components using mathematical equations and function definitions (e.g., "ConvStep_{k=K,d=D}(x) = LN(SepConv(W_p, W_d, ReLU(x)))"), but it does not present structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code available at https://github.com/tensorflow/tensor2tensor |
| Open Datasets | Yes | We evaluate all models on the WMT English to German translation task and use newstest2013 evaluation set for this purpose. For two best large models, we also provide results on the standard test set, newstest2014, to compare with other works. For tokenization, we use subword units, and follow the same tokenization process as Sennrich et al. (2015). |
| Dataset Splits | No | We evaluate all models on the WMT English to German translation task and use newstest2013 evaluation set for this purpose. For two best large models, we also provide results on the standard test set, newstest2014, to compare with other works. The paper names standard WMT benchmark sets for evaluation (newstest2013) and testing (newstest2014), but beyond naming them it does not explicitly state the train/validation/test splits (e.g., percentages or counts) or the methodology for creating them from a larger dataset. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., CPU, GPU models, or cloud instances) used for running the experiments. |
| Software Dependencies | No | All of our experiments are implemented using the TensorFlow framework (Abadi et al., 2015). While TensorFlow is named, no specific version number for it or any other software dependency is provided. |
| Experiment Setup | Yes | "Performance on WMT EN-DE after 250k gradient descent steps" (Table 2 caption); "For getting the BLEU, we used a beam-search decoder with a beam size of 4 and a length penalty tuned on the evaluation set (newstest2013)." |
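The ConvStep definition quoted in the Pseudocode row (ConvStep_{k,d}(x) = LN(SepConv(W_p, W_d, ReLU(x)))) can be illustrated with a minimal sketch. This is not the paper's tensor2tensor implementation; it is a dependency-free NumPy rendering of the idea, assuming a 1D sequence input of shape (length, channels), symmetric "same" padding, and layer normalization over the channel axis. All function and variable names here are illustrative.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def layer_norm(x, eps=1e-6):
    # Normalize each position over its channel dimension (LN in the paper's notation).
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def sep_conv1d(x, w_depth, w_point, dilation=1):
    """Depthwise separable 1D convolution (SepConv in the paper's notation).

    x        : (length, channels) input sequence
    w_depth  : (k, channels) depthwise weights, one filter per input channel
    w_point  : (channels, out_channels) pointwise (1x1) channel-mixing weights
    """
    k, c = w_depth.shape
    length = x.shape[0]
    pad = (k - 1) * dilation // 2  # "same" padding for odd k
    xp = np.pad(x, ((pad, pad), (0, 0)))
    # Depthwise step: each channel is convolved independently with its own filter.
    depth_out = np.zeros((length, c))
    for t in range(length):
        for j in range(k):
            depth_out[t] += w_depth[j] * xp[t + j * dilation]
    # Pointwise step: a 1x1 convolution mixes information across channels.
    return depth_out @ w_point

def conv_step(x, w_depth, w_point, dilation=1):
    # ConvStep_{k,d}(x) = LN(SepConv(W_p, W_d, ReLU(x)))
    return layer_norm(sep_conv1d(relu(x), w_depth, w_point, dilation))
```

The parameter saving that motivates the paper falls out directly: a regular k-wide convolution over c channels needs k * c * c weights, while the separable version needs only k * c (depthwise) plus c * c (pointwise).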