Universal Transformers
Authors: Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, Łukasz Kaiser
ICLR 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 3 EXPERIMENTS AND ANALYSIS: We evaluated the Universal Transformer on a range of algorithmic and language understanding tasks, as well as on machine translation. |
| Researcher Affiliation | Collaboration | Mostafa Dehghani (University of Amsterdam, dehghani@uva.nl); Stephan Gouws (DeepMind, sgouws@google.com); Oriol Vinyals (DeepMind, vinyals@google.com); Jakob Uszkoreit (Google Brain, usz@google.com); Łukasz Kaiser (Google Brain, lukaszkaiser@google.com) |
| Pseudocode | Yes | APPENDIX C UT WITH DYNAMIC HALTING: We implement the dynamic halting based on ACT (Graves, 2016) as follows in TensorFlow. In each step of the UT with dynamic halting, we are given the halting probabilities, remainders, number of updates up to that point, and the previous state (all initialized as zeros), as well as a scalar threshold between 0 and 1 (a hyper-parameter). We then compute the new state for each position and calculate the new per-position halting probabilities based on the state for each position. The UT then decides to halt for some positions that crossed the threshold, and updates the state of other positions until the model halts for all blocks, or we reach a maximum number of steps. Listing 1: UT with dynamic halting. Listing 2: Computations in each step of the UT with dynamic halting. (A sketch of this halting loop appears after the table.) |
| Open Source Code | Yes | The code used to train and evaluate Universal Transformers is available at https://github.com/tensorflow/tensor2tensor (Vaswani et al., 2018). |
| Open Datasets | Yes | The bAbI question answering dataset (Weston et al., 2015) consists of 20 different tasks... We use the dataset provided by (Linzen et al., 2016)... The LAMBADA task (Paperno et al., 2016) is a language modeling task... We trained a UT on the WMT 2014 English-German translation task... |
| Dataset Splits | Yes | We conducted 10 runs with different initializations and picked the best model based on performance on the validation set, similar to previous work. |
| Hardware Specification | Yes | Table 7: Machine translation results (BLEU) on the WMT14 En-De translation task trained on 8x P100 GPUs in comparable training setups: Universal Transformer small 26.8; Transformer base (Vaswani et al., 2017) 28.0; Weighted Transformer base (Ahmed et al., 2017) 28.4; Universal Transformer base 28.9. All base results have the same number of parameters. |
| Software Dependencies | No | We implement the dynamic halting based on ACT (Graves, 2016) as follows in TensorFlow. In each step of the UT with dynamic halting, we are given the halting probabilities, remainders, number of updates up to that point, and the previous state (all initialized as zeros), as well as a scalar threshold between 0 and 1 (a hyper-parameter). We then compute the new state for each position and calculate the new per-position halting probabilities based on the state for each position. The UT then decides to halt for some positions that crossed the threshold, and updates the state of other positions until the model halts for all blocks, or we reach a maximum number of steps. Listing 1: UT with dynamic halting. |
| Experiment Setup | Yes | We evaluated the Universal Transformer on a range of algorithmic and language understanding tasks, as well as on machine translation. ... To encode the input, similar to Henaff et al. (2016), we first encode each fact in the story by applying a learned multiplicative positional mask to each word's embedding, and summing up all embeddings. We embed the question in the same way, and then feed the (Universal) Transformer with these embeddings of the facts and questions. ... We trained UTs on three algorithmic tasks, namely Copy, Reverse, and (integer) Addition, all on strings composed of decimal symbols (0-9). In all the experiments, we train the models on sequences of length 40 and evaluate on sequences of length 400 (Kaiser & Sutskever, 2016). (Sketches of the fact encoding and of the algorithmic-task data appear after the table.) |
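
The dynamic halting quoted in the Pseudocode row is standard per-position ACT bookkeeping. Below is a minimal NumPy sketch of that loop, written from the quoted description rather than from the paper's TensorFlow Listings 1 and 2; `transition_fn` (the UT's recurrent block) and `halting_unit` (the learned per-position sigmoid) are hypothetical stand-ins, and the `threshold` and `max_steps` values are illustrative.

```python
# Minimal NumPy sketch of per-position ACT-style dynamic halting, written from
# the quoted description (not the paper's TensorFlow Listings 1-2).
# `transition_fn` and `halting_unit` are hypothetical stand-ins for the UT's
# recurrent block and its learned per-position sigmoid halting classifier.
import numpy as np

def dynamic_halting(state, transition_fn, halting_unit,
                    threshold=0.99, max_steps=8):
    """state: (batch, length, d_model) per-position representations."""
    batch, length, _ = state.shape
    halting_prob = np.zeros((batch, length))  # accumulated halting probability
    remainders   = np.zeros((batch, length))  # leftover probability mass at halting
    n_updates    = np.zeros((batch, length))  # number of updates per position

    for _ in range(max_steps):
        p = halting_unit(state)               # (batch, length), values in (0, 1)
        still_running = (halting_prob < 1.0).astype(float)
        # Positions whose accumulated probability would cross the threshold halt now.
        new_halted = ((halting_prob + p * still_running) > threshold).astype(float) * still_running
        still_running = ((halting_prob + p * still_running) <= threshold).astype(float) * still_running
        halting_prob += p * still_running
        remainders += new_halted * (1.0 - halting_prob)
        halting_prob += new_halted * remainders
        n_updates += still_running + new_halted
        # Halted positions receive one last update, weighted by their remainder.
        update_w = (p * still_running + new_halted * remainders)[..., None]
        state = transition_fn(state) * update_w + state * (1.0 - update_w)
        if not still_running.any():           # every position has halted
            break
    return state, n_updates, remainders

# Toy usage with random stand-ins for the learned components.
rng = np.random.default_rng(0)
d_model = 64
w = rng.normal(size=(d_model,))
transition_fn = lambda s: np.tanh(s)                      # placeholder UT block
halting_unit  = lambda s: 1.0 / (1.0 + np.exp(-(s @ w)))  # per-position sigmoid
out, n_updates, remainders = dynamic_halting(rng.normal(size=(2, 5, d_model)),
                                             transition_fn, halting_unit)
```

With these stand-ins, the loop reproduces the bookkeeping in the quote: per-position probabilities accumulate until they cross the threshold, remainders weight each position's final update, and the loop stops once every position has halted or the maximum number of steps is reached.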
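For the bAbI setup quoted in the Experiment Setup row, each fact is encoded by applying a learned multiplicative positional mask to its word embeddings and summing. The sketch below illustrates that operation with randomly initialized stand-ins for the learned embedding table and mask; the names and sizes are assumptions, not the paper's code.

```python
# Minimal NumPy sketch of the bAbI fact encoding described above: a learned
# multiplicative positional mask is applied element-wise to each word embedding
# in a fact, and the masked embeddings are summed into one fact vector.
# `embed` and `pos_mask` are randomly initialized stand-ins for learned parameters.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_emb, max_fact_len = 100, 64, 12
embed    = rng.normal(size=(vocab_size, d_emb))    # word embedding table
pos_mask = rng.normal(size=(max_fact_len, d_emb))  # learned multiplicative positional mask

def encode_fact(word_ids):
    """word_ids: token ids of one fact (length <= max_fact_len)."""
    vecs = embed[np.asarray(word_ids)]             # (len, d_emb)
    masked = vecs * pos_mask[:len(word_ids)]       # element-wise positional masking
    return masked.sum(axis=0)                      # (d_emb,) fact representation

story = [[4, 17, 23, 8], [4, 31, 23, 9]]           # two facts as token-id lists
fact_vectors = np.stack([encode_fact(f) for f in story])
print(fact_vectors.shape)                          # (2, 64)
```

The question would be embedded the same way, and the resulting fact and question vectors fed to the (Universal) Transformer, as in the quoted setup.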
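The quoted algorithmic tasks (Copy, Reverse, and integer Addition over decimal symbols, trained at length 40 and evaluated at length 400) can be generated with a small helper. The function below is an illustrative assumption about the data format, not the tensor2tensor problem definitions used in the paper.

```python
# Illustrative generator for the quoted algorithmic tasks (Copy, Reverse,
# integer Addition over decimal symbols 0-9), with training examples of
# length 40 and evaluation examples of length 400.
import random

def make_example(task, length, seed=None):
    rng = random.Random(seed)
    src = "".join(str(rng.randint(0, 9)) for _ in range(length))
    if task == "copy":
        tgt = src
    elif task == "reverse":
        tgt = src[::-1]
    elif task == "addition":                       # add the two halves as integers
        a, b = src[:length // 2], src[length // 2:]
        tgt = str(int(a) + int(b))
    else:
        raise ValueError(f"unknown task: {task}")
    return src, tgt

train_pair = make_example("reverse", 40, seed=1)   # training length
eval_pair  = make_example("reverse", 400, seed=2)  # length-generalization evaluation
```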