Step-unrolled Denoising Autoencoders for Text Generation
Authors: Nikolay Savinov, Junyoung Chung, Mikołaj Bińkowski, Erich Elsen, Aäron van den Oord
ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper we propose a novel non-autoregressive method which shows state-of-the-art results in machine translation on WMT'14 EN-DE raw data (without distillation from AR) amongst non-AR methods and good qualitative results on unconditional language modeling on the Colossal Clean Common Crawl (C4) dataset (Raffel et al., 2019) and a dataset of Python code from GitHub. Our model operates as a time-homogeneous Markov chain similar to that of Lee et al. (2018): conditioned on the corrupted data, it tries to approximate the original uncorrupted samples by a per-token... (Section 3: EXPERIMENTS; a minimal sketch of this unrolled-denoising step appears below the table) |
| Researcher Affiliation | Industry | Nikolay Savinov*, Junyoung Chung*, Mikołaj Bińkowski*, Erich Elsen, Aäron van den Oord (DeepMind, London, UK) |
| Pseudocode | Yes | Appendix J: PSEUDOCODE OF SUNDAE |
| Open Source Code | No | The paper does not include an explicit statement about the release of its source code or a link to a code repository for the SUNDAE methodology. |
| Open Datasets | Yes | We conduct experiments on WMT'14 parallel corpora using EN-DE (4.5M pairs) and EN-FR (36M pairs) translation tasks. The raw texts are encoded using BPE (Sennrich et al., 2015) as the subword units, and we use the same preprocessed data as in Vaswani et al. (2017) for fair comparisons. We train our method on a large high-quality publicly available Colossal Clean Common Crawl (C4) dataset (Raffel et al., 2019) to demonstrate samples. We use the same architecture and optimizer as in qualitative experiments and train with batch size 1024 for 8K steps (chosen to achieve the lowest validation loss). We follow the same tokenization strategy as d'Autume et al. (2019), with vocabulary size of 5.7K and maximum length 52 tokens, padding shorter sequences to this maximum length. We construct the code dataset by extracting files ending in .py from open-source GitHub repositories with licenses that are one of apache-2.0, mit, bsd-3-clause, bsd-2-clause, unlicense, cc0-1.0, isc, artistic-2.0 (a sketch of this filtering step appears below the table). |
| Dataset Splits | Yes | We evaluate the performance by measuring BLEU (Papineni et al., 2002; Post, 2018) on the test split of each translation task. All hyperparameter tuning is performed on a held-out validation set. (A minimal SacreBLEU usage example appears below the table.) |
| Hardware Specification | Yes | Our models were trained for 10⁶ steps using 16 TPU accelerators with bfloat16 precision. |
| Software Dependencies | No | The paper mentions software like BPE, Adam optimizer, SentencePiece, and SacreBLEU, but does not provide specific version numbers for these software dependencies (e.g., 'SacreBLEU (Post, 2018)' refers to the publication year, not a software version). |
| Experiment Setup | Yes | We use the encoder-decoder Transformer architecture for MT (Vaswani et al., 2017), but remove the causality masking in the decoder. There are 6 attention layers for both encoder and decoder, 8 attention heads, 512 model dimension and 2048 feedforward hidden dimension. The total number of parameters is 63M including the target length prediction module described in Section 3.1. We use dropout (p = 0.1) and label smoothing (ϵ = 0.1) during training for all tasks... The training batch size is 4096, and we use Adam (Kingma & Ba, 2014) with β1 = 0.9, β2 = 0.999, ϵ = 10⁻⁶ and weight decay of 0.1 (Loshchilov & Hutter, 2017). We warm up the learning rate from 10⁻⁷ to 10⁻⁴ in the first 5K steps and decay it to 10⁻⁵ using cosine annealing (Loshchilov & Hutter, 2016). Our models were trained for 10⁶ steps... (A sketch of this learning-rate schedule appears below the table.) |
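The excerpt in the Research Type row describes SUNDAE's core mechanism: a time-homogeneous Markov chain whose transition is a per-token denoiser, trained by unrolling the denoiser on its own samples. The snippet below is a minimal sketch of that unrolled-denoising objective, assuming a generic token-level `denoiser` that maps corrupted token ids to per-token logits; the corruption scheme, function names, and two-step unroll are illustrative assumptions, not the authors' code.

```python
# Hypothetical sketch of an unrolled-denoising loss (two unroll steps).
import torch
import torch.nn.functional as F

def random_corrupt(targets, vocab_size):
    """Replace a random per-sequence fraction of tokens with uniform random tokens."""
    mask_prob = torch.rand(targets.shape[0], 1, device=targets.device)
    corrupt_mask = torch.rand(targets.shape, device=targets.device) < mask_prob
    random_tokens = torch.randint_like(targets, vocab_size)
    return torch.where(corrupt_mask, random_tokens, targets)

def unrolled_denoising_loss(denoiser, targets, vocab_size, unroll_steps=2):
    """Average per-token cross-entropy against the clean targets, where each
    later unroll step is fed samples drawn from the previous step's logits."""
    inputs = random_corrupt(targets, vocab_size)
    losses = []
    for _ in range(unroll_steps):
        logits = denoiser(inputs)  # (batch, length, vocab)
        losses.append(F.cross_entropy(logits.transpose(1, 2), targets))
        with torch.no_grad():  # re-feed the model its own samples (the "unroll")
            inputs = torch.distributions.Categorical(logits=logits).sample()
    return torch.stack(losses).mean()
```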
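The Open Datasets row describes building the Python corpus by keeping `.py` files from GitHub repositories under a fixed set of permissive licenses. The sketch below shows only that filtering step, assuming a hypothetical iterable of `(file_path, repo_license, text)` records; the rest of the pipeline (crawling, deduplication, tokenization) is not shown.

```python
# Keep only .py files from repositories whose license is in the allowed set.
ALLOWED_LICENSES = {
    "apache-2.0", "mit", "bsd-3-clause", "bsd-2-clause",
    "unlicense", "cc0-1.0", "isc", "artistic-2.0",
}

def python_files(records):
    """Yield file contents for .py files from permissively licensed repositories."""
    for file_path, repo_license, text in records:
        if file_path.endswith(".py") and repo_license in ALLOWED_LICENSES:
            yield text
```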
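The Dataset Splits row states that translation quality is measured with BLEU via SacreBLEU (Post, 2018) on the test split. The example below shows the standard corpus-level SacreBLEU call; the hypothesis and reference strings are placeholders, not data from the paper.

```python
import sacrebleu

hypotheses = ["the cat sat on the mat"]    # system outputs, one string per test sentence
references = [["the cat sat on the mat"]]  # one inner list per reference set
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```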
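The Experiment Setup row specifies warming the learning rate from 10⁻⁷ to 10⁻⁴ over the first 5K steps, then cosine-annealing it down to 10⁻⁵ over the 10⁶ training steps. The function below is a sketch of that schedule under the assumption of linear warmup followed by annealing over the remaining steps; exact endpoint handling in the paper may differ.

```python
import math

def learning_rate(step, warmup_steps=5_000, total_steps=1_000_000,
                  lr_init=1e-7, lr_peak=1e-4, lr_final=1e-5):
    """Linear warmup to lr_peak, then cosine annealing down to lr_final."""
    if step < warmup_steps:
        return lr_init + (step / warmup_steps) * (lr_peak - lr_init)
    frac = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return lr_final + 0.5 * (lr_peak - lr_final) * (1.0 + math.cos(math.pi * frac))
```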