Discovering Non-monotonic Autoregressive Orderings with Variational Inference

Authors: Xuanlin Li, Brandon Trabucco, Dong Huk Park, Michael Luo, Sheng Shen, Trevor Darrell, Yang Gao

ICLR 2021

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results with our solution on image captioning, code generation, text summarization, and machine translation tasks suggest that with similar hyperparameters, our algorithm is capable of recovering autoregressive orders that are even better than fixed orders. |
| Researcher Affiliation | Academia | Xuanlin Li, Brandon Trabucco, Dong Huk Park, Michael Luo (University of California, Berkeley) {xuanlinli17, btrabucco, dong.huk.park, michael.luo}@berkeley.edu; Sheng Shen, Trevor Darrell, Yang Gao (University of California, Berkeley; Tsinghua University) {sheng.s, trevordarrell}@berkeley.edu, gy20073@gmail.com |
| Pseudocode | Yes | Algorithm 1: Variational Order Inference |
| Open Source Code | Yes | Our experimental framework is available at this link. |
| Open Datasets | Yes | For NL2Code, we use Django (Oda et al., 2015). For image captioning, we use COCO 2017 (Lin et al., 2015). For text summarization, we use English Gigaword (Graff et al., 2003; Rush et al., 2015). For machine translation, we use WMT16 Romanian-English (Ro-En). |
| Dataset Splits | Yes | We compare metrics as a function of the sequence length of generated captions on the COCO 2017 validation set. |
| Hardware Specification | Yes | We compare the runtime performance of VOI (K = 4) with SAO on a single Tesla P100 GPU. |
| Software Dependencies | No | The paper mentions using the Adam optimizer and Torchvision, but does not specify version numbers for the main deep learning framework (e.g., PyTorch, TensorFlow) or for other critical software dependencies required for reproduction. |
| Experiment Setup | Yes | For our decoder, we set d_model = 512, d_hidden = 2048, 6 layers for both the Transformer's encoder and decoder, and 8 attention heads. This is the same model configuration as Transformer Base (Vaswani et al., 2017) and as described in Gu et al. (2019a). Our encoder also uses the same configuration. For our model trained with Variational Order Inference, we sample K = 4 latents for each training sample. (See the configuration sketch below.) |
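To make the Experiment Setup row concrete, here is a minimal configuration sketch using the reported sizes (d_model = 512, d_hidden = 2048, 6 encoder and 6 decoder layers, 8 attention heads, K = 4 sampled orderings per training example). PyTorch and `nn.Transformer` are assumptions chosen only for illustration; as noted under Software Dependencies, the paper does not name its framework, and this sketch reproduces only the stated model dimensions, not Variational Order Inference itself.

```python
import torch
import torch.nn as nn

# Hyperparameters quoted from the Experiment Setup row above
# (Transformer Base configuration, Vaswani et al., 2017).
D_MODEL = 512      # model (embedding) dimension
D_HIDDEN = 2048    # feed-forward hidden dimension
NUM_LAYERS = 6     # layers in both encoder and decoder
NUM_HEADS = 8      # attention heads
K_LATENTS = 4      # orderings sampled per training example (not used by this sketch)

# Illustrative encoder-decoder with the stated sizes. How VOI conditions the
# decoder on sampled orderings is not described in this excerpt, so it is not
# modeled here.
model = nn.Transformer(
    d_model=D_MODEL,
    nhead=NUM_HEADS,
    num_encoder_layers=NUM_LAYERS,
    num_decoder_layers=NUM_LAYERS,
    dim_feedforward=D_HIDDEN,
    batch_first=True,
)

# Example forward pass with dummy token embeddings.
src = torch.randn(2, 20, D_MODEL)  # (batch, source length, d_model)
tgt = torch.randn(2, 15, D_MODEL)  # (batch, target length, d_model)
out = model(src, tgt)              # -> (2, 15, D_MODEL)
```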