Discovering Non-monotonic Autoregressive Orderings with Variational Inference
Authors: Xuanlin Li, Brandon Trabucco, Dong Huk Park, Michael Luo, Sheng Shen, Trevor Darrell, Yang Gao
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results with our solution on image captioning, code generation, text summarization, and machine translation tasks suggest that with similar hyperparameters, our algorithm is capable of recovering autoregressive orders that are even better than fixed orders. |
| Researcher Affiliation | Academia | Xuanlin Li, Brandon Trabucco, Dong Huk Park, Michael Luo (University of California, Berkeley) {xuanlinli17, btrabucco, dong.huk.park, michael.luo}@berkeley.edu; Sheng Shen, Trevor Darrell, Yang Gao (University of California, Berkeley; Tsinghua University) {sheng.s, trevordarrell}@berkeley.edu, gy20073@gmail.com |
| Pseudocode | Yes | Algorithm 1 Variational Order Inference |
| Open Source Code | Yes | Our experimental framework is available at this link. |
| Open Datasets | Yes | For NL2Code, we use Django (Oda et al., 2015). For image captioning, we use COCO 2017 (Lin et al., 2015). For text summarization, we use English Gigaword (Graff et al., 2003; Rush et al., 2015). For machine translation, we use WMT16 Romanian-English (Ro-En). |
| Dataset Splits | Yes | We compare metrics as a function of the sequence length of generated captions on the COCO 2017 validation set. |
| Hardware Specification | Yes | We compare the runtime performance of VOI (K = 4) with SAO on a single Tesla P100 GPU. |
| Software Dependencies | No | The paper mentions using the Adam optimizer and torchvision, but does not specify version numbers for the main deep learning framework (e.g., PyTorch, TensorFlow) or for other critical software dependencies required for reproduction. |
| Experiment Setup | Yes | For our decoder, we set dmodel = 512, dhidden = 2048, 6 layers for both Transformer's encoder and decoder, and 8 attention heads. This is the same model configuration as Transformer Base (Vaswani et al., 2017) and as described in Gu et al. (2019a). Our encoder also uses the same configuration. For our model trained with Variational Order Inference, we sample K = 4 latents for each training sample. |
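The Experiment Setup row above reports a Transformer Base configuration (dmodel = 512, dhidden = 2048, 6 encoder and 6 decoder layers, 8 attention heads) with K = 4 sampled orderings per training example. The following is a minimal sketch of that configuration, assuming a PyTorch implementation; `torch.nn.Transformer` and the constant names are illustrative choices, not the authors' actual code.

```python
# Sketch of the reported Transformer Base hyperparameters (assumed PyTorch API).
import torch.nn as nn

D_MODEL = 512      # d_model: embedding / hidden size
D_HIDDEN = 2048    # d_hidden: feed-forward inner dimension
NUM_LAYERS = 6     # layers in both the encoder and the decoder
NUM_HEADS = 8      # attention heads
K_LATENTS = 4      # K: orderings (latents) sampled per training example in VOI

transformer = nn.Transformer(
    d_model=D_MODEL,
    nhead=NUM_HEADS,
    num_encoder_layers=NUM_LAYERS,
    num_decoder_layers=NUM_LAYERS,
    dim_feedforward=D_HIDDEN,
)
```

These values match Transformer Base (Vaswani et al., 2017); how the K = 4 sampled orderings feed into the variational training objective is described in the paper's Algorithm 1 and is not reproduced here.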