Argmax Flows and Multinomial Diffusion: Learning Categorical Distributions
Authors: Emiel Hoogeboom, Didrik Nielsen, Priyank Jaini, Patrick Forré, Max Welling
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that our method outperforms existing dequantization approaches on text modelling and modelling on image segmentation maps in log-likelihood. In our experiments we compare the performance of our methods on language modelling tasks and learning image segmentation maps unconditionally. |
| Researcher Affiliation | Academia | Emiel Hoogeboom¹, Didrik Nielsen², Priyank Jaini¹, Patrick Forré³, Max Welling¹ (¹UvA-Bosch Delta Lab, University of Amsterdam; ²Technical University of Denmark; ³University of Amsterdam) |
| Pseudocode | Yes | Algorithm 1: Sampling from Argmax Flows; Algorithm 2: Optimizing Argmax Flows; Algorithm 3: Thresholding-based q(v|x); Algorithm 4: Gumbel-based q(v|x) (a sketch of the thresholding step appears below the table) |
| Open Source Code | No | No statement or link regarding open-source code for their method. |
| Open Datasets | Yes | In this section we compare our methods on two language datasets, text8 and enwik8. For image-type data, we introduce a categorical image dataset: the Cityscapes dataset is repurposed for unconditional image segmentation learning. Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The Cityscapes Dataset for Semantic Urban Scene Understanding. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016; pp. 3213–3223. |
| Dataset Splits | No | The Multinomial Diffusion model performs somewhat worse with 0.37 bpp on test whereas it scored 0.33 bpp on train. Interestingly, this is the only model where overfitting was an issue and data augmentation was required, which may explain part of the performance difference. For all other models, training performance was comparable to test and validation performance. |
| Hardware Specification | No | No specific hardware details are provided. |
| Software Dependencies | No | In the multinomial text diffusion model, the µ network is modeled by a 12-layer Transformer. The density model p(v) is defined using affine coupling layers parametrized by DenseNets (Huang et al., 2017). (A sketch of the diffusion forward process appears below the table.) |
| Experiment Setup | No | Model description: Two versions of generative argmax flows are tested, using an autoregressive (AR) flow and a coupling-based flow for p(v). In these experiments the probabilistic inverse is based on the thresholding approach. Specifically, a conditional diagonal Gaussian q(u|x) is trained and thresholded, which gives the distribution q(v|x). The argmax flow is defined on binary Cartesian products: for K = 27 a 5-dimensional binary space is used, and for K = 256 an 8-dimensional binary space (a sketch of this encoding appears below the table). The argmax flow is compared to the current standard of training generative flows directly on discrete data: dequantization. We compare to both uniform and variational dequantization, where noise on the (0, 1) interval is added to the one-hot representation of the categorical data. The autoregressive density model is based on the model proposed by Lippe and Gavves (2020). The coupling density model consists of 8 flow layers, where each layer consists of a 1×1 convolution and mixture-of-logistics transformations (Ho et al., 2019). In the multinomial text diffusion model, the µ network is modeled by a 12-layer Transformer. For more extensive details about the experiment setup, see Appendix B. |
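The Pseudocode row lists a thresholding-based probabilistic inverse q(v|x) (Algorithm 3 in the paper): noise u is sampled from a conditional Gaussian, the coordinate at the observed class index stays free, and every other coordinate is pushed strictly below it with a softplus so that the argmax constraint holds. The PyTorch sketch below is a minimal illustration of that step; the function name, shapes, and usage check are assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def thresholding_inverse(u: torch.Tensor, x: torch.Tensor):
    """Map unconstrained noise u (batch, K) to v with argmax(v) = x.

    The coordinate at index x stays free (v_x = u_x); every other
    coordinate is pushed strictly below it via the smooth, invertible
    map v_i = v_x - softplus(v_x - u_i). Returns v together with the
    log |det Jacobian| of the map, needed for the variational bound.
    """
    t = u.gather(1, x.unsqueeze(1))          # (batch, 1): the free coordinate u_x
    v = t - F.softplus(t - u)                # threshold all coordinates below t
    v = v.scatter(1, x.unsqueeze(1), t)      # keep v_x = u_x exactly
    # dv_i/du_i = sigmoid(t - u_i) for i != x; coordinate x is the identity.
    log_det = F.logsigmoid(t - u).scatter(1, x.unsqueeze(1), torch.zeros_like(t))
    return v, log_det.sum(dim=1)

# Quick check: the argmax constraint holds for arbitrary noise.
u = torch.randn(4, 27)
x = torch.randint(0, 27, (4,))
v, log_det = thresholding_inverse(u, x)
assert torch.equal(v.argmax(dim=1), x)
```

Because softplus is strictly positive, every thresholded coordinate lands strictly below v_x, so the argmax constraint can never be violated.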
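The Multinomial Diffusion model referenced in several rows corrupts categorical data by gradually interpolating towards a uniform distribution over the K classes; the paper gives the closed-form marginal q(x_t | x_0) = Cat(ᾱ_t x_0 + (1 − ᾱ_t)/K) with ᾱ_t = ∏(1 − β_s). Below is a minimal sketch of that forward noising step, assuming PyTorch; the linear beta schedule and batch shapes are illustrative assumptions rather than the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def q_xt_given_x0(x0_onehot: torch.Tensor, alpha_bar_t: float, K: int):
    """Closed-form marginal of the categorical forward process:
    with probability alpha_bar_t the original class survives, otherwise
    the class is resampled uniformly over the K categories."""
    probs = alpha_bar_t * x0_onehot + (1.0 - alpha_bar_t) / K
    return torch.distributions.Categorical(probs=probs)

# Usage: noise a batch of class labels at step t of a T-step schedule.
K, T, t = 27, 1000, 250
betas = torch.linspace(1e-4, 0.02, T)             # illustrative linear schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)     # alpha_bar_t = prod(1 - beta_s)
x0 = F.one_hot(torch.randint(0, K, (8,)), K).float()
xt = q_xt_given_x0(x0, alpha_bar[t].item(), K).sample()   # (8,) noisy labels
```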
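Finally, the experiment setup defines the argmax flow on binary Cartesian products, so each of K categories is represented by ceil(log2 K) binary dimensions: 5 for K = 27 and 8 for K = 256. A hedged sketch of that index-to-bits mapping, assuming PyTorch; the helper name is hypothetical.

```python
import math
import torch

def to_binary(labels: torch.Tensor, K: int) -> torch.Tensor:
    """Encode integer class labels (batch,) as bits (batch, num_bits)."""
    num_bits = math.ceil(math.log2(K))            # 5 for K = 27, 8 for K = 256
    shifts = torch.arange(num_bits)
    return (labels.unsqueeze(-1) >> shifts) & 1   # least-significant bit first

print(to_binary(torch.tensor([0, 1, 26]), K=27))  # three 5-dimensional codes
```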