Training and Inference on Any-Order Autoregressive Models the Right Way

Authors: Andy Shih, Dorsa Sadigh, Stefano Ermon

NeurIPS 2022

Reproducibility Variable | Result | LLM Response

- Research Type: Experimental. "We evaluate MAC on high-dimensional language and image domains, and on a set of continuous tabular benchmarks. We focus on two metrics: joint likelihood and marginal likelihood of the test set. On both metrics, MAC shows state-of-the-art performance among arbitrary conditional models on the majority of benchmarks."
- Researcher Affiliation: Academia. Andy Shih, Dorsa Sadigh, and Stefano Ermon are all affiliated with Stanford University (andyshih@cs.stanford.edu, dorsa@cs.stanford.edu, ermon@cs.stanford.edu).
- Pseudocode: Yes. Algorithm 1 (Training MAC) and Algorithm 2 (Testing MAC).
- Open Source Code: Yes. "(a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] In the supplemental material."
- Open Datasets: Yes. "We learn a character-level model on chunks of 250 characters using the Text8 dataset [24]... We evaluate MAC on CIFAR10 [19] and ImageNet32 [4, 6]..." All datasets are publicly available.
- Dataset Splits: No. The paper uses standard datasets (Text8, CIFAR10, ImageNet32) and states that it follows the setup of previous work, but it does not explicitly give training/validation/test split percentages or sample counts.
- Hardware Specification: Yes. "Each run was done on a single NVIDIA A40."
- Software Dependencies: No. The paper names model architectures and general frameworks (e.g., Transformer, U-Net, EBMs) but gives no specific software dependencies with version numbers (e.g., PyTorch 1.9, TensorFlow 2.x).
- Experiment Setup: Yes. "We trained each model for approximately two weeks. Although this was not enough to match the total number of epochs trained by the baseline ARDM, we were still able to show state-of-the-art performance for arbitrary conditional models on 2 out of the 3 language/image benchmarks, and beat the baselines on all 3 benchmarks when compared under the same number of training epochs." Comparisons use matched epoch counts (ARDM at 3000 epochs vs. MAC at 3000 epochs; ARDM at 1200 epochs vs. MAC at 1200 epochs), "keeping all hyperparameters and experimental setup the same, modifying only the training mask distribution."
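The phrase "modifying only the training mask distribution" refers to how any-order autoregressive models sample a conditioning mask per training example. Below is a minimal, hypothetical sketch of such a pluggable mask sampler; it is not the authors' MAC implementation (the function name, signature, and the `"uniform"` baseline distribution are assumptions for illustration). Under the standard order-agnostic objective, the number of observed variables is drawn uniformly and a random subset of that size is marked as observed; MAC would swap in a tuned distribution at this same hook.

```python
import numpy as np

def sample_mask(batch_size, dim, rng, mask_dist="uniform"):
    """Sample binary conditioning masks (1 = observed, 0 = to predict).

    Hypothetical helper illustrating where a training mask distribution
    plugs into any-order autoregressive training. Only the 'uniform'
    baseline is sketched here.
    """
    if mask_dist == "uniform":
        # Cardinality t ~ Uniform{0, ..., dim - 1} per example.
        t = rng.integers(0, dim, size=(batch_size, 1))
        # Double argsort turns random scores into a random permutation of
        # ranks 0..dim-1; the t lowest ranks become the observed subset.
        ranks = rng.random((batch_size, dim)).argsort(axis=1).argsort(axis=1)
        return (ranks < t).astype(np.float64)
    raise ValueError(f"unknown mask distribution: {mask_dist!r}")
```

A training step would then mask the input with `sample_mask(...)` and compute the loss only on the unobserved positions; changing `mask_dist` while holding everything else fixed mirrors the controlled comparison described above.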