Training and Inference on Any-Order Autoregressive Models the Right Way
Authors: Andy Shih, Dorsa Sadigh, Stefano Ermon
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate MAC on high-dimensional language and image domains, and on a set of continuous tabular benchmarks. We focus on two metrics: joint likelihood and marginal likelihood of the test set. On both metrics, MAC shows state-of-the-art performance among arbitrary conditional models on the majority of benchmarks. |
| Researcher Affiliation | Academia | Andy Shih (Stanford University, andyshih@cs.stanford.edu); Dorsa Sadigh (Stanford University, dorsa@cs.stanford.edu); Stefano Ermon (Stanford University, ermon@cs.stanford.edu) |
| Pseudocode | Yes | Algorithm 1: Training MAC; Algorithm 2: Testing MAC |
| Open Source Code | Yes | (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] In the supplemental material. |
| Open Datasets | Yes | We learn a character-level model on chunks of 250 characters using the Text8 dataset [24]... We evaluate MAC on CIFAR10 [19] and ImageNet32 [4, 6]... All datasets are publicly available. |
| Dataset Splits | No | The paper uses standard datasets (Text8, CIFAR10, ImageNet32) and states that it follows the setup from previous work, but it does not explicitly provide training/validation/test split percentages or sample counts in the text. |
| Hardware Specification | Yes | Each run was done on a single NVIDIA A40. |
| Software Dependencies | No | The paper names model architectures and model classes (e.g., Transformer, U-Net, EBMs) but does not list specific software dependencies with version numbers (e.g., PyTorch 1.9, TensorFlow 2.x). |
| Experiment Setup | Yes | We trained each model for approximately two weeks. Although this was not enough to match the total number of epochs trained by the baseline ARDM, we were still able to show state-of-the-art performance for arbitrary conditional models on 2 out of the 3 language/image benchmarks, and beat the baselines on all 3 benchmarks when compared under the same number of training epochs. ... ARDM (3000 epochs), MAC (3000 epochs) ... ARDM (1200 epochs), MAC (1200 epochs) ... keeping all hyperparameters and experimental setup the same, modifying only the training mask distribution. (See the hedged sketch after this table.) |
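The "Pseudocode" and "Experiment Setup" rows point at the paper's Algorithm 1 (Training MAC) and the claim that MAC changes only the training mask distribution relative to the ARDM baseline. The paper's algorithms are not reproduced here; below is a minimal PyTorch-style sketch of a generic any-order autoregressive (AO-ARM) training step with a configurable distribution over mask sizes, to illustrate where such a change would plug in. The function name `aoarm_step`, the model signature, and the use of `vocab_size` as the mask-token id are illustrative assumptions, not the authors' released code.

```python
# Hypothetical sketch of one AO-ARM training step with a configurable
# distribution over mask sizes. Not the authors' implementation.
import torch
import torch.nn.functional as F

def aoarm_step(model, x, vocab_size, cardinality_weights=None):
    """x: LongTensor of token ids, shape (batch, D)."""
    B, D = x.shape
    if cardinality_weights is None:
        # Vanilla AO-ARM/ARDM: number of masked positions m ~ Uniform{1..D}.
        m = torch.randint(1, D + 1, (B,), device=x.device)
    else:
        # A MAC-style modification, per the paper's setup description:
        # draw m from a tuned distribution over mask sizes instead.
        m = torch.multinomial(cardinality_weights, B, replacement=True) + 1

    # Mask a uniformly random subset of m positions per example.
    scores = torch.rand(B, D, device=x.device)
    kth = scores.sort(dim=1).values.gather(1, (m - 1).unsqueeze(1))
    mask = scores <= kth  # True at exactly m hidden positions

    # Predict every hidden token from the visible ones (assumed signature:
    # the model maps masked ids to per-position logits of shape (B, D, V)).
    logits = model(x.masked_fill(mask, vocab_size))
    nll = F.cross_entropy(logits.transpose(1, 2), x, reduction="none")  # (B, D)

    # Weighting the masked-position average by D gives an unbiased estimate
    # of (a bound on) the joint negative log-likelihood.
    per_example = D * (nll * mask).sum(dim=1) / m.float()
    return per_example.mean()
```

Under these assumptions, reproducing the paper's ablation would mean training twice with identical hyperparameters and swapping only `cardinality_weights` between the uniform default and a tuned distribution.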