Likelihood-Based Diffusion Language Models
Authors: Ishaan Gulrajani, Tatsunori B. Hashimoto
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Using our methods and scaling analysis, we train and release Plaid 1B, a large diffusion language model which outperforms GPT-2 124M in likelihood on benchmark datasets and generates fluent samples in unconditional and zero-shot control settings. In this section, we validate different aspects of the Plaid framework through compute-matched ablation experiments. |
| Researcher Affiliation | Academia | Ishaan Gulrajani, Stanford University, igul222@gmail.com; Tatsunori B. Hashimoto, Stanford University, thashim@stanford.edu |
| Pseudocode | No | No explicit pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | Yes | 1We release our code and pretrained models at https://github.com/igul222/plaid. |
| Open Datasets | Yes | We train and release Plaid 1B, a large diffusion language model pretrained on OpenWebText2 [7]. |
| Dataset Splits | No | The paper mentions training on OpenWebText2, reporting results on 'held-out data', and evaluating benchmark datasets via 'zero-shot likelihood' on 'nonoverlapping 1024-token sequences'. However, it does not provide specific percentages, sample counts, or explicit train/validation/test split definitions for its own models, nor explain how the benchmark splits were derived. |
| Hardware Specification | Yes | All of our small runs take less than 24 hours on a single A100. Training took 30 days on 8 A100s. |
| Software Dependencies | No | The paper mentions the use of Flash Attention and µTransfer, but no specific version numbers for software libraries or dependencies are provided. |
| Experiment Setup | Yes | Our reference model (full method) is a 16-layer, width-384 Transformer with 28M non-embedding parameters, trained for 92K steps at batch size 256 and sequence length 256... We optimize all models using AdamW with parameter-specific learning rates derived by µTransfer [35] based on a learning rate of 1.4 × 10⁻³ at width 256. Each parameter's weight decay is set to 4 × 10⁻⁵ · η, where η is that parameter's learning rate. We use a linear warmup on the learning rate and weight decay over the first 2500 steps, followed by a linear decay to zero over training. We train at batch size 256 for algorithm ablations and 128 for scaling law experiments. (A hedged sketch of this optimizer setup follows the table.) |
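
To make the reported setup concrete, below is a minimal PyTorch sketch of the pieces the Experiment Setup row describes: AdamW with per-parameter-group learning rates, weight decay tied to each group's learning rate (4 × 10⁻⁵ · η), and a linear warmup over 2500 steps followed by linear decay to zero. The toy model, the grouping scheme, and the constant per-group learning rate are assumptions for illustration; the released Plaid code and the µTransfer-derived per-parameter scaling are not reproduced here.

```python
import torch

BASE_LR = 1.4e-3       # learning rate quoted at width 256
WARMUP_STEPS = 2500    # linear warmup length from the paper
TOTAL_STEPS = 92_000   # reference-model training length from the paper

# Hypothetical stand-in model; the paper's reference model is a
# 16-layer, width-384 Transformer with 28M non-embedding parameters.
model = torch.nn.Sequential(
    torch.nn.Linear(256, 256),
    torch.nn.Linear(256, 256),
)

# One parameter group per parameter so each can carry its own learning rate
# and weight decay. Here every group just uses BASE_LR; µTransfer would
# rescale the learning rate per parameter relative to the width-256 base.
param_groups = [
    {"params": [p], "lr": BASE_LR, "weight_decay": 4e-5 * BASE_LR}
    for p in model.parameters()
]
optimizer = torch.optim.AdamW(param_groups)

def schedule(step: int) -> float:
    """Linear warmup for WARMUP_STEPS, then linear decay to zero."""
    if step < WARMUP_STEPS:
        return step / WARMUP_STEPS
    return max(0.0, (TOTAL_STEPS - step) / (TOTAL_STEPS - WARMUP_STEPS))

# Note: LambdaLR scales only the learning rate. The paper also warms up and
# decays weight decay, which would require updating each group's
# "weight_decay" alongside the learning rate inside the training loop.
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=schedule)
```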