Likelihood-Based Diffusion Language Models

Authors: Ishaan Gulrajani, Tatsunori B. Hashimoto

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Using our methods and scaling analysis, we train and release Plaid 1B, a large diffusion language model which outperforms GPT-2 124M in likelihood on benchmark datasets and generates fluent samples in unconditional and zero-shot control settings. In this section, we validate different aspects of the Plaid framework through compute-matched ablation experiments.
Researcher Affiliation | Academia | Ishaan Gulrajani (Stanford University, igul222@gmail.com); Tatsunori B. Hashimoto (Stanford University, thashim@stanford.edu)
Pseudocode | No | No explicit pseudocode or algorithm blocks were found in the paper.
Open Source Code | Yes | We release our code and pretrained models at https://github.com/igul222/plaid.
Open Datasets | Yes | We train and release Plaid 1B, a large diffusion language model pretrained on OpenWebText2 [7].
Dataset Splits | No | The paper mentions training on OpenWebText2 and reporting results on held-out data, and evaluating benchmark datasets via zero-shot likelihood over non-overlapping 1024-token sequences. However, it does not provide specific percentages, sample counts, or explicit train/validation/test split definitions for its own models, nor how splits were derived for the benchmarks (a minimal sketch of this evaluation protocol appears after the table).
Hardware Specification | Yes | All of our small runs take less than 24 hours on a single A100. Training took 30 days on 8 A100s.
Software Dependencies | No | The paper mentions the use of Flash Attention and µTransfer, but no specific version numbers for software libraries or dependencies are provided.
Experiment Setup | Yes | Our reference model (full method) is a 16-layer, 384-dimensional Transformer with 28M non-embedding parameters, trained for 92K steps at batch size 256 and sequence length 256. We optimize all models using AdamW with parameter-specific learning rates derived by µTransfer [35], based on a learning rate of 1.4 × 10⁻³ at width 256. Each parameter's weight decay is set to 4 × 10⁻⁵ / η, where η is that parameter's learning rate. We use a linear warmup on the learning rate and weight decay over the first 2500 steps, followed by a linear decay to zero over training. We train at batch size 256 for algorithm ablations and 128 for scaling law experiments (a sketch of this optimizer setup appears after the table).
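
Two sketches below make the quoted protocol and setup concrete; both are illustrative reconstructions, not the authors' released code. The first corresponds to the Dataset Splits row: zero-shot likelihood computed over non-overlapping 1024-token sequences. The tokenized stream, the evaluation batch size, and the `sequence_nll` hook on the model are assumptions.

```python
import torch


def chunk_into_sequences(token_ids, seq_len=1024):
    """Split a tokenized stream into non-overlapping fixed-length sequences,
    dropping the trailing partial chunk (one common convention; the paper
    does not spell this detail out)."""
    n_full = len(token_ids) // seq_len
    return torch.tensor(token_ids[: n_full * seq_len]).view(n_full, seq_len)


@torch.no_grad()
def zero_shot_nll(model, token_ids, seq_len=1024, batch_size=8, device="cuda"):
    """Average negative log-likelihood in nats per token over non-overlapping
    `seq_len`-token sequences. `model.sequence_nll` is a hypothetical hook
    returning the summed NLL of each sequence (for Plaid this would be the
    variational bound; for GPT-2 the autoregressive cross-entropy)."""
    chunks = chunk_into_sequences(token_ids, seq_len).to(device)
    total_nll, total_tokens = 0.0, 0
    for batch in chunks.split(batch_size):
        total_nll += model.sequence_nll(batch).sum().item()
        total_tokens += batch.numel()
    return total_nll / total_tokens
```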
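
The second sketch corresponds to the Experiment Setup row: AdamW with µTransfer-derived per-parameter learning rates, weight decay tied to each parameter's learning rate, and a linear warmup followed by a linear decay to zero. The `mup_lr_for` helper is a placeholder for the actual µTransfer scaling used in the released code, the weight-decay rule follows the 4 × 10⁻⁵ / η reading quoted above, and the scheduler rescales learning rates only.

```python
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

BASE_LR = 1.4e-3      # µTransfer base learning rate at width 256 (from the paper)
BASE_WIDTH = 256
DECOUPLED_WD = 4e-5   # target per-step decay; stored as wd = 4e-5 / lr per parameter


def mup_lr_for(param, width, base_lr=BASE_LR, base_width=BASE_WIDTH):
    """Hypothetical µTransfer rule: scale matrix-shaped parameters' learning
    rates by base_width / width and leave vectors (biases, norms) at the base
    rate. The released code may use a more detailed parameterization."""
    return base_lr * base_width / width if param.ndim >= 2 else base_lr


def build_optimizer(model, width, total_steps, warmup_steps=2500):
    # One AdamW group per parameter so the learning rate and the tied weight
    # decay can both be set per tensor. Because AdamW multiplies decay by the
    # learning rate, wd = DECOUPLED_WD / lr keeps the effective per-step decay
    # at DECOUPLED_WD regardless of each parameter's learning rate.
    groups = []
    for _, p in model.named_parameters():
        lr = mup_lr_for(p, width)
        groups.append({"params": [p], "lr": lr, "weight_decay": DECOUPLED_WD / lr})
    opt = AdamW(groups)

    def schedule(step):
        # Linear warmup over the first `warmup_steps`, then linear decay to
        # zero over the rest of training (learning rate only; the paper also
        # warms up and decays weight decay).
        if step < warmup_steps:
            return step / warmup_steps
        return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

    return opt, LambdaLR(opt, lr_lambda=schedule)
```

Building one parameter group per tensor is simply the easiest way to attach both a µTransfer learning rate and its tied weight decay; the plaid repository may organize parameter groups differently.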