Likelihood-Based Diffusion Language Models

Authors: Ishaan Gulrajani, Tatsunori B. Hashimoto

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Using our methods and scaling analysis, we train and release Plaid 1B, a large diffusion language model which outperforms GPT-2 124M in likelihood on benchmark datasets and generates fluent samples in unconditional and zero-shot control settings. In this section, we validate different aspects of the Plaid framework through compute-matched ablation experiments.
Researcher Affiliation | Academia | Ishaan Gulrajani (Stanford University, igul222@gmail.com); Tatsunori B. Hashimoto (Stanford University, thashim@stanford.edu)
Pseudocode | No | No explicit pseudocode or algorithm blocks were found in the paper.
Open Source Code | Yes | We release our code and pretrained models at https://github.com/igul222/plaid.
Open Datasets | Yes | We train and release Plaid 1B, a large diffusion language model pretrained on OpenWebText2 [7].
Dataset Splits | No | The paper mentions training on OpenWebText2 and reporting results on held-out data, and evaluating benchmark datasets via zero-shot likelihood over non-overlapping 1024-token sequences. However, it does not provide specific percentages, sample counts, or explicit train/validation/test split definitions for its own models, nor how splits were derived for the benchmarks (a minimal sketch of this evaluation protocol appears after the table).
Hardware Specification | Yes | All of our small runs take less than 24 hours on a single A100. Training took 30 days on 8 A100s.
Software Dependencies | No | The paper mentions the use of Flash Attention and µTransfer, but no specific version numbers for software libraries or dependencies are provided.
Experiment Setup | Yes | Our reference model (full method) is a 16-layer, 384-dimensional Transformer with 28M non-embedding parameters, trained for 92K steps at batch size 256 and sequence length 256. We optimize all models using AdamW with parameter-specific learning rates derived by µTransfer [35], based on a learning rate of 1.4 × 10⁻³ at width 256. Each parameter's weight decay is set to 4 × 10⁻⁵ / η, where η is that parameter's learning rate. We use a linear warmup on the learning rate and weight decay over the first 2500 steps, followed by a linear decay to zero over training. We train at batch size 256 for algorithm ablations and 128 for scaling law experiments (a sketch of this optimizer setup appears after the table).
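
Two sketches below make the quoted protocol and setup concrete; both are illustrative reconstructions, not the authors' released code. The first corresponds to the Dataset Splits row: zero-shot likelihood computed over non-overlapping 1024-token sequences. The tokenized stream, the evaluation batch size, and the `sequence_nll` hook on the model are assumptions.

```python
import torch


def chunk_into_sequences(token_ids, seq_len=1024):
    """Split a tokenized stream into non-overlapping fixed-length sequences,
    dropping the trailing partial chunk (one common convention; the paper
    does not spell this detail out)."""
    n_full = len(token_ids) // seq_len
    return torch.tensor(token_ids[: n_full * seq_len]).view(n_full, seq_len)


@torch.no_grad()
def zero_shot_nll(model, token_ids, seq_len=1024, batch_size=8, device="cuda"):
    """Average negative log-likelihood in nats per token over non-overlapping
    `seq_len`-token sequences. `model.sequence_nll` is a hypothetical hook
    returning the summed NLL of each sequence (for Plaid this would be the
    variational bound; for GPT-2 the autoregressive cross-entropy)."""
    chunks = chunk_into_sequences(token_ids, seq_len).to(device)
    total_nll, total_tokens = 0.0, 0
    for batch in chunks.split(batch_size):
        total_nll += model.sequence_nll(batch).sum().item()
        total_tokens += batch.numel()
    return total_nll / total_tokens
```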
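
The second sketch corresponds to the Experiment Setup row: AdamW with µTransfer-derived per-parameter learning rates, weight decay tied to each parameter's learning rate, and a linear warmup followed by a linear decay to zero. The `mup_lr_for` helper is a placeholder for the actual µTransfer scaling used in the released code, the weight-decay rule follows the 4 × 10⁻⁵ / η reading quoted above, and the scheduler rescales learning rates only.

```python
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

BASE_LR = 1.4e-3      # µTransfer base learning rate at width 256 (from the paper)
BASE_WIDTH = 256
DECOUPLED_WD = 4e-5   # target per-step decay; stored as wd = 4e-5 / lr per parameter


def mup_lr_for(param, width, base_lr=BASE_LR, base_width=BASE_WIDTH):
    """Hypothetical µTransfer rule: scale matrix-shaped parameters' learning
    rates by base_width / width and leave vectors (biases, norms) at the base
    rate. The released code may use a more detailed parameterization."""
    return base_lr * base_width / width if param.ndim >= 2 else base_lr


def build_optimizer(model, width, total_steps, warmup_steps=2500):
    # One AdamW group per parameter so the learning rate and the tied weight
    # decay can both be set per tensor. Because AdamW multiplies decay by the
    # learning rate, wd = DECOUPLED_WD / lr keeps the effective per-step decay
    # at DECOUPLED_WD regardless of each parameter's learning rate.
    groups = []
    for _, p in model.named_parameters():
        lr = mup_lr_for(p, width)
        groups.append({"params": [p], "lr": lr, "weight_decay": DECOUPLED_WD / lr})
    opt = AdamW(groups)

    def schedule(step):
        # Linear warmup over the first `warmup_steps`, then linear decay to
        # zero over the rest of training (learning rate only; the paper also
        # warms up and decays weight decay).
        if step < warmup_steps:
            return step / warmup_steps
        return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

    return opt, LambdaLR(opt, lr_lambda=schedule)
```

Building one parameter group per tensor is simply the easiest way to attach both a µTransfer learning rate and its tied weight decay; the plaid repository may organize parameter groups differently.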