Diffusion of Thought: Chain-of-Thought Reasoning in Diffusion Language Models
Authors: Jiacheng Ye, Shansan Gong, Liheng Chen, Lin Zheng, Jiahui Gao, Han Shi, Chuan Wu, Xin Jiang, Zhenguo Li, Wei Bi, Lingpeng Kong
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental results demonstrate the effectiveness of DoT in multi-digit multiplication, boolean logic, and grade school math problems, with a small diffusion model outperforming a much larger autoregressive model in both efficiency and accuracy. |
| Researcher Affiliation | Collaboration | Jiacheng Ye¹, Shansan Gong¹, Liheng Chen¹, Lin Zheng¹, Jiahui Gao², Han Shi², Chuan Wu¹, Xin Jiang², Zhenguo Li², Wei Bi³, Lingpeng Kong¹; ¹The University of Hong Kong, ²Huawei Noah's Ark Lab, ³Tencent AI Lab |
| Pseudocode | No | The paper describes methods using prose and mathematical equations, but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | We release all the codes at https://github.com/HKUNLP/diffusion-of-thoughts. |
| Open Datasets | Yes | Following Deng et al. [7], we employ the four-digit (4x4) and five-digit (5x5) multiplication problems from the BIG-bench benchmark [49]... we adopt the widely-used GSM8K dataset [6]. We use the augmented training data from Deng et al. [7] and keep all original test sets unchanged. The statistics are listed in Appendix B.1. ... For the digit multiplication datasets and GSM8K dataset, we use processed datasets from Implicit CoT [7]. For the boolean logic task, we construct the training and test dataset using the method from DyVal [68]. (Footnotes provide links: https://github.com/da03/implicit_chain_of_thought and https://github.com/microsoft/promptbench/blob/main/examples/dyval.ipynb) |
| Dataset Splits | No | The paper mentions 'augmented training data' and 'original test sets' and provides total test examples and training set sizes in Appendix B.1 (Table 4), but does not explicitly specify validation dataset splits (e.g., percentages or counts). |
| Hardware Specification | Yes | We conduct all the experiments on 8 NVIDIA V100-32G GPUs. |
| Software Dependencies | No | The paper mentions general software components like 'GPT2Tokenizer' and 'Adam optimizer', but does not provide specific version numbers for these or other key software dependencies (e.g., Python, PyTorch, TensorFlow, CUDA versions) to ensure reproducibility. |
| Experiment Setup | Yes | During training, we set ϵ_min to 0.95, as we find that decreasing the probability of the oracle demonstration hinders model training. We choose coupled sampling γ = 0.01, k = 1 and self-consistency m = 20. ... During inference, we set both the temperature of the score and the output logit to 0.5 to sharpen the predicted output distribution while maintaining the ability to generate diverse samples. The number of sampling timesteps T is dynamic; by default, we set it to 64. ... The learning rate is 1e-4 and we train for 60k steps with a batch size of 128 and a max sequence length of 128. For digit multiplication in Table 1, we use sampling step T = 1... For DoT fine-tuned from Plaid, we set the training steps of DoT and multi-pass DoT to be 120k and 30k respectively... The learning rate is set to 1e-4 for boolean logic and 3e-4 for other datasets. The max sequence length is set to 384 for the boolean logic dataset and 256 for others. (The reported hyperparameters are consolidated in the configuration sketch below.) |
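
To make the quoted training and inference settings easier to scan, the sketch below collects them into a single configuration object. This is a minimal illustration assuming a Python training script; the class name `DoTConfig`, its field names, and the variant objects are hypothetical and not taken from the released code. Only the numeric values come from the quoted experiment setup.

```python
from dataclasses import dataclass, replace

@dataclass
class DoTConfig:
    """Hypothetical container for the hyperparameters quoted above."""
    eps_min: float = 0.95           # minimum oracle-demonstration probability during training
    coupled_gamma: float = 0.01     # coupled sampling gamma
    coupled_k: int = 1              # coupled sampling k
    self_consistency_m: int = 20    # number of sampled chains for self-consistency voting
    temperature: float = 0.5        # applied to both score and output logits at inference
    sampling_timesteps: int = 64    # default T; T = 1 is reported for digit multiplication
    learning_rate: float = 1e-4
    train_steps: int = 60_000
    batch_size: int = 128
    max_seq_len: int = 128

# Variant reported for DoT fine-tuned from Plaid on the boolean logic task:
# 120k training steps (30k for multi-pass DoT), lr 1e-4, max sequence length 384.
plaid_boolean_logic = replace(
    DoTConfig(),
    train_steps=120_000,
    learning_rate=1e-4,
    max_seq_len=384,
)

# Other datasets in the Plaid setting are reported with lr 3e-4 and max sequence length 256.
plaid_other = replace(plaid_boolean_logic, learning_rate=3e-4, max_seq_len=256)
```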
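The setup also reports self-consistency with m = 20 samples. As a reminder of what that entails, here is a short majority-voting sketch; `sample_answer` is a hypothetical stand-in for one DoT sampling pass that returns the extracted final answer, since the actual decoding loop is model-specific and not reproduced here.

```python
from collections import Counter
from typing import Callable

def self_consistency_answer(sample_answer: Callable[[], str], m: int = 20) -> str:
    """Draw m reasoning chains and return the majority-voted final answer."""
    answers = [sample_answer() for _ in range(m)]
    return Counter(answers).most_common(1)[0][0]
```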