Training Chain-of-Thought via Latent-Variable Inference
Authors: Du Phan, Matthew Douglas Hoffman, David Dohan, Sholto Douglas, Tuan Anh Le, Aaron Parisi, Pavel Sountsov, Charles Sutton, Sharad Vikram, Rif A. Saurous
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Applying our technique to GSM8K and the tasks in BIG-Bench Hard, we find that this MCMC-EM fine-tuning technique typically improves the model's accuracy on held-out examples more than STaR or prompt-tuning with or without CoT. |
| Researcher Affiliation | Industry | Du Phan, Matthew D. Hoffman, David Dohan, Sholto Douglas, Tuan Anh Le, Aaron Parisi, Pavel Sountsov, Charles Sutton, Sharad Vikram, Rif A. Saurous (Google) |
| Pseudocode | Yes | Algorithm 1 outlines the method. (A simplified sketch of the training loop appears below the table.) |
| Open Source Code | Yes | A notebook with a reference implementation can be found at https://github.com/google-research/cascades/tree/main/cascades/examples/notebooks/trice.ipynb. |
| Open Datasets | Yes | We evaluate TRICE on the GSM8K (Cobbe et al., 2021) dataset and the 27 BIG-Bench Hard (BBH) tasks (Suzgun et al., 2022b). |
| Dataset Splits | Yes | On each BBH task, we split the examples into 60% train and 40% test sets. For GSM8K, we use the standard 7473-example training set and 1319-example test set. All methods are evaluated against the same validation sets. (A minimal split sketch appears below the table.) |
| Hardware Specification | Yes | All experiments were run on TPU v4 and v5e chips (Jouppi et al., 2023). |
| Software Dependencies | No | The paper mentions specific models (PaLM 2, Flan) and optimizers (Adam) but does not provide version numbers for any software libraries or frameworks used. |
| Experiment Setup | Yes | For all BBH tasks, we run TRICE for 500 steps with batch size M = 32 and do not use subsampling (i.e., compute L = 64 gradients per batch). We use the Adam optimizer (Kingma & Ba, 2015) with an initial learning rate of 1.0 and a cosine decay schedule (Loshchilov & Hutter, 2017) that reduces the learning rate by 10x over the first 450 steps. (A schedule sketch appears below the table.) |
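
The Pseudocode row refers to the paper's Algorithm 1. The following is only a rough, greatly simplified Python sketch of the core MCMC-EM idea, not the authors' reference implementation (see the linked trice.ipynb notebook for that): propose a fresh chain-of-thought rationale per question, accept it into a per-example memory if it yields the correct answer, and take a gradient step on the log-likelihood of the currently accepted rationales. The helpers `sample_rationale`, `rationale_is_correct`, and `log_prob_grad_step`, and the `example` record with `.id`, `.question`, and `.answer` fields, are all hypothetical stand-ins.

```python
def trice_step(model, batch, memory):
    """One greatly simplified MCMC-EM step in the spirit of TRICE.

    `memory` maps example ids to the currently accepted rationale.
    All model-facing helpers below are hypothetical stand-ins.
    """
    accepted = []
    for example in batch:
        # E-step flavor: propose a fresh chain-of-thought for the question.
        proposal = sample_rationale(model, example.question)
        # Independence-sampler accept rule: keep the proposal only if its
        # final answer matches the ground truth; otherwise retain the old
        # rationale for this example.
        if rationale_is_correct(proposal, example.answer):
            memory[example.id] = proposal
        if example.id in memory:
            accepted.append((example.question, memory[example.id]))
    # M-step flavor: one gradient step on log p(rationale | question)
    # over the currently accepted rationales.
    if accepted:
        log_prob_grad_step(model, accepted)
    return memory
```

The paper's Algorithm 1 includes further refinements (e.g., the gradient subsampling mentioned in the Experiment Setup row) that this sketch omits.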
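
The Dataset Splits row quotes a 60/40 per-task split for BBH. A minimal sketch of such a split follows; the paper does not specify the shuffling or seeding, so the `seed` parameter here is purely illustrative.

```python
import random

def split_bbh_task(examples, train_frac=0.6, seed=0):
    """Split one BBH task's examples into train/test (60/40 by default)."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    return shuffled[:n_train], shuffled[n_train:]
```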
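
The Experiment Setup row specifies Adam with an initial learning rate of 1.0 and a cosine decay that reduces the learning rate 10x over the first 450 of 500 steps. The paper does not name an optimizer library; assuming optax (a natural fit for the Google/JAX context), that schedule could look like this:

```python
import optax

# Cosine decay from 1.0 down to 0.1 (a 10x reduction) over 450 steps;
# `alpha` is the final-to-initial learning-rate ratio, and optax holds
# the final value for the remaining steps.
schedule = optax.cosine_decay_schedule(
    init_value=1.0,
    decay_steps=450,
    alpha=0.1,
)
optimizer = optax.adam(learning_rate=schedule)
```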