Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Probabilistic Inference in Language Models via Twisted Sequential Monte Carlo
Authors: Stephen Zhao, Rob Brekelmans, Alireza Makhzani, Roger Baker Grosse
ICML 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We now illustrate empirically how our framework can be used to evaluate inference through log Zσ bounds and KL divergences between the sampling and target distributions, providing meaningful quantitative comparison between various learning methods. We consider a range of tasks throughout this section, including toxic story generation (as an example of uncovering rare undesirable behavior), generating reviews with varied sentiment, and infilling. For the toxicity and infilling tasks, we consider the Tiny Stories model (Eldan & Li, 2023) as a small-scale model where the generation is coherent, and use the prompt "Once upon a time, there was a". For the toxicity task, we elicit responses judged to be toxic by the classifier from Corrêa (2023). For the sentiment task, we consider the GPT2-Medium model (Radford et al., 2019) and a classifier trained on Amazon reviews (Li, 2023). |
| Researcher Affiliation | Academia | 1University of Toronto 2Vector Institute. Correspondence to: {stephenzhao, makhzani, rgrosse} @cs.toronto.edu, EMAIL. |
| Pseudocode | Yes | Algorithm 1 (Twisted) SMC Sampling (q SMC) |
| Open Source Code | Yes | Our code is available at https://github.com/Silent-Zebra/twisted-smc-lm . |
| Open Datasets | Yes | For the toxicity and infilling tasks, we consider the Tiny Stories model (Eldan & Li, 2023)... For the toxicity task, we elicit responses judged to be toxic by the classifier from Corrêa (2023). For the sentiment task, we consider the GPT2-Medium model (Radford et al., 2019) and a classifier trained on Amazon reviews (Li, 2023). |
| Dataset Splits | No | The paper mentions batch sizes and training steps but does not provide specific train/validation/test dataset splits with percentages or counts for their experiments. |
| Hardware Specification | Yes | All of our experiments were run on a single GPU, usually on an NVIDIA A40 with 48G memory. |
| Software Dependencies | No | The paper mentions the use of Adam optimizer, Hugging Face TRL PPO Trainer, Optax (Flax), and Hugging Face models, but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | We use a batch size (number of SMC particles/samples) of 1000, with a learning rate of 0.0001, and train using CTL for a total of 5000 gradient updates. |
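For readers unfamiliar with the method behind Algorithm 1, the core loop of twisted SMC is sequential importance sampling with twist-corrected weights and occasional resampling. The sketch below is a minimal, paper-independent illustration, not the authors' released code: `propose` and `log_twist` are hypothetical callables standing in for the base language model's next-token sampler and a learned twist function ψ_t, and the incremental weight update assumes the base model itself is used as the proposal.

```python
import numpy as np

rng = np.random.default_rng(0)

def twisted_smc(propose, log_twist, n_particles, seq_len):
    """Minimal twisted SMC sketch (hypothetical interfaces).

    propose(tokens):   samples one next token per particle from the base model.
    log_twist(tokens): log psi_t(s_{1:t}), an estimate of future target value.
    Returns final particles and their (unnormalized) log weights.
    """
    tokens = np.zeros((n_particles, 0), dtype=int)
    log_w = np.zeros(n_particles)
    for t in range(seq_len):
        nxt = propose(tokens)                                 # extend each particle
        tokens = np.concatenate([tokens, nxt[:, None]], axis=1)
        # Incremental weight: log psi_t(s_{1:t}) - log psi_{t-1}(s_{1:t-1})
        prev = log_twist(tokens[:, :-1]) if t > 0 else 0.0
        log_w += log_twist(tokens) - prev
        # Resample (multinomial) when the effective sample size degenerates
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        ess = 1.0 / np.sum(w ** 2)
        if ess < n_particles / 2:
            idx = rng.choice(n_particles, size=n_particles, p=w)
            tokens, log_w = tokens[idx], np.zeros(n_particles)
    return tokens, log_w
```

In the paper's setup the twists are trained (e.g. with CTL, as in the row above) so that resampling concentrates particles on sequences the terminal target potential favors; the normalizing-constant bounds come from averaging these importance weights.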