Successor Features for Efficient Multi-Subject Controlled Text Generation
Authors: Meng Cao, Mehdi Fatemi, Jackie Chi Kit Cheung, Samira Shabanian
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our method on two NLG tasks: sentiment control and detoxification. Through our evaluation, we demonstrate the effectiveness of our approach in steering the model away from undesired sentiment and in substantially reducing the generation of harmful content. Our method outperforms five baseline models in both tasks and is on par with the SOTA. |
| Researcher Affiliation | Collaboration | Meng Cao*1,2,3, Mehdi Fatemi*4, Jackie Chi Kit Cheung1,2,5, Samira Shabanian6. *Equal contribution. 1School of Computer Science, McGill University; 2Mila Québec AI Institute; 3Work was done during an internship at Microsoft Research; 4Wand X; 5Canada CIFAR AI Chair; 6Parts of this work were done during the author's affiliation with Microsoft Research. Correspondence to: Meng Cao <meng.cao@mail.mcgill.ca>. |
| Pseudocode | No | The paper describes algorithms (SARSA, Monte Carlo, N-step SARSA) but does not provide structured pseudocode or algorithm blocks. (A generic, illustrative SARSA-style update sketch is included below the table for orientation.) |
| Open Source Code | Yes | Our code is available at https://github.com/mcao516/SFGen |
| Open Datasets | Yes | Following the experimental setup of Liu et al. (2021); Lu et al. (2022), we use the same dataset that contains 100K naturally occurring prompts from the Open Web Text (OWT) Corpus (Gokaslan & Cohen, 2019) for the sentiment control experiment. [...] We use the REALTOXICITYPROMPTS (RTP) benchmark (Gehman et al., 2020) for our detoxification experiments. |
| Dataset Splits | No | We use 90% of the sentences as our training set and 10% as the evaluation set. (The paper does not explicitly distinguish a 'validation' split from a 'test' split; it mentions only an 'evaluation set', which is ambiguous for reproduction purposes.) |
| Hardware Specification | Yes | In order to evaluate the inference speed of our method relative to the baselines, we conducted measurements of the time required by each approach to generate 256 words using a single A100 GPU. |
| Software Dependencies | No | The paper mentions several software components, such as a Hugging Face sentiment analysis classifier, GPT-2 (small), GPT2-XL, the Perspective API, and the AdamW optimizer, but does not provide specific version numbers for these software dependencies, which are necessary for full reproducibility. |
| Experiment Setup | Yes | For decoding, we use top-k sampling with k = 50 as suggested in Cao et al. (2023). [...] In Appendix A.2, Table 8 lists specific hyperparameters: gamma = 1, epochs = 3, batch size = 6, warm-up steps = 500, polyak update lr = 0.1, lr = 3e-4, feature size = 64, E = 32, optimizer = AdamW, scheduler type = linear. (Illustrative sketches of the decoding and optimizer settings appear below this table.) |
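
For readers unfamiliar with the algorithm family named in the Pseudocode row, the following is a minimal, generic sketch of a one-step SARSA-style temporal-difference update for successor features. It is not the paper's algorithm (the paper provides no pseudocode); the tabular representation, the helper name `sarsa_sf_update`, and the default `alpha` are illustrative assumptions, and the default `gamma = 1.0` only mirrors the "gamma 1" entry quoted from Table 8.

```python
# Generic one-step SARSA-style TD update for successor features.
# This is an illustrative sketch, NOT the paper's method; names and defaults are assumptions.
import numpy as np

def sarsa_sf_update(psi, phi, s, a, s_next, a_next, alpha=0.1, gamma=1.0):
    """Update the successor-feature estimate for (s, a) toward phi(s, a) + gamma * psi(s', a').

    psi: dict mapping (state, action) -> current successor-feature estimate (np.ndarray)
    phi: dict mapping (state, action) -> observed feature vector (np.ndarray)
    """
    target = phi[(s, a)] + gamma * psi[(s_next, a_next)]
    psi[(s, a)] = psi[(s, a)] + alpha * (target - psi[(s, a)])
    return psi
```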
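
The decoding setting quoted in the Experiment Setup row (top-k sampling with k = 50) can be reproduced with off-the-shelf Hugging Face Transformers, as in the hedged sketch below. The base checkpoint (`gpt2`), the prompt, and the continuation length are placeholder choices, not values confirmed by the row above.

```python
# Minimal sketch of top-k sampling with k = 50; model, prompt, and length are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "The movie was"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    output_ids = model.generate(
        input_ids,
        do_sample=True,       # stochastic decoding
        top_k=50,             # top-k sampling with k = 50, as stated in the setup
        max_new_tokens=20,    # illustrative continuation length
        pad_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```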
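
Similarly, a minimal sketch of the quoted optimizer and scheduler settings (AdamW, lr = 3e-4, 500 warm-up steps, linear schedule). The model and the total number of training steps are placeholders; only the named hyperparameter values come from the quoted Table 8 entries.

```python
# Illustrative optimizer/scheduler setup; model and total step count are placeholders.
import torch
from transformers import AutoModelForCausalLM, get_linear_schedule_with_warmup

model = AutoModelForCausalLM.from_pretrained("gpt2")

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)   # AdamW, lr = 3e-4
num_training_steps = 10_000  # placeholder; depends on dataset size, epochs = 3, batch size = 6
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=500,                  # warm-up steps = 500
    num_training_steps=num_training_steps, # linear decay over the remaining steps
)
```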