Successor Features for Efficient Multi-Subject Controlled Text Generation
Authors: Meng Cao, Mehdi Fatemi, Jackie Chi Kit Cheung, Samira Shabanian
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our method on two NLG tasks: sentiment control and detoxification. Through our evaluation, we demonstrate the effectiveness of our approach in steering the model away from undesired sentiment and in substantially reducing the generation of harmful content. Our method outperforms five baseline models in both tasks and is on par with the SOTA. |
| Researcher Affiliation | Collaboration | Meng Cao*1,2,3, Mehdi Fatemi*4, Jackie Chi Kit Cheung1,2,5, Samira Shabanian6. *Equal contribution. 1School of Computer Science, McGill University; 2Mila Québec AI Institute; 3Work was done during an internship at Microsoft Research; 4Wand X; 5Canada CIFAR AI Chair; 6Parts of this work were done during the author's affiliation with Microsoft Research. Correspondence to: Meng Cao <meng.cao@mail.mcgill.ca>. |
| Pseudocode | No | The paper describes algorithms (SARSA, Monte Carlo, N-step SARSA) but does not provide structured pseudocode or algorithm blocks. (A generic, illustrative SARSA-style update sketch is included below the table for orientation.) |
| Open Source Code | Yes | Our code is available at https://github.com/mcao516/SFGen |
| Open Datasets | Yes | Following the experimental setup of Liu et al. (2021); Lu et al. (2022), we use the same dataset that contains 100K naturally occurring prompts from the Open Web Text (OWT) Corpus (Gokaslan & Cohen, 2019) for the sentiment control experiment. [...] We use the REALTOXICITYPROMPTS (RTP) benchmark (Gehman et al., 2020) for our detoxification experiments. |
| Dataset Splits | No | We use 90% of the sentences as our training set and 10% as the evaluation set. (The paper does not explicitly distinguish a 'validation' split from a 'test' split; it mentions only an 'evaluation set', which is ambiguous for reproduction purposes.) |
| Hardware Specification | Yes | In order to evaluate the inference speed of our method relative to the baselines, we conducted measurements of the time required by each approach to generate 256 words using a single A100 GPU. |
| Software Dependencies | No | The paper mentions several software components, such as a Hugging Face sentiment analysis classifier, GPT-2 (small), GPT2-XL, the Perspective API, and the AdamW optimizer, but does not provide specific version numbers for these software dependencies, which are necessary for full reproducibility. |
| Experiment Setup | Yes | For decoding, we use top-k sampling with k = 50 as suggested in Cao et al. (2023). [...] In Appendix A.2, Table 8 lists specific hyperparameters: gamma = 1, epochs = 3, batch size = 6, warm-up steps = 500, polyak update lr = 0.1, lr = 3e-4, feature size = 64, E = 32, optimizer = AdamW, scheduler type = linear. (Illustrative sketches of the decoding and optimizer settings appear below this table.) |
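
For readers unfamiliar with the algorithm family named in the Pseudocode row, the following is a minimal, generic sketch of a one-step SARSA-style temporal-difference update for successor features. It is not the paper's algorithm (the paper provides no pseudocode); the tabular representation, the helper name `sarsa_sf_update`, and the default `alpha` are illustrative assumptions, and the default `gamma = 1.0` only mirrors the "gamma 1" entry quoted from Table 8.

```python
# Generic one-step SARSA-style TD update for successor features.
# This is an illustrative sketch, NOT the paper's method; names and defaults are assumptions.
import numpy as np

def sarsa_sf_update(psi, phi, s, a, s_next, a_next, alpha=0.1, gamma=1.0):
    """Update the successor-feature estimate for (s, a) toward phi(s, a) + gamma * psi(s', a').

    psi: dict mapping (state, action) -> current successor-feature estimate (np.ndarray)
    phi: dict mapping (state, action) -> observed feature vector (np.ndarray)
    """
    target = phi[(s, a)] + gamma * psi[(s_next, a_next)]
    psi[(s, a)] = psi[(s, a)] + alpha * (target - psi[(s, a)])
    return psi
```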
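
The decoding setting quoted in the Experiment Setup row (top-k sampling with k = 50) can be reproduced with off-the-shelf Hugging Face Transformers, as in the hedged sketch below. The base checkpoint (`gpt2`), the prompt, and the continuation length are placeholder choices, not values confirmed by the row above.

```python
# Minimal sketch of top-k sampling with k = 50; model, prompt, and length are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "The movie was"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    output_ids = model.generate(
        input_ids,
        do_sample=True,       # stochastic decoding
        top_k=50,             # top-k sampling with k = 50, as stated in the setup
        max_new_tokens=20,    # illustrative continuation length
        pad_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```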
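
Similarly, a minimal sketch of the quoted optimizer and scheduler settings (AdamW, lr = 3e-4, 500 warm-up steps, linear schedule). The model and the total number of training steps are placeholders; only the named hyperparameter values come from the quoted Table 8 entries.

```python
# Illustrative optimizer/scheduler setup; model and total step count are placeholders.
import torch
from transformers import AutoModelForCausalLM, get_linear_schedule_with_warmup

model = AutoModelForCausalLM.from_pretrained("gpt2")

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)   # AdamW, lr = 3e-4
num_training_steps = 10_000  # placeholder; depends on dataset size, epochs = 3, batch size = 6
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=500,                  # warm-up steps = 500
    num_training_steps=num_training_steps, # linear decay over the remaining steps
)
```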