Extrapolative Controlled Sequence Generation via Iterative Refinement

Authors: Vishakh Padmakumar, Richard Yuanzhe Pang, He He, Ankur P Parikh

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Results on one natural language task (sentiment analysis) and two protein engineering tasks (ACE2 stability and AAV fitness) show that ICE considerably outperforms state-of-the-art approaches despite its simplicity. ... We evaluate our approach in both the natural language and protein domains. For text generation, we generate reviews with a sentiment either more positive or negative than seen in the training data. For protein engineering, we present results on two tasks generating mutations of the ACE2 protein that have higher stability measured by FoldX (Schymkowitz et al., 2005) and generating mutations of an adeno-associated virus (AAV) capsid protein (Bryant et al., 2021) with a higher fitness value. ICE achieves consistent extrapolation on these three tasks, outperforming both standard methods for controlled generation such as PPLM (Dathathri et al., 2020) and a state-of-the-art extrapolative controlled generation method, Genhance (Chan et al., 2021a).
Researcher Affiliation | Collaboration | Vishakh Padmakumar (1), Richard Yuanzhe Pang (1), He He (1), Ankur P. Parikh (2); (1) New York University, (2) Google DeepMind. Correspondence to: Vishakh Padmakumar <vishakh@nyu.edu>.
Pseudocode | No | The paper describes the Iterative Controlled Extrapolation (ICE) approach and its steps in prose, but it does not include a formal pseudocode block or algorithm listing.
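Since the procedure is given only in prose, the following is a schematic sketch of the scorer-guided refinement loop as it is described in the quoted excerpts (10 editing steps, 5 candidates per step, best candidate kept by the scorer f_s). The function names and the toy editor/scorer are placeholders, not the authors' released code.

```python
# Schematic reconstruction of the ICE scorer-guided loop from the prose description.
# `propose_edit` and `score` are placeholders; the paper uses a fine-tuned seq2seq
# editor and a learned attribute scorer f_s.
import random

def ice_scorer_guided(x0, propose_edit, score, n_steps=10, n_candidates=5):
    """At each step, sample several candidate edits and keep the one the scorer rates highest."""
    x = x0
    for _ in range(n_steps):              # the paper runs 10 iterative editing steps
        candidates = [propose_edit(x) for _ in range(n_candidates)]  # 5 samples per step
        x = max(candidates, key=score)    # select the best candidate with the scorer
    return x

# Toy stand-ins so the sketch runs end to end.
def toy_edit(s):
    i = random.randrange(len(s))
    return s[:i] + str(random.randrange(10)) + s[i + 1:]

def toy_score(s):
    return sum(int(c) for c in s)

random.seed(0)
print(ice_scorer_guided("00000", toy_edit, toy_score))  # drifts toward higher-scoring strings
```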
Open Source Code | Yes | Our code and models are available at https://github.com/vishakhpk/iter-extrapolation.
Open Datasets | Yes | We use the Yelp dataset for this task (Zhang et al., 2015)... This synthetic task was created in Chan et al. (2021a) and we replicate their setup... The AAV dataset (Bryant et al., 2021) aims to study the fitness landscape of an adeno-associated virus (AAV) capsid protein...
Dataset Splits | No | We use the Yelp dataset for this task (Zhang et al., 2015), which consists of 650K training examples and 50K test examples, evenly divided into sentiment scores from 1 to 5. ... We use the splits proposed by the FLIP benchmark (Dallago et al., 2021) for our experiments. Each mutant is a sequence of length varying from 734 to 750. Mutations are made on the wild-type sequence between indices 561 and 588. We use the provided low-vs-high split of the dataset to demarcate the training region and extrapolation region. The paper mentions training and test sets and references existing splits (e.g., the FLIP benchmark), but it does not give numerical details (percentages or counts) for a validation set across all experiments, nor does it describe a cross-validation setup.
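For concreteness, a minimal sketch of loading the text data via Hugging Face datasets, assuming the public yelp_review_full dataset corresponds to the 650K/50K split quoted above; the validation carve-out shown is an illustration, not something the paper specifies, and the AAV low-vs-high split is distributed separately by the FLIP benchmark.

```python
# Minimal sketch of the data loading implied by the quoted splits.
# Assumption: the Hugging Face "yelp_review_full" dataset matches the quoted
# 650K-train / 50K-test Yelp split with 1-5 star labels.
from datasets import load_dataset

yelp = load_dataset("yelp_review_full")
print(yelp["train"].num_rows, yelp["test"].num_rows)  # expected: 650000 50000

# No validation split is described in the paper; one would have to be held out
# from the training data, e.g.:
held_out = yelp["train"].train_test_split(test_size=0.05, seed=0)
train_set, val_set = held_out["train"], held_out["test"]

# The AAV experiments use the FLIP benchmark's low-vs-high split
# (Dallago et al., 2021), which is distributed as CSV files by that benchmark.
```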
Hardware Specification | No | The paper does not provide specific details about the hardware used for the experiments, such as GPU models, CPU types, or cloud computing instances.
Software Dependencies | No | The paper mentions software components and models such as the Hugging Face library (Wolf et al., 2020), RoBERTa-Large (Liu et al., 2019), BART-Large, T5-Base (Raffel et al., 2022), ProtBert (Elnaggar et al., 2021), and ProtT5-XL (Elnaggar et al., 2021). While these point to specific publications or projects, explicit version numbers for the libraries and their dependencies (e.g., PyTorch or Python versions) are not provided.
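All of the cited backbones are distributed through the Hugging Face hub, so a plausible (but unpinned) environment can be sketched as below; the checkpoint identifiers are the standard public releases and are assumptions, since the paper does not state exact versions.

```python
# Sketch of loading the cited backbones from the Hugging Face hub.
# The checkpoint names are the public releases; the paper does not pin
# transformers / PyTorch / Python versions, so treat these as assumptions.
from transformers import AutoModel, AutoTokenizer, T5EncoderModel, T5Tokenizer

roberta = AutoModel.from_pretrained("roberta-large")
bart = AutoModel.from_pretrained("facebook/bart-large")
t5 = AutoModel.from_pretrained("t5-base")

protbert_tok = AutoTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
protbert = AutoModel.from_pretrained("Rostlab/prot_bert")

prott5_tok = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_uniref50", do_lower_case=False)
prott5 = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_uniref50")
```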
Experiment Setup | Yes | For scorer-free inference, we use beam search with a beam size of 5. When performing scorer-guided inference, at each iteration, we generate 5 sequences using top-k sampling with k = 5 and a temperature of 0.7; we then select the best one using f_s. We run 10 steps of iterative editing for both methods. ... We filter the pairs created by setting the hyperparameter δ = 0.4 (Section 3.2). ... For ACE2, we fine-tune a ProtBert (Elnaggar et al., 2021) model... We sample token masks from a Bernoulli distribution with p = 0.8. To filter small perturbations, we set δ to 1.5. ... We again use the recommended hyperparameters from the Hugging Face repository and sweep learning rates from 1e-6 to 1e-3.
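To make the quoted decoding settings concrete, here is a hedged sketch of how they map onto transformers' generate() API; the editor checkpoint and the scorer score_fn are placeholders (the paper fine-tunes its own editor and scorer, which are not reproduced here), and max_new_tokens is an arbitrary illustrative cap.

```python
# Sketch of the quoted decoding settings with transformers' generate().
# "facebook/bart-large" stands in for the fine-tuned editor; score_fn stands in
# for the attribute scorer f_s.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large")
editor = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large")

def edit_scorer_free(text):
    """Scorer-free inference step: beam search with beam size 5."""
    inputs = tokenizer(text, return_tensors="pt")
    out = editor.generate(**inputs, num_beams=5, max_new_tokens=128)
    return tokenizer.decode(out[0], skip_special_tokens=True)

def edit_scorer_guided(text, score_fn):
    """Scorer-guided step: 5 samples with top-k = 5, temperature 0.7; keep the best by f_s."""
    inputs = tokenizer(text, return_tensors="pt")
    out = editor.generate(**inputs, do_sample=True, top_k=5, temperature=0.7,
                          num_return_sequences=5, max_new_tokens=128)
    candidates = [tokenizer.decode(seq, skip_special_tokens=True) for seq in out]
    return max(candidates, key=score_fn)
```

Either step would then be applied for the 10 iterative editing steps quoted above.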