Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Non-Parallel Text Style Transfer with Self-Parallel Supervision

Authors: Ruibo Liu, Chongyang Gao, Chenyan Jia, Guangxuan Xu, Soroush Vosoughi

ICLR 2022 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 3 EXPERIMENTS
Researcher Affiliation | Academia | Dartmouth College; Northwestern University; University of Texas at Austin; University of California, Los Angeles
Pseudocode | Yes | Algorithm 1: Sentence Distillation for Political Stance Dataset
Open Source Code | Yes | Code for LaMer is available at https://github.com/DapangLiu/LaMer.
Open Datasets | Yes | Sentiment Transfer: We use the Yelp reviews dataset collected by Shen et al. (2017), which contains 250k negative sentences and 380k positive sentences, organized in non-parallel fashion. Formality Transfer: A more challenging TST task is to modify the formality of a given sentence. We use the GYAFC dataset (Rao & Tetreault, 2018), which contains formal and informal sentences from two domains.
Dataset Splits | Yes | Formality Transfer... which consists of about 52k training sentences, 5k development sentences, and 2.5k test sentences.
Hardware Specification | Yes | All of our experiments were run on a single RTX-2080 GPU, with batch size 4 and 2/3/2 epochs for LaMer in the above three TST tasks.
Software Dependencies | No | The paper mentions using pre-trained models like RoBERTa and BART, citing their original papers, but does not provide specific version numbers for software dependencies or libraries (e.g., PyTorch, TensorFlow, HuggingFace Transformers).
Experiment Setup | Yes | All of our experiments were run on a single RTX-2080 GPU, with batch size 4 and 2/3/2 epochs for LaMer in the above three TST tasks. We choose the REINFORCE algorithm (Williams, 1992) to optimize the current policy πθ. Empirically we set J^IL_safe to {0.8, 0.6, 0.4} for the three TST tasks (sentiment, formality, and political stance). α controls the weights assigned to d_Order and d_Exist; it was set by running repeated experiments, sweeping α from 0 to 1 in steps of 0.1 and picking the best-performing α with respect to GM: α = {0.4, 0.3, 0.1} for the three tasks. The filtering parameters p and k are hyperparameters that are crucial for the construction of roughly parallel datasets.
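The α sweep quoted in the Experiment Setup row can be sketched as a simple grid search. This is an illustrative reconstruction, not the authors' code: the names `combined_distance`, `grid_search_alpha`, and the `evaluate_gm` callback (standing in for the paper's GM metric) are hypothetical, as is the way d_Order and d_Exist are combined.

```python
import math

def combined_distance(alpha, d_order, d_exist):
    # Hypothetical weighted combination of the two alignment distances,
    # with alpha controlling their relative weight (exact form is an assumption).
    return alpha * d_order + (1 - alpha) * d_exist

def grid_search_alpha(evaluate_gm, step=0.1):
    """Sweep alpha from 0 to 1 in the given step size and return the
    (alpha, score) pair with the highest GM score reported by evaluate_gm."""
    best_alpha, best_gm = None, -math.inf
    for i in range(int(round(1 / step)) + 1):
        alpha = round(i * step, 10)  # avoid float drift (0.30000000000000004 etc.)
        gm = evaluate_gm(alpha)
        if gm > best_gm:
            best_alpha, best_gm = alpha, gm
    return best_alpha, best_gm
```

With a real training loop, `evaluate_gm` would train and score the model at a given α; here any callable that maps α to a score works, which is enough to reproduce the "pick the best-performing α with respect to GM" selection step.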