Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Non-Parallel Text Style Transfer with Self-Parallel Supervision
Authors: Ruibo Liu, Chongyang Gao, Chenyan Jia, Guangxuan Xu, Soroush Vosoughi
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 3 EXPERIMENTS |
| Researcher Affiliation | Academia | Dartmouth College; Northwestern University; University of Texas, Austin; University of California, Los Angeles |
| Pseudocode | Yes | Algorithm 1: Sentence Distillation for Political Stance Dataset |
| Open Source Code | Yes | Code for LaMer is available at https://github.com/DapangLiu/LaMer. |
| Open Datasets | Yes | Sentiment Transfer. We use the Yelp reviews dataset collected by Shen et al. (2017) which contains 250k negative sentences and 380k positive sentences, organized in non-parallel fashion. Formality Transfer. A more challenging TST task is to modify the formality of a given sentence. We use the GYAFC dataset (Rao & Tetreault, 2018), which contains formal and informal sentences from two domains. |
| Dataset Splits | Yes | Formality Transfer... which consists of about 52k training sentences, 5k development sentences, and 2.5k test sentences. |
| Hardware Specification | Yes | All of our experiments were run on a single RTX-2080 GPU, with batch size 4 and 2/3/2 epochs for LaMer in the above three TST tasks. |
| Software Dependencies | No | The paper mentions using pre-trained models like RoBERTa and BART, citing their original papers, but does not provide specific version numbers for software dependencies or libraries (e.g., PyTorch, TensorFlow, HuggingFace transformers version). |
| Experiment Setup | Yes | All of our experiments were run on a single RTX-2080 GPU, with batch size 4 and 2/3/2 epochs for LaMer in the above three TST tasks. We choose the REINFORCE algorithm (Williams, 1992) to optimize the current policy π_θ. Empirically we set J^safe_IL to {0.8, 0.6, 0.4} for the three TST tasks (sentiment, formality, and political stance). α controls the weights assigned to d_Order and d_Exist; set by running repeated experiments ranging the α from 0 to 1 by 0.1, and picking the best-performing α with respect to GM: α = {0.4, 0.3, 0.1} for the three tasks. The filtering parameter p and k are hyperparameters that are crucial for the construction of roughly parallel datasets. |
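
The α selection quoted in the Experiment Setup row (sweep α from 0 to 1 in steps of 0.1 and keep the value with the best GM) can be illustrated with a minimal sketch. This is not the authors' code. It assumes GM is the geometric mean of two evaluation scores, which the summary above does not define, and `sweep_alpha`, `geometric_mean`, and `toy_evaluate` are hypothetical names; `toy_evaluate` is a synthetic stand-in for training and evaluating a model at a given α.

```python
import math
from typing import Callable, Dict, Tuple


def geometric_mean(score_a: float, score_b: float) -> float:
    """Geometric mean of two non-negative scores (assumed meaning of GM)."""
    return math.sqrt(score_a * score_b)


def sweep_alpha(
    evaluate: Callable[[float], Tuple[float, float]],
    step: float = 0.1,
) -> Tuple[float, Dict[float, float]]:
    """Evaluate alpha = 0, step, ..., 1 and return the alpha with the highest GM."""
    n_steps = round(1.0 / step)
    gm_by_alpha: Dict[float, float] = {}
    for i in range(n_steps + 1):
        # Round to avoid floating-point drift such as 0.30000000000000004.
        alpha = round(i * step, 10)
        score_a, score_b = evaluate(alpha)
        gm_by_alpha[alpha] = geometric_mean(score_a, score_b)
    best = max(gm_by_alpha, key=gm_by_alpha.get)
    return best, gm_by_alpha


def toy_evaluate(alpha: float) -> Tuple[float, float]:
    """Synthetic stand-in for a real train-and-evaluate run; peaks at alpha = 0.4."""
    score = 1.0 - (alpha - 0.4) ** 2
    return score, score


best_alpha, scores = sweep_alpha(toy_evaluate)
print(f"best alpha: {best_alpha}, GM: {scores[best_alpha]:.3f}")
```

In a real run, `evaluate` would wrap a full training and evaluation cycle for one task, so the sweep costs eleven runs per task; the toy function here only demonstrates the selection logic.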