Offline RL for Natural Language Generation with Implicit Language Q Learning

Authors: Charlie Victor Snell, Ilya Kostrikov, Yi Su, Sherry Yang, Sergey Levine

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In addition to empirically validating ILQL, we present a detailed empirical analysis of situations where offline RL can be useful in natural language generation settings, demonstrating how it can be a more effective utility optimizer than prior approaches for end-to-end dialogue, and how it can effectively optimize high variance reward functions based on subjective judgement, such as whether to label a comment as toxic or not."
Researcher Affiliation | Academia | "Charlie Snell, Ilya Kostrikov, Yi Su, Mengjiao Yang, Sergey Levine; UC Berkeley; {csnell22,kostrikov,suyi,sherryy,svlevine}@berkeley.edu"
Pseudocode | No | The paper describes its methods in narrative text and diagrams (e.g., Figure 3), but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | "Code at https://sea-snell.github.io/ILQL_site/"
Open Datasets | Yes | "We use the Visual Dialogue dataset (Das et al., 2016) to evaluate our algorithm's ability to optimize many different reward functions in complex dialogue settings."
Dataset Splits | No | "We use early stopping: when the validation loss exceeds the training loss, we stop training." While a validation loss is mentioned, the paper does not specify the exact percentages or sample counts for the training, validation, or test splits. (A minimal sketch of this stopping rule appears after the table.)
Hardware Specification | Yes | "All evaluations were performed on a single T4 GPU."
Software Dependencies | No | "We use GPT-2 small as the base model for all transformers in our experiments." The paper mentions software components such as GPT-2, RoBERTa-base, and the AdamW optimizer, but it does not specify their version numbers.
Experiment Setup | Yes | "We use GPT-2 small as the base model for all transformers in our experiments. Our value function transformer has three MLP heads: two independently initialized and trained Q heads and one V head. Each head has two layers, with a hidden dimension twice that of the embedding dimension. Our target Q networks are Polyak-averaged with decay factor 0.005 for both the transformer and the Q function head. We use γ = 0.99 for all offline-RL experiments. All value function heads are two layer MLPs with hidden dimension twice that of the transformer's embedding dimension. Our MLPs used ReLU non-linearities and no dropout. We used the AdamW optimizer for all experiments, with a learning rate of 1e-4 on the Reddit and Visual Dialogue tasks and 1e-5 on the Wordle task. We used no weight decay in the training of any of our models, and we used a dropout rate of 0.1 inside the transformer. We trained all Wordle models with a batch size of 1024, all Visual Dialogue models with a batch size of 64, and all Reddit models with a batch size of 32. We always truncate token sequences to length 1024, except on Reddit tasks, in which we truncate to length 512."
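
To make the quoted setup concrete, the sketch below shows one way the described value-function architecture could be assembled in PyTorch: GPT-2 small as the base transformer, two independently initialized Q heads and one V head (each a two-layer MLP with hidden dimension twice the embedding dimension, ReLU, no dropout), a Polyak-averaged target copy with decay 0.005, and the quoted AdamW settings. This is a minimal sketch, not the authors' released implementation; the class and function names, the per-token Q-value output shape, and other unstated details are assumptions.

```python
# Minimal sketch (not the authors' code) of the value-function setup quoted in
# the Experiment Setup row. Names and output shapes are assumptions.
import copy
import torch
import torch.nn as nn
from transformers import GPT2Model


class ILQLValueModel(nn.Module):
    def __init__(self, model_name: str = "gpt2"):  # GPT-2 small
        super().__init__()
        # GPT-2 uses a 0.1 dropout rate inside the transformer by default.
        self.transformer = GPT2Model.from_pretrained(model_name)
        d = self.transformer.config.n_embd          # embedding dim (768 for GPT-2 small)
        vocab = self.transformer.config.vocab_size

        def mlp_head(out_dim: int) -> nn.Module:
            # Two-layer MLP head: hidden dim = 2 * embedding dim, ReLU, no dropout.
            return nn.Sequential(nn.Linear(d, 2 * d), nn.ReLU(), nn.Linear(2 * d, out_dim))

        self.q1_head = mlp_head(vocab)  # first Q head (per-token Q-values; assumed shape)
        self.q2_head = mlp_head(vocab)  # second, independently initialized Q head
        self.v_head = mlp_head(1)       # state-value head

    def forward(self, input_ids, attention_mask=None):
        h = self.transformer(input_ids, attention_mask=attention_mask).last_hidden_state
        return self.q1_head(h), self.q2_head(h), self.v_head(h).squeeze(-1)


def polyak_update(target: nn.Module, online: nn.Module, tau: float = 0.005):
    # Target-network update with decay 0.005, applied to both the transformer and
    # the Q heads: theta_target <- (1 - tau) * theta_target + tau * theta_online.
    with torch.no_grad():
        for p_t, p in zip(target.parameters(), online.parameters()):
            p_t.mul_(1.0 - tau).add_(p, alpha=tau)


model = ILQLValueModel("gpt2")
target_model = copy.deepcopy(model)
# Quoted optimizer settings: AdamW, lr 1e-4 (Reddit / Visual Dialogue; 1e-5 for Wordle),
# no weight decay; gamma = 0.99 for the offline-RL objective.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.0)
gamma = 0.99
```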
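
Separately, the Dataset Splits row above quotes the paper's early-stopping criterion. A literal reading of that rule, sketched with illustrative names that are not from the paper, would be:

```python
# Sketch of the quoted early-stopping rule: training halts once the
# validation loss exceeds the training loss. Names are illustrative only.
def should_stop_early(train_loss: float, val_loss: float) -> bool:
    return val_loss > train_loss
```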