Offline RL for Natural Language Generation with Implicit Language Q Learning
Authors: Charlie Victor Snell, Ilya Kostrikov, Yi Su, Sherry Yang, Sergey Levine
ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In addition to empirically validating ILQL, we present a detailed empirical analysis of situations where offline RL can be useful in natural language generation settings, demonstrating how it can be a more effective utility optimizer than prior approaches for end-to-end dialogue, and how it can effectively optimize high variance reward functions based on subjective judgement, such as whether to label a comment as toxic or not. |
| Researcher Affiliation | Academia | Charlie Snell, Ilya Kostrikov, Yi Su, Mengjiao Yang, Sergey Levine (UC Berkeley); {csnell22,kostrikov,suyi,sherryy,svlevine}@berkeley.edu |
| Pseudocode | No | The paper describes its methods in narrative text and uses diagrams (e.g., Figure 3), but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code at https://sea-snell.github.io/ILQL_site/ |
| Open Datasets | Yes | We use the Visual Dialogue dataset (Das et al., 2016) to evaluate our algorithm's ability to optimize many different reward functions in complex dialogue settings. |
| Dataset Splits | No | We use early stopping: when the validation loss exceeds the training loss, we stop training. While a "validation loss" is mentioned, the paper does not specify the exact percentages or sample counts for training, validation, or test splits. |
| Hardware Specification | Yes | All evaluations were performed on a single T4 GPU. |
| Software Dependencies | No | We use GPT-2 small as the base model for all transformers in our experiments. The paper mentions software components such as GPT-2, RoBERTa-base, and the AdamW optimizer, but it does not specify their version numbers. |
| Experiment Setup | Yes | We use GPT-2 small as the base model for all transformers in our experiments. Our value function transformer has three MLP heads: two independently initialized and trained Q heads and one V head. Each head has two layers, with a hidden dimension twice that of the embedding dimension. Our target Q networks are Polyak-averaged with decay factor 0.005 for both the transformer and the Q function head. We use γ = 0.99 for all offline-RL experiments. All value function heads are two-layer MLPs with hidden dimension twice that of the transformer's embedding dimension. Our MLPs used ReLU non-linearities and no dropout. We used the AdamW optimizer for all experiments, with a learning rate of 1e-4 on the Reddit and Visual Dialogue tasks and 1e-5 on the Wordle task. We used no weight decay in training any of our models, and we used a dropout rate of 0.1 inside the transformer. We trained all Wordle models with a batch size of 1024, all Visual Dialogue models with a batch size of 64, and all Reddit models with a batch size of 32. We always truncate token sequences to length 1024, except on Reddit tasks, in which we truncate to length 512. |
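
A minimal PyTorch sketch of the value-function setup quoted in the Experiment Setup row, assuming the Hugging Face `transformers` GPT-2 implementation, a standard `target = (1 - tau) * target + tau * online` Polyak convention, and that the Q heads output per-token Q-values over the vocabulary while the V head outputs a scalar; class and function names are illustrative, not taken from the released code.

```python
import copy

import torch
import torch.nn as nn
from transformers import GPT2Model


def mlp_head(embed_dim: int, out_dim: int) -> nn.Sequential:
    # Two-layer MLP head, hidden dim twice the embedding dim, ReLU, no dropout.
    return nn.Sequential(
        nn.Linear(embed_dim, 2 * embed_dim),
        nn.ReLU(),
        nn.Linear(2 * embed_dim, out_dim),
    )


class ILQLValueNetwork(nn.Module):
    """GPT-2 small backbone with two independently initialized Q heads and one V head."""

    def __init__(self):
        super().__init__()
        # GPT-2 small; its default internal dropout is 0.1, matching the quoted setup.
        self.backbone = GPT2Model.from_pretrained("gpt2")
        d = self.backbone.config.n_embd
        vocab = self.backbone.config.vocab_size
        self.q1_head = mlp_head(d, vocab)  # Q-values per candidate next token (assumption)
        self.q2_head = mlp_head(d, vocab)
        self.v_head = mlp_head(d, 1)       # scalar state value per position

    def forward(self, input_ids, attention_mask=None):
        h = self.backbone(input_ids, attention_mask=attention_mask).last_hidden_state
        return self.q1_head(h), self.q2_head(h), self.v_head(h)


def polyak_update(online: nn.Module, target: nn.Module, tau: float = 0.005) -> None:
    # Target network kept as a Polyak average of the online network (decay factor 0.005).
    with torch.no_grad():
        for p, p_t in zip(online.parameters(), target.parameters()):
            p_t.mul_(1.0 - tau).add_(tau * p)


online_net = ILQLValueNetwork()
target_net = copy.deepcopy(online_net)

# Optimizer settings quoted above: AdamW, lr 1e-4 (Reddit / Visual Dialogue), no weight decay.
optimizer = torch.optim.AdamW(online_net.parameters(), lr=1e-4, weight_decay=0.0)
gamma = 0.99  # discount factor for all offline-RL experiments
```

After each gradient step on `online_net`, calling `polyak_update(online_net, target_net)` slowly tracks the online parameters in the target network, which is the role the decay factor 0.005 plays in the quoted setup.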