Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Offline RL for Natural Language Generation with Implicit Language Q Learning
Authors: Charlie Victor Snell, Ilya Kostrikov, Yi Su, Sherry Yang, Sergey Levine
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In addition to empirically validating ILQL, we present a detailed empirical analysis of situations where offline RL can be useful in natural language generation settings, demonstrating how it can be a more effective utility optimizer than prior approaches for end-to-end dialogue, and how it can effectively optimize high variance reward functions based on subjective judgement, such as whether to label a comment as toxic or not. |
| Researcher Affiliation | Academia | Charlie Snell, Ilya Kostrikov, Yi Su, Mengjiao Yang, Sergey Levine, UC Berkeley |
| Pseudocode | No | The paper describes its methods in narrative text and uses diagrams (e.g., Figure 3), but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code at https://sea-snell.github.io/ILQL_site/ |
| Open Datasets | Yes | We use the Visual Dialogue dataset (Das et al., 2016) to evaluate our algorithm's ability to optimize many different reward functions in complex dialogue settings. |
| Dataset Splits | No | We use early stopping: when the validation loss exceeds the training loss, we stop training. While a "validation loss" is mentioned, the paper does not specify the exact percentages or sample counts for training, validation, or test splits. |
| Hardware Specification | Yes | All evaluations were performed on a single T4 GPU. |
| Software Dependencies | No | We use GPT-2 small as the base model for all transformers in our experiments. The paper mentions software components like GPT-2, RoBERTa-base, and AdamW optimizer, but it does not specify their version numbers. |
| Experiment Setup | Yes | We use GPT-2 small as the base model for all transformers in our experiments. Our value function transformer has three MLP heads: two independently initialized and trained Q heads and one V head. Each head has two layers, with a hidden dimension twice that of the embedding dimension. Our target Q networks are Polyak-averaged with decay factor 0.005 for both the transformer and the Q function head. We use γ = 0.99 for all offline-RL experiments. All value function heads are two layer MLPs with hidden dimension twice that of the transformer's embedding dimension. Our MLPs used ReLU non-linearities and no dropout. We used the AdamW optimizer for all experiments, with a learning rate of 1e-4 on the Reddit and Visual Dialogue tasks and 1e-5 on the Wordle task. We used no weight decay in the training any of our models, and we used a dropout rate of 0.1 inside the transformer. We trained all Wordle models with a batch size of 1024, all Visual Dialogue models with a batch size of 64, and all Reddit models with a batch size of 32. We always truncate token sequences to length 1024, except on Reddit tasks, in which we truncate to length 512. |
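
The hyperparameters quoted in the Experiment Setup and Dataset Splits rows can be collected into a small configuration sketch. This is an illustration only, not the authors' code: the names `TaskConfig`, `TASKS`, `head_hidden_dim`, `polyak_update`, and `should_stop` are hypothetical, and the Polyak convention (the quoted 0.005 treated as the weight on the online parameters) is an assumption, since the quoted text does not spell out the update rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskConfig:
    """Per-task values quoted in the Experiment Setup row (names are hypothetical)."""
    learning_rate: float
    batch_size: int
    max_seq_len: int


# Settings shared across tasks, as quoted in the table.
GAMMA = 0.99               # discount factor for all offline-RL experiments
POLYAK_TAU = 0.005         # quoted "decay factor" for the target Q networks
TRANSFORMER_DROPOUT = 0.1  # dropout inside the transformer
WEIGHT_DECAY = 0.0         # no weight decay (AdamW optimizer)

TASKS = {
    "wordle": TaskConfig(learning_rate=1e-5, batch_size=1024, max_seq_len=1024),
    "visual_dialogue": TaskConfig(learning_rate=1e-4, batch_size=64, max_seq_len=1024),
    "reddit": TaskConfig(learning_rate=1e-4, batch_size=32, max_seq_len=512),
}


def head_hidden_dim(embed_dim: int) -> int:
    """Hidden width of each two-layer Q/V head: twice the embedding dimension."""
    return 2 * embed_dim


def polyak_update(target, online, tau=POLYAK_TAU):
    """Assumed convention: target <- (1 - tau) * target + tau * online."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target, online)]


def should_stop(train_loss: float, val_loss: float) -> bool:
    """Early-stopping rule as quoted: stop once validation loss exceeds training loss."""
    return val_loss > train_loss
```

With GPT-2 small's embedding dimension of 768, `head_hidden_dim(768)` gives a head width of 1536. The parameter lists passed to `polyak_update` stand in for flattened network weights; a real training loop would apply the same rule tensor by tensor.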