Imitating Language via Scalable Inverse Reinforcement Learning
Authors: Markus Wulfmeier, Michael Bloesch, Nino Vieillard, Arun Ahuja, Jorg Bornschein, Sandy Huang, Artem Sokolov, Matt Barnes, Guillaume Desjardins, Alex Bewley, Sarah Bechtle, Jost Springenberg, Nikola Momchev, Olivier Bachem, Matthieu Geist, Martin Riedmiller
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments range from 250M to 3B parameter models for the encoder-decoder T5 [51] and decoder-only PaLM 2 [3] models. Throughout evaluation, we investigate task performance and diversity of model generations, illustrating clear benefits of inverse RL over behavior cloning for imitation. |
| Researcher Affiliation | Industry | Google DeepMind |
| Pseudocode | No | The paper includes mathematical derivations and descriptions of algorithms but no explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | NeurIPS Paper Checklist: Does the paper provide open access to the data and code... Answer: [No] Justification: We have been unable to provide public access to our code. |
| Open Datasets | Yes | We use the following datasets and subsets for ablation in the following sections: XSUM [43], GSM8k [14], TLDR [58], and WMT22 [34]. |
| Dataset Splits | Yes | In particular, here we can change the second expectation's state-action distribution to one induced by the expert policy, yielding: $\min_\pi J(\pi) = \min_r \mathbb{E}_{\mu_E}\left[f(r) + v^r - \gamma\,\mathbb{E}_{s'}\left[v^r\right]\right]$ (11). In Table 2, we report Spearman's rank correlation coefficient between accumulated rewards (over complete sampled trajectories for the full validation sets) for online IQLearn (α = 0.1) and task-specific metrics (see the correlation sketch below the table). |
| Hardware Specification | Yes | Our experiments with T5 models use TPU v3 infrastructure and run for between approximately 3 days and 2 weeks. Our experiments with PaLM 2 models use TPU v4 infrastructure and run for under 1 week. |
| Software Dependencies | No | The paper describes the use of an 'Adam optimizer' but does not specify version numbers for any software libraries, frameworks, or programming languages used. |
| Experiment Setup | Yes | Table 3: IQLearn and MLE hyperparameters. Learning rate (T5): 1e-4; Learning rate (PaLM 2): 1e-4; Warmup steps: 2000; Batch size T5 (base/large/xl): 32/32/16; Batch size PaLM 2: 16; Random seeds per experiment: 3 (see the config sketch below the table). |
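
The Dataset Splits row quotes the paper's rank-correlation analysis between accumulated IQLearn rewards and task-specific metrics (Table 2). As a rough illustration of that kind of check, here is a minimal Python sketch using `scipy.stats.spearmanr`; the variable names and the toy values are assumptions for illustration, not the authors' code or data.

```python
# Minimal sketch (not the authors' code): Spearman's rank correlation between
# accumulated per-trajectory rewards and a task-specific metric, as described
# for online IQLearn (alpha = 0.1) on the full validation sets.
from scipy.stats import spearmanr

# Hypothetical placeholder values: one entry per sampled validation trajectory.
accumulated_rewards = [12.3, 8.7, 15.1, 9.9, 14.2]   # sum of learned rewards per trajectory
task_metric_scores = [0.41, 0.28, 0.52, 0.33, 0.47]  # e.g. task metric score per trajectory

rho, p_value = spearmanr(accumulated_rewards, task_metric_scores)
print(f"Spearman's rho = {rho:.3f} (p = {p_value:.3f})")
```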
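
For the Experiment Setup row, the Table 3 hyperparameters can be read as a small configuration. The sketch below is only a hedged transcription into a Python dict: the layout and key names are chosen here rather than taken from the paper's codebase (which is not public); only the values come from the quoted table.

```python
# Hedged transcription of the paper's Table 3 (IQLearn and MLE hyperparameters).
# Key names and nesting are assumptions; only the numeric values come from the table.
HYPERPARAMETERS = {
    "learning_rate": {"t5": 1e-4, "palm2": 1e-4},
    "warmup_steps": 2000,
    "batch_size": {
        "t5": {"base": 32, "large": 32, "xl": 16},
        "palm2": 16,
    },
    "random_seeds_per_experiment": 3,
}
```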