Imitating Language via Scalable Inverse Reinforcement Learning

Authors: Markus Wulfmeier, Michael Bloesch, Nino Vieillard, Arun Ahuja, Jorg Bornschein, Sandy Huang, Artem Sokolov, Matt Barnes, Guillaume Desjardins, Alex Bewley, Sarah Bechtle, Jost Springenberg, Nikola Momchev, Olivier Bachem, Matthieu Geist, Martin Riedmiller

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments range from 250M to 3B parameter models for the encoder-decoder T5 [51] and decoder-only PaLM2 [3] models. Throughout evaluation, we investigate task performance and diversity of model generations, illustrating clear benefits of inverse RL over behavior cloning for imitation.
Researcher Affiliation | Industry | Google DeepMind
Pseudocode | No | The paper includes mathematical derivations and descriptions of algorithms but no explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | NeurIPS Paper Checklist: Does the paper provide open access to the data and code... Answer: [No] Justification: We have been unable to provide public access to our code.
Open Datasets | Yes | We use the following datasets and subsets for ablation in the following sections: XSUM [43], GSM8k [14], TLDR [58], and WMT22 [34].
Dataset Splits | Yes | In particular, here we can change the second expectation's state-action distribution to one induced by the expert policy, yielding: min_π J(π) = min_r E_{μ_E}[f(r) + v_r − γ E_{s'}[v_r]] (11). In Table 2, we report Spearman's rank correlation coefficient between accumulated rewards (over complete sampled trajectories for the full validation sets) for online IQLearn (α = 0.1) and task-specific metrics. (A hedged loss sketch for this objective follows the table.)
Hardware Specification | Yes | Our experiments with T5 models use TPU v3 infrastructure and run for between approximately 3 days and 2 weeks. Our experiments with PaLM2 models use TPU v4 infrastructure and run for under 1 week.
Software Dependencies | No | The paper describes the use of an 'Adam optimizer' but does not specify version numbers for any software libraries, frameworks, or programming languages used.
Experiment Setup | Yes | Table 3 (IQLearn and MLE hyperparameters): learning rate T5 1e-4; learning rate PaLM2 1e-4; warmup steps 2000; batch size T5 (base/large/xl) 32/32/16; batch size PaLM2 16; random seeds per experiment 3. (A hedged config sketch follows the table.)
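The Dataset Splits row quotes the paper's reformulated IQLearn objective (Eq. 11). Below is a minimal sketch of what a per-token, expert-data-only IQ-Learn-style loss could look like for a language model, assuming (not confirmed by the paper) that Q is identified with the model logits, the soft value is V(s) = α·logsumexp(Q(s,·)/α), and f is a χ²-style concave penalty. The function and variable names (`iq_learn_loss`, `expert_tokens`) and the zero terminal value are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of an offline IQ-Learn-style loss over expert tokens,
# loosely following Eq. (11) quoted above. All names and the concrete
# choice f(r) = r - r^2/4 (a chi^2-style penalty) are assumptions.
import torch


def iq_learn_loss(logits: torch.Tensor,
                  expert_tokens: torch.Tensor,
                  gamma: float = 1.0,
                  alpha: float = 0.1) -> torch.Tensor:
    """logits: [B, T, V] per-step action values Q(s_t, .);
    expert_tokens: [B, T] expert actions a_t (next-token ids)."""
    # Soft state value v(s_t) = alpha * logsumexp(Q(s_t, .) / alpha).
    v = alpha * torch.logsumexp(logits / alpha, dim=-1)               # [B, T]
    # Q(s_t, a_t) at the expert action.
    q = logits.gather(-1, expert_tokens.unsqueeze(-1)).squeeze(-1)    # [B, T]
    # Value of the successor state; zero after the final step (assumption).
    v_next = torch.cat([v[:, 1:], torch.zeros_like(v[:, :1])], dim=1)
    # Implied reward via the inverse soft Bellman operator:
    # r(s_t, a_t) = Q(s_t, a_t) - gamma * v(s_{t+1}).
    r = q - gamma * v_next
    # Concave penalty f(r); a chi^2-style variant used in IQ-Learn work.
    f_r = r - 0.25 * r.pow(2)
    # Both expectations under the expert distribution, as in the quoted
    # reformulation; negate the maximized objective to obtain a loss.
    return (v - gamma * v_next - f_r).mean()


# Toy usage on random data.
logits = torch.randn(2, 5, 32)
tokens = torch.randint(0, 32, (2, 5))
print(iq_learn_loss(logits, tokens).item())
```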
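For the Experiment Setup row, here is a hedged sketch that gathers the Table 3 hyperparameters into a runnable config. Only the numbers come from the table; the dictionary layout, field names, and concrete seed values are illustrative assumptions (the paper reports 3 seeds per experiment but not which ones).

```python
# Table 3 hyperparameters as one config. Field names and seed values are
# illustrative assumptions; only the numeric values come from the table.
CONFIGS = {
    "t5-base":  {"learning_rate": 1e-4, "warmup_steps": 2000, "batch_size": 32},
    "t5-large": {"learning_rate": 1e-4, "warmup_steps": 2000, "batch_size": 32},
    "t5-xl":    {"learning_rate": 1e-4, "warmup_steps": 2000, "batch_size": 16},
    "palm2":    {"learning_rate": 1e-4, "warmup_steps": 2000, "batch_size": 16},
}
SEEDS = (0, 1, 2)  # 3 random seeds per experiment; exact values unreported.

for model, cfg in CONFIGS.items():
    for seed in SEEDS:
        print(f"{model} seed={seed}: {cfg}")
```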