Imitating Language via Scalable Inverse Reinforcement Learning
Authors: Markus Wulfmeier, Michael Bloesch, Nino Vieillard, Arun Ahuja, Jorg Bornschein, Sandy Huang, Artem Sokolov, Matt Barnes, Guillaume Desjardins, Alex Bewley, Sarah Bechtle, Jost Springenberg, Nikola Momchev, Olivier Bachem, Matthieu Geist, Martin Riedmiller
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments range from 250M to 3B parameter models for the encoder-decoder T5 [51] and decoder-only PaLM 2 [3] models. Throughout evaluation, we investigate task performance and diversity of model generations, illustrating clear benefits of inverse RL over behavior cloning for imitation. |
| Researcher Affiliation | Industry | Google DeepMind |
| Pseudocode | No | The paper includes mathematical derivations and descriptions of algorithms but no explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | NeurIPS Paper Checklist: Does the paper provide open access to the data and code... Answer: [No] Justification: We have been unable to provide public access to our code. |
| Open Datasets | Yes | We use the following datasets and subsets for ablation in the following sections: XSUM [43], GSM8k [14], TLDR [58], and WMT22 [34]. |
| Dataset Splits | Yes | In particular, here we can change the second expectation's state-action distribution to one induced by the expert policy, yielding: $\min_\pi J(\pi) = \min_r \mathbb{E}_{\mu_E}\left[f(r) + v^r - \gamma\,\mathbb{E}_{s'}\left[v^r\right]\right]$ (11). In Table 2, we report Spearman's rank correlation coefficient between accumulated rewards (over complete sampled trajectories for the full validation sets) for online IQLearn (α = 0.1) and task-specific metrics (see the correlation sketch below the table). |
| Hardware Specification | Yes | Our experiments with T5 models use TPU v3 infrastructure and run for between approximately 3 days and 2 weeks. Our experiments with PaLM 2 models use TPU v4 infrastructure and run for under 1 week. |
| Software Dependencies | No | The paper describes the use of an 'Adam optimizer' but does not specify version numbers for any software libraries, frameworks, or programming languages used. |
| Experiment Setup | Yes | Table 3: IQLearn and MLE hyperparameters. Learning rate (T5): 1e-4; Learning rate (PaLM 2): 1e-4; Warmup steps: 2000; Batch size T5 (base/large/xl): 32/32/16; Batch size PaLM 2: 16; Random seeds per experiment: 3 (see the config sketch below the table). |
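
The Dataset Splits row quotes the paper's rank-correlation analysis between accumulated IQLearn rewards and task-specific metrics (Table 2). As a rough illustration of that kind of check, here is a minimal Python sketch using `scipy.stats.spearmanr`; the variable names and the toy values are assumptions for illustration, not the authors' code or data.

```python
# Minimal sketch (not the authors' code): Spearman's rank correlation between
# accumulated per-trajectory rewards and a task-specific metric, as described
# for online IQLearn (alpha = 0.1) on the full validation sets.
from scipy.stats import spearmanr

# Hypothetical placeholder values: one entry per sampled validation trajectory.
accumulated_rewards = [12.3, 8.7, 15.1, 9.9, 14.2]   # sum of learned rewards per trajectory
task_metric_scores = [0.41, 0.28, 0.52, 0.33, 0.47]  # e.g. task metric score per trajectory

rho, p_value = spearmanr(accumulated_rewards, task_metric_scores)
print(f"Spearman's rho = {rho:.3f} (p = {p_value:.3f})")
```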
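
For the Experiment Setup row, the Table 3 hyperparameters can be read as a small configuration. The sketch below is only a hedged transcription into a Python dict: the layout and key names are chosen here rather than taken from the paper's codebase (which is not public); only the values come from the quoted table.

```python
# Hedged transcription of the paper's Table 3 (IQLearn and MLE hyperparameters).
# Key names and nesting are assumptions; only the numeric values come from the table.
HYPERPARAMETERS = {
    "learning_rate": {"t5": 1e-4, "palm2": 1e-4},
    "warmup_steps": 2000,
    "batch_size": {
        "t5": {"base": 32, "large": 32, "xl": 16},
        "palm2": 16,
    },
    "random_seeds_per_experiment": 3,
}
```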