Offline Actor-Critic Reinforcement Learning Scales to Large Models

Authors: Jost Tobias Springenberg, Abbas Abdolmaleki, Jingwei Zhang, Oliver Groth, Michael Bloesch, Thomas Lampe, Philemon Brakel, Sarah Maria Elisabeth Bechtle, Steven Kapturowski, Roland Hafner, Nicolas Heess, Martin Riedmiller

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that offline actor-critic reinforcement learning can scale to large models such as transformers and follows similar scaling laws as supervised learning. We find that offline actor-critic algorithms can outperform strong, supervised, behavioral cloning baselines for multi-task training on a large dataset containing both sub-optimal and expert behavior on 132 continuous control tasks. We introduce a Perceiver-based actor-critic model and elucidate the key features needed to make offline RL work with self- and cross-attention modules. Overall, we find that: i) simple offline actor-critic algorithms are a natural choice for gradually moving away from the currently predominant paradigm of behavioral cloning, and ii) via offline RL it is possible to learn multi-task policies that master many domains simultaneously, including real robotics tasks, from sub-optimal demonstrations or self-generated data.
Researcher Affiliation | Industry | Google DeepMind. Correspondence to: Jost Tobias Springenberg <springenberg@google.com>.
Pseudocode | Yes | Algorithm 1: Perceiver Actor-Critic Model
Open Source Code | No | The paper provides a link to videos of their agent but does not explicitly state that the source code for their methodology is open-source, nor does it provide a link to a code repository.
Open Datasets | Yes | We use a large dataset throughout all experiments which combines tasks from three different sources: Gato data (Reed et al., 2022) consists of records of an RL agent solving 32 simulation tasks in Control Suite (Tunyasuvunakool et al., 2020). RoboCat data (Bousmalis et al., 2023) operates on the RGB Stacking benchmark (Lee et al., 2021) using RL in simulation to build pyramid and tower structures using a 7-DoF Panda robot. ... Lastly, CHEF (Lampe et al., 2023) data contains simulated and real-world records of a 5-DoF Sawyer robot stacking two objects in the RGB Stacking benchmark using an RL algorithm.
Dataset Splits | No | The paper evaluates model performance during training but does not explicitly define a separate validation split with percentages or sample counts. It refers only to 'evaluating six checkpoints of each training run' to derive 'return profiles'.
Hardware Specification | Yes | It is also worth noting that the L-sized version of PAC runs at 20 Hz on a local Nvidia RTX 3090 GPU during this real-robot experiment.
Software Dependencies | No | The paper mentions several software components, such as the 'SentencePiece tokenizer (Kudo & Richardson, 2018)', the 'AdamW optimizer (Loshchilov & Hutter, 2017)', and a 'ResNet (He et al., 2016)', but does not provide version numbers for the programming languages or core machine learning frameworks/libraries used for implementation (e.g., Python, TensorFlow, PyTorch).
Experiment Setup | Yes | For all large-scale experiments (cf. Sections 4.1 and 4.2) we use optimizer hyperparameters as reported in Table 12. Importantly, we use the AdamW optimizer (Loshchilov & Hutter, 2017) with a learning rate schedule which starts at lr_init, ramps up linearly to lr_peak, and is then cosine-annealed to lr_end over lr_decay_steps, which amounts to approximately one epoch in our data mix. ... We set lr_decay_steps according to the respective epoch lengths for each of our experiments, i.e. to 2.7e6 in the scaling experiments and to 4.7e6 in the pre-training experiments. For all PAC models, we keep the TD loss scale β constant at 38 while varying the BC vs. RL trade-off α between 1.0 for our BC+Q and Filtered BC baselines and 0.75 for the PAC model series.
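The quoted schedule (linear warmup to lr_peak, then cosine annealing to lr_end over lr_decay_steps, roughly one epoch of the data mix) can be written down compactly with optax. The sketch below is a minimal, hedged reconstruction: it assumes a JAX/optax implementation, which the paper does not confirm, and the lr_init, lr_peak, lr_end, and warmup-step values are hypothetical placeholders standing in for the paper's Table 12, which is not quoted in this report. Only lr_decay_steps, β = 38, and α = 0.75 come from the excerpt above.

```python
# Minimal sketch of the described optimizer setup (assumed JAX/optax stack).
import optax

LR_INIT = 1e-7              # hypothetical placeholder (actual value in the paper's Table 12)
LR_PEAK = 3e-4              # hypothetical placeholder
LR_END = 1e-6               # hypothetical placeholder
WARMUP_STEPS = 10_000       # hypothetical placeholder
LR_DECAY_STEPS = 2_700_000  # 2.7e6 for the scaling experiments (4.7e6 for pre-training)

# Linear warmup from LR_INIT to LR_PEAK, then cosine annealing to LR_END
# over LR_DECAY_STEPS, matching the schedule described in the excerpt.
schedule = optax.warmup_cosine_decay_schedule(
    init_value=LR_INIT,
    peak_value=LR_PEAK,
    warmup_steps=WARMUP_STEPS,
    decay_steps=LR_DECAY_STEPS,
    end_value=LR_END,
)

# AdamW (Loshchilov & Hutter, 2017) driven by the schedule above.
optimizer = optax.adamw(learning_rate=schedule)

# Loss weights quoted from the paper: TD loss scale beta = 38, and the
# BC-vs-RL trade-off alpha = 0.75 for PAC (alpha = 1.0 recovers the
# BC+Q and Filtered BC baselines).
ALPHA, BETA = 0.75, 38.0
```

A usage note: with a schedule passed to optax.adamw, the learning rate is advanced automatically with each optimizer update, so the only experiment-specific choice is LR_DECAY_STEPS, which the paper ties to one epoch of the respective data mix.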