Reinforcement Learning from Passive Data via Latent Intentions
Authors: Dibya Ghosh, Chethan Anand Bhateja, Sergey Levine
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate the ability to learn useful features for downstream RL from a number of passive data sources, including learning from videos of different embodiments in the XMagical benchmark, and learning from raw YouTube videos for the offline Atari benchmark. |
| Researcher Affiliation | Academia | Dibya Ghosh¹ Chethan Bhateja¹ Sergey Levine¹ — ¹UC Berkeley. Correspondence to: Dibya Ghosh <dibya@berkeley.edu>. |
| Pseudocode | Yes | Algorithm 1 Learning Intent-Conditioned Value Functions from Passive Data |
| Open Source Code | Yes | Accompanying code at https://github.com/dibyaghosh/icvf_release. |
| Open Datasets | Yes | To encompass the many forms of passive data that we may wish to pre-train on, we evaluate on passive data from the D4RL benchmark (Fu et al., 2020), videos of agents with different embodiments in the XMagical benchmark (Toyer et al., 2020; Zakka et al., 2021), and scraped YouTube videos of Atari 2600 games (Bellemare et al., 2013). ... During pre-training, the agent receives a passive dataset of state-observation sequences of the agent moving to different locations in a maze (1 × 10^6 frames), which we construct by stripping all annotations from the publicly available dataset. ... We build on the dataset of video released by Zakka et al. (2021) |
| Dataset Splits | No | The paper describes training and testing procedures and dataset sizes for these, but does not explicitly mention a separate 'validation' dataset split with specific percentages or counts. The term 'finetuning' is used for downstream tasks without specifying a validation split for hyperparameter tuning during that phase. |
| Hardware Specification | Yes | The research was supported by the TPU Research Cloud |
| Software Dependencies | No | The paper mentions specific algorithms and models used (e.g., IQL, CQL, QRDQN, Impala visual encoder) but does not provide specific version numbers for any software libraries, frameworks, or programming languages used for implementation. |
| Experiment Setup | Yes | We perform the coarse hyperparameter sweep α ∈ {1, 10, 100} for each representation and domain to choose α, since the scales of the downstream losses can vary between domains, and the scales of the representation loss can vary between different methods. We set the expectile parameter for our method to α = 0.9 for all tasks, and train using temporal difference learning with a lagging target network updated via Polyak averaging with rate λ = 0.005. ... Pre-training proceeds for 250k timesteps for all methods... running for 1M steps (see the sketch below the table). |
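
The Experiment Setup row quotes two training details: an expectile-based temporal-difference objective (expectile 0.9) and a target network updated by Polyak averaging with rate λ = 0.005. The snippet below is a minimal, illustrative sketch of those two details only, assuming a generic value function; it is not the authors' released implementation, which additionally conditions the value function on latent intents. Names such as `value_fn` and the `batch` keys are hypothetical placeholders.

```python
# Minimal sketch (not the authors' code) of an expectile TD loss and a
# Polyak-averaged target network, matching the constants quoted above.
import jax
import jax.numpy as jnp

EXPECTILE = 0.9      # expectile parameter quoted in the paper
POLYAK_RATE = 0.005  # target-network averaging rate quoted in the paper

def expectile_loss(diff, expectile=EXPECTILE):
    # Asymmetric squared loss: errors where diff <= 0 are down-weighted
    # relative to errors where diff > 0.
    weight = jnp.where(diff > 0, expectile, 1.0 - expectile)
    return weight * diff ** 2

def td_loss(params, target_params, value_fn, batch, discount=0.99):
    # One-step TD target computed with the lagging target network.
    next_v = value_fn(target_params, batch["next_obs"])
    target = batch["reward"] + discount * (1.0 - batch["done"]) * next_v
    v = value_fn(params, batch["obs"])
    return expectile_loss(target - v).mean()

def polyak_update(target_params, params, rate=POLYAK_RATE):
    # target <- (1 - rate) * target + rate * online
    return jax.tree_util.tree_map(
        lambda t, p: (1.0 - rate) * t + rate * p, target_params, params)
```

In this convention the target network would be refreshed after each gradient step, e.g. `target_params = polyak_update(target_params, params)`, which corresponds to the quoted rate λ = 0.005.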