Offline Reinforcement Learning with Pseudometric Learning
Authors: Robert Dadashi, Shideh Rezaeifar, Nino Vieillard, Léonard Hussenot, Olivier Pietquin, Matthieu Geist
ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section we conduct an experimental study of the proposed approach. We evaluate it on a series of hand manipulation tasks (Rajeswaran et al., 2018), as well as MuJoCo locomotion tasks (Todorov et al., 2012; Brockman et al., 2016) with multiple data collection strategies from Fu et al. (2020). We first detail the learning procedure of the pseudometric, before showing its performance against several baselines from Fu et al. (2020). |
| Researcher Affiliation | Collaboration | ¹Google Research, Brain Team; ²University of Geneva; ³Université de Lorraine, CNRS, Inria, IECL, F-54000 Nancy, France; ⁴Univ. de Lille, CNRS, Inria Scool, UMR 9189 CRIStAL. |
| Pseudocode | Yes | Algorithm 1 (Bonus learning): 1: Initialize Φ, Ψ networks. 2: for step i = 1 to N do 3: Train Φ: min_Φ L̂_Φ. 4: Train Ψ: min_Ψ L̂_Ψ. 5: Initialize k-nearest-neighbors array H. 6: for step j = 1 to \|D\| do 7: Compute the k-nearest neighbors of Ψ(s_j). 8: Add the k-nearest neighbors to the array H. Algorithm 2 (Actor-Critic Training): 1: Initialize action-value network Q_ω, target network Q̄_ω̄ := Q_ω, and policy π_θ. 2: for step i = 0 to K do 3: Train Q_ω: min_ω (Q_ω(s, a) − r − γ Q̄_ω̄(s′, π_θ(s′)) − α_c b(s′, π_θ(s′)))². 4: Train π_θ: max_θ Q_ω(s, π_θ(s)) + α_a b(s, π_θ(s)). 5: Update target network Q̄_ω̄ := Q_ω. (A k-nearest-neighbor sketch of lines 5-8 of Algorithm 1 follows the table.) |
| Open Source Code | No | No explicit statement of open-source code release or a direct link to a repository for the paper's methodology was found. |
| Open Datasets | Yes | Finally, we lead an empirical study on the hand manipulation and locomotion tasks of the D4RL benchmark from Fu et al. (2020). |
| Dataset Splits | No | No specific information on training, validation, or test splits (e.g., percentages or counts) is provided. The paper mentions using D4RL datasets and refers to evaluations but not explicit split details for reproduction. |
| Hardware Specification | No | No specific hardware (GPU/CPU models, memory) used for running the experiments is mentioned. |
| Software Dependencies | No | The paper mentions 'scikit-learn (Pedregosa et al., 2011)' and 'JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax' but does not provide specific version numbers for these or other key software dependencies, which are needed for reproducibility. |
| Experiment Setup | Yes | State-action pairs are concatenated (states only in the case of Ψ) and passed to a 2-layer network with layer sizes (1024, 32) and a ReLU activation on top of the first layer. Note that the concatenation step could be preceded by two disjoint layers to which the state and action are passed separately (making the architecture more amenable to visual-based observations). We sample 256 actions to derive the bootstrapped estimate (loss L̂_Ψ). We optimize L̂_Φ and L̂_Ψ using the Adam optimizer (Kingma & Ba, 2015) with batches of state-action pairs and states of size 256. We ran a hyperparameter search over α_a, α_c ∈ {1, 5, 10} and β ∈ {0.1, 0.25, 0.5}. (A sketch of this embedding network follows the table.) |
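
The embedding architecture described in the Experiment Setup row can be sketched as follows. This is a minimal illustration, not the authors' code: the layer sizes (1024, 32), the ReLU on the first layer, and the batch size of 256 come from the paper, while the parameter initialization, example dimensions, and function names are assumptions.

```python
# Minimal sketch of the Phi / Psi embedding networks (assumed structure):
# inputs -> Dense(1024) -> ReLU -> Dense(32). Layer sizes and batch size are
# from the paper; initialization and the example dimensions are illustrative.
import jax
import jax.numpy as jnp


def init_mlp(key, in_dim, hidden=1024, out=32):
    """Initialize a 2-layer MLP: in_dim -> 1024 -> 32."""
    k1, k2 = jax.random.split(key)
    return {
        "w1": jax.random.normal(k1, (in_dim, hidden)) / jnp.sqrt(in_dim),
        "b1": jnp.zeros(hidden),
        "w2": jax.random.normal(k2, (hidden, out)) / jnp.sqrt(hidden),
        "b2": jnp.zeros(out),
    }


def mlp(params, x):
    """Forward pass with a ReLU on top of the first layer only."""
    h = jax.nn.relu(x @ params["w1"] + params["b1"])
    return h @ params["w2"] + params["b2"]


key = jax.random.PRNGKey(0)
k_phi, k_psi, k_batch = jax.random.split(key, 3)
state_dim, action_dim = 17, 6                         # illustrative dimensions
phi_params = init_mlp(k_phi, state_dim + action_dim)  # Phi takes (state, action)
psi_params = init_mlp(k_psi, state_dim)               # Psi takes states only

batch = jax.random.normal(k_batch, (256, state_dim + action_dim))  # batch size 256
print(mlp(phi_params, batch).shape)                   # (256, 32)
```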
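
Lines 5-8 of Algorithm 1 precompute, for every state in the dataset, the k nearest neighbors of its Ψ embedding. Below is a minimal sketch using scikit-learn, which the paper cites; the function name, the choice of k, and the use of `sklearn.neighbors` are assumptions for illustration.

```python
# Minimal sketch of the k-nearest-neighbor precomputation (Algorithm 1,
# lines 5-8). The use of sklearn.neighbors and the value of k are assumptions;
# the paper only states that scikit-learn is used.
import numpy as np
from sklearn.neighbors import NearestNeighbors


def precompute_knn(psi_embeddings: np.ndarray, k: int = 5) -> np.ndarray:
    """Return, for each embedding Psi(s_j), the indices of its k nearest
    neighbors among the dataset embeddings (the array H of Algorithm 1)."""
    knn = NearestNeighbors(n_neighbors=k + 1).fit(psi_embeddings)
    _, indices = knn.kneighbors(psi_embeddings)
    return indices[:, 1:]  # drop the first column: each point is its own neighbor


# Usage with dummy Psi embeddings of |D| = 1000 states in the 32-d embedding space:
H = precompute_knn(np.random.randn(1000, 32), k=5)
print(H.shape)  # (1000, 5)
```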