Offline Reinforcement Learning with Pseudometric Learning
Authors: Robert Dadashi, Shideh Rezaeifar, Nino Vieillard, Léonard Hussenot, Olivier Pietquin, Matthieu Geist
ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section we conduct an experimental study of the proposed approach. We evaluate it on a series of hand manipulation tasks (Rajeswaran et al., 2018), as well as MuJoCo locomotion tasks (Todorov et al., 2012; Brockman et al., 2016) with multiple data collection strategies from Fu et al. (2020). We first detail the learning procedure of the pseudometric, before showing its performance against several baselines from Fu et al. (2020). |
| Researcher Affiliation | Collaboration | ¹Google Research, Brain Team; ²University of Geneva; ³Université de Lorraine, CNRS, Inria, IECL, F-54000 Nancy, France; ⁴Univ. de Lille, CNRS, Inria Scool, UMR 9189 CRIStAL. |
| Pseudocode | Yes | Algorithm 1 (Bonus learning): 1: Initialize Φ, Ψ networks. 2: for step i = 1 to N do 3: Train Φ: min_Φ L̂_Φ. 4: Train Ψ: min_Ψ L̂_Ψ. 5: Initialize k-nearest-neighbors array H. 6: for step j = 1 to \|D\| do 7: Compute the k-nearest neighbors of Ψ(s_j). 8: Add the k-nearest neighbors to the array H. Algorithm 2 (Actor-Critic Training): 1: Initialize action-value network Q_ω, target network Q̄_ω̄ := Q_ω, and policy π_θ. 2: for step i = 0 to K do 3: Train Q_ω: min_ω (Q_ω(s, a) − r − γ Q̄_ω̄(s′, π_θ(s′)) − α_c b(s′, π_θ(s′)))². 4: Train π_θ: max_θ Q_ω(s, π_θ(s)) + α_a b(s, π_θ(s)). 5: Update target network Q̄_ω̄ := Q_ω. (A k-nearest-neighbor sketch of lines 5-8 of Algorithm 1 follows the table.) |
| Open Source Code | No | No explicit statement of open-source code release or a direct link to a repository for the paper's methodology was found. |
| Open Datasets | Yes | Finally, we lead an empirical study on the hand manipulation and locomotion tasks of the D4RL benchmark from Fu et al. (2020). |
| Dataset Splits | No | No specific information on training, validation, or test splits (e.g., percentages or counts) is provided. The paper mentions using D4RL datasets and refers to evaluations but not explicit split details for reproduction. |
| Hardware Specification | No | No specific hardware (GPU/CPU models, memory) used for running the experiments is mentioned. |
| Software Dependencies | No | The paper mentions 'scikit-learn (Pedregosa et al., 2011)' and 'JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax' but does not provide specific version numbers for these or other key software dependencies, which are needed for reproducibility. |
| Experiment Setup | Yes | State-action pairs are concatenated (states only in the case of Ψ) and passed to a 2-layer network with layer sizes (1024, 32) and a ReLU activation on top of the first layer. Note that the concatenation step could be preceded by two disjoint layers to which the state and action are passed separately (making the architecture more amenable to visual-based observations). We sample 256 actions to derive the bootstrapped estimate (loss L̂_Ψ). We optimize L̂_Φ and L̂_Ψ using the Adam optimizer (Kingma & Ba, 2015) with batches of state-action pairs and states of size 256. We ran a hyperparameter search over α_a, α_c ∈ {1, 5, 10} and β ∈ {0.1, 0.25, 0.5}. (A sketch of this embedding network follows the table.) |
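
The embedding architecture described in the Experiment Setup row can be sketched as follows. This is a minimal illustration, not the authors' code: the layer sizes (1024, 32), the ReLU on the first layer, and the batch size of 256 come from the paper, while the parameter initialization, example dimensions, and function names are assumptions.

```python
# Minimal sketch of the Phi / Psi embedding networks (assumed structure):
# inputs -> Dense(1024) -> ReLU -> Dense(32). Layer sizes and batch size are
# from the paper; initialization and the example dimensions are illustrative.
import jax
import jax.numpy as jnp


def init_mlp(key, in_dim, hidden=1024, out=32):
    """Initialize a 2-layer MLP: in_dim -> 1024 -> 32."""
    k1, k2 = jax.random.split(key)
    return {
        "w1": jax.random.normal(k1, (in_dim, hidden)) / jnp.sqrt(in_dim),
        "b1": jnp.zeros(hidden),
        "w2": jax.random.normal(k2, (hidden, out)) / jnp.sqrt(hidden),
        "b2": jnp.zeros(out),
    }


def mlp(params, x):
    """Forward pass with a ReLU on top of the first layer only."""
    h = jax.nn.relu(x @ params["w1"] + params["b1"])
    return h @ params["w2"] + params["b2"]


key = jax.random.PRNGKey(0)
k_phi, k_psi, k_batch = jax.random.split(key, 3)
state_dim, action_dim = 17, 6                         # illustrative dimensions
phi_params = init_mlp(k_phi, state_dim + action_dim)  # Phi takes (state, action)
psi_params = init_mlp(k_psi, state_dim)               # Psi takes states only

batch = jax.random.normal(k_batch, (256, state_dim + action_dim))  # batch size 256
print(mlp(phi_params, batch).shape)                   # (256, 32)
```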
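
Lines 5-8 of Algorithm 1 precompute, for every state in the dataset, the k nearest neighbors of its Ψ embedding. Below is a minimal sketch using scikit-learn, which the paper cites; the function name, the choice of k, and the use of `sklearn.neighbors` are assumptions for illustration.

```python
# Minimal sketch of the k-nearest-neighbor precomputation (Algorithm 1,
# lines 5-8). The use of sklearn.neighbors and the value of k are assumptions;
# the paper only states that scikit-learn is used.
import numpy as np
from sklearn.neighbors import NearestNeighbors


def precompute_knn(psi_embeddings: np.ndarray, k: int = 5) -> np.ndarray:
    """Return, for each embedding Psi(s_j), the indices of its k nearest
    neighbors among the dataset embeddings (the array H of Algorithm 1)."""
    knn = NearestNeighbors(n_neighbors=k + 1).fit(psi_embeddings)
    _, indices = knn.kneighbors(psi_embeddings)
    return indices[:, 1:]  # drop the first column: each point is its own neighbor


# Usage with dummy Psi embeddings of |D| = 1000 states in the 32-d embedding space:
H = precompute_knn(np.random.randn(1000, 32), k=5)
print(H.shape)  # (1000, 5)
```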