For SALE: State-Action Representation Learning for Deep Reinforcement Learning
Authors: Scott Fujimoto, Wei-Di Chang, Edward Smith, Shixiang (Shane) Gu, Doina Precup, David Meger
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We extensively study the design space of these embeddings and highlight important design considerations. We integrate SALE and an adaptation of checkpoints for RL into TD3 to form the TD7 algorithm, which significantly outperforms existing continuous control algorithms. On OpenAI Gym benchmark tasks, TD7 has an average performance gain of 276.7% and 50.7% over TD3 at 300k and 5M time steps, respectively, and works in both the online and offline settings. |
| Researcher Affiliation | Collaboration | Scott Fujimoto, Mila, McGill University; Wei-Di Chang, McGill University; Edward J. Smith, McGill University; Shixiang Shane Gu, Google DeepMind; Doina Precup, Mila, McGill University; David Meger, Mila, McGill University |
| Pseudocode | Yes | Algorithm 1: Online TD7; Algorithm 2: TD7 Train Function; Pseudocode 1: TD7 Network Details (a hedged sketch of the SALE embedding these networks implement follows the table) |
| Open Source Code | Yes | Our code is open-sourced: https://github.com/sfujim/TD7 |
| Open Datasets | Yes | Using OpenAI Gym [Brockman et al., 2016], we benchmark TD7... on the MuJoCo environments [Todorov et al., 2012]. We benchmark TD7... using the MuJoCo datasets in D4RL [Todorov et al., 2012, Fu et al., 2021]. |
| Dataset Splits | No | The paper describes evaluation protocols and data collection, e.g. 'Agents are evaluated every 5000 time steps, taking the average undiscounted sum of rewards over 10 episodes', but does not explicitly provide training/validation/test dataset splits with percentages or sample counts (a sketch of this evaluation protocol follows the table). |
| Hardware Specification | Yes | All experiments are run on a single Nvidia Titan X GPU and Intel Core i7-7700k CPU. |
| Software Dependencies | Yes | Python 3.9.13; PyTorch 2.0.0 [Paszke et al., 2019]; CUDA 11.8; Gym 0.25.0 [Brockman et al., 2016]; MuJoCo 2.3.3 [Todorov et al., 2012] |
| Experiment Setup | Yes | Table 3: TD7 Hyperparameters. Target policy noise σ ~ N(0, 0.2²); target policy noise clipping c = (-0.5, 0.5); policy update frequency 2; ...; mini-batch size 256; target update frequency 250; optimizer Adam [Kingma and Ba, 2014] (shared); learning rate 3e-4 (a hedged mapping of these values onto a TD3-style update follows the table). |
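
The "Pseudocode" row refers to the paper's network details for SALE. As a reading aid, below is a minimal PyTorch sketch of the state-action embedding structure the paper describes: an encoder f maps the state s to z_s, a second encoder g maps (z_s, a) to z_sa, and the pair is trained so that z_sa predicts the embedding of the next state. The layer widths, ELU activations, 256-dimensional embedding, and AvgL1Norm placement are illustrative assumptions; the authors' exact networks are in the open-sourced repository (https://github.com/sfujim/TD7).

```python
# Minimal sketch of the SALE state-action embedding. Layer sizes, activations,
# and normalization placement are assumptions for illustration, not the
# authors' exact architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F


def avg_l1_norm(x: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Normalize each row by its mean absolute value (AvgL1Norm)."""
    return x / x.abs().mean(dim=-1, keepdim=True).clamp(min=eps)


class SALEEncoder(nn.Module):
    def __init__(self, state_dim: int, action_dim: int, emb_dim: int = 256):
        super().__init__()
        # f: state -> z_s
        self.f = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ELU(),
            nn.Linear(256, emb_dim),
        )
        # g: (z_s, a) -> z_sa
        self.g = nn.Sequential(
            nn.Linear(emb_dim + action_dim, 256), nn.ELU(),
            nn.Linear(256, emb_dim),
        )

    def zs(self, state: torch.Tensor) -> torch.Tensor:
        return avg_l1_norm(self.f(state))

    def zsa(self, zs: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.g(torch.cat([zs, action], dim=-1))


def encoder_loss(enc: SALEEncoder, s, a, s_next) -> torch.Tensor:
    """Train z_sa to predict the (detached) embedding of the next state."""
    with torch.no_grad():
        target = enc.zs(s_next)
    return F.mse_loss(enc.zsa(enc.zs(s), a), target)
```

Similarly, the evaluation protocol quoted in the "Dataset Splits" row (every 5000 time steps, average undiscounted return over 10 episodes) can be sketched as a short loop. The four-tuple `env.step` return matches the quoted Gym 0.25.0 dependency, but the `agent.select_action` interface is a hypothetical stand-in.

```python
import gym  # Gym 0.25-style API, matching the quoted dependencies


def evaluate(agent, env_name, episodes=10, seed=0):
    """Average undiscounted return over `episodes` evaluation episodes."""
    env = gym.make(env_name)
    total_return = 0.0
    for ep in range(episodes):
        state = env.reset(seed=seed + ep)
        done = False
        while not done:
            action = agent.select_action(state)        # hypothetical interface
            state, reward, done, _ = env.step(action)  # 4-tuple step (Gym 0.25)
            total_return += reward                     # undiscounted sum
    return total_return / episodes


# In the quoted protocol, evaluate(...) would be called every 5000 training
# time steps, averaging over 10 episodes.
```

Finally, the hyperparameters quoted in the "Experiment Setup" row follow TD3's conventions, so their meaning is easiest to see in code. The sketch below collects the quoted Table 3 values into a config and shows the standard TD3-style target policy smoothing step in which the noise σ and clipping c are applied; it is a generic illustration, not the authors' implementation, and `target_actor` and `max_action` are hypothetical stand-ins.

```python
import torch

# Quoted Table 3 values, collected for readability. "Shared" refers to the
# single Adam learning rate used across networks in the quote.
TD7_HPARAMS = {
    "target_policy_noise_std": 0.2,    # sigma in N(0, 0.2^2)
    "target_policy_noise_clip": 0.5,   # c = (-0.5, 0.5)
    "policy_update_frequency": 2,
    "mini_batch_size": 256,
    "target_update_frequency": 250,
    "learning_rate": 3e-4,             # Adam, shared
}


def smoothed_target_action(target_actor, next_state, max_action=1.0):
    """Generic TD3-style target policy smoothing with the quoted settings."""
    action = target_actor(next_state)
    noise = torch.randn_like(action) * TD7_HPARAMS["target_policy_noise_std"]
    clip = TD7_HPARAMS["target_policy_noise_clip"]
    return (action + noise.clamp(-clip, clip)).clamp(-max_action, max_action)
```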
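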
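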