For SALE: State-Action Representation Learning for Deep Reinforcement Learning
Authors: Scott Fujimoto, Wei-Di Chang, Edward Smith, Shixiang (Shane) Gu, Doina Precup, David Meger
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We extensively study the design space of these embeddings and highlight important design considerations. We integrate SALE and an adaptation of checkpoints for RL into TD3 to form the TD7 algorithm, which significantly outperforms existing continuous control algorithms. On OpenAI Gym benchmark tasks, TD7 has an average performance gain of 276.7% and 50.7% over TD3 at 300k and 5M time steps, respectively, and works in both the online and offline settings. |
| Researcher Affiliation | Collaboration | Scott Fujimoto, Mila, McGill University; Wei-Di Chang, McGill University; Edward J. Smith, McGill University; Shixiang Shane Gu, Google DeepMind; Doina Precup, Mila, McGill University; David Meger, Mila, McGill University |
| Pseudocode | Yes | Algorithm 1: Online TD7; Algorithm 2: TD7 Train Function; Pseudocode 1: TD7 Network Details (a hedged sketch of the SALE embedding these networks implement follows the table) |
| Open Source Code | Yes | Our code is open-sourced: https://github.com/sfujim/TD7 |
| Open Datasets | Yes | Using OpenAI Gym [Brockman et al., 2016], we benchmark TD7... on the MuJoCo environments [Todorov et al., 2012]. We benchmark TD7... using the MuJoCo datasets in D4RL [Todorov et al., 2012, Fu et al., 2021]. |
| Dataset Splits | No | The paper describes evaluation protocols and data collection, e.g. 'Agents are evaluated every 5000 time steps, taking the average undiscounted sum of rewards over 10 episodes', but does not explicitly provide training/validation/test dataset splits with percentages or sample counts (a sketch of this evaluation protocol follows the table). |
| Hardware Specification | Yes | All experiments are run on a single Nvidia Titan X GPU and Intel Core i7-7700k CPU. |
| Software Dependencies | Yes | Python 3.9.13; PyTorch 2.0.0 [Paszke et al., 2019]; CUDA 11.8; Gym 0.25.0 [Brockman et al., 2016]; MuJoCo 2.3.3 [Todorov et al., 2012] |
| Experiment Setup | Yes | Table 3: TD7 Hyperparameters. Target policy noise σ ~ N(0, 0.2²); target policy noise clipping c = (-0.5, 0.5); policy update frequency 2; ...; mini-batch size 256; target update frequency 250; optimizer Adam [Kingma and Ba, 2014] (shared); learning rate 3e-4 (a hedged mapping of these values onto a TD3-style update follows the table). |
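
The "Pseudocode" row refers to the paper's network details for SALE. As a reading aid, below is a minimal PyTorch sketch of the state-action embedding structure the paper describes: an encoder f maps the state s to z_s, a second encoder g maps (z_s, a) to z_sa, and the pair is trained so that z_sa predicts the embedding of the next state. The layer widths, ELU activations, 256-dimensional embedding, and AvgL1Norm placement are illustrative assumptions; the authors' exact networks are in the open-sourced repository (https://github.com/sfujim/TD7).

```python
# Minimal sketch of the SALE state-action embedding. Layer sizes, activations,
# and normalization placement are assumptions for illustration, not the
# authors' exact architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F


def avg_l1_norm(x: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Normalize each row by its mean absolute value (AvgL1Norm)."""
    return x / x.abs().mean(dim=-1, keepdim=True).clamp(min=eps)


class SALEEncoder(nn.Module):
    def __init__(self, state_dim: int, action_dim: int, emb_dim: int = 256):
        super().__init__()
        # f: state -> z_s
        self.f = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ELU(),
            nn.Linear(256, emb_dim),
        )
        # g: (z_s, a) -> z_sa
        self.g = nn.Sequential(
            nn.Linear(emb_dim + action_dim, 256), nn.ELU(),
            nn.Linear(256, emb_dim),
        )

    def zs(self, state: torch.Tensor) -> torch.Tensor:
        return avg_l1_norm(self.f(state))

    def zsa(self, zs: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.g(torch.cat([zs, action], dim=-1))


def encoder_loss(enc: SALEEncoder, s, a, s_next) -> torch.Tensor:
    """Train z_sa to predict the (detached) embedding of the next state."""
    with torch.no_grad():
        target = enc.zs(s_next)
    return F.mse_loss(enc.zsa(enc.zs(s), a), target)
```

Similarly, the evaluation protocol quoted in the "Dataset Splits" row (every 5000 time steps, average undiscounted return over 10 episodes) can be sketched as a short loop. The four-tuple `env.step` return matches the quoted Gym 0.25.0 dependency, but the `agent.select_action` interface is a hypothetical stand-in.

```python
import gym  # Gym 0.25-style API, matching the quoted dependencies


def evaluate(agent, env_name, episodes=10, seed=0):
    """Average undiscounted return over `episodes` evaluation episodes."""
    env = gym.make(env_name)
    total_return = 0.0
    for ep in range(episodes):
        state = env.reset(seed=seed + ep)
        done = False
        while not done:
            action = agent.select_action(state)        # hypothetical interface
            state, reward, done, _ = env.step(action)  # 4-tuple step (Gym 0.25)
            total_return += reward                     # undiscounted sum
    return total_return / episodes


# In the quoted protocol, evaluate(...) would be called every 5000 training
# time steps, averaging over 10 episodes.
```

Finally, the hyperparameters quoted in the "Experiment Setup" row follow TD3's conventions, so their meaning is easiest to see in code. The sketch below collects the quoted Table 3 values into a config and shows the standard TD3-style target policy smoothing step in which the noise σ and clipping c are applied; it is a generic illustration, not the authors' implementation, and `target_actor` and `max_action` are hypothetical stand-ins.

```python
import torch

# Quoted Table 3 values, collected for readability. "Shared" refers to the
# single Adam learning rate used across networks in the quote.
TD7_HPARAMS = {
    "target_policy_noise_std": 0.2,    # sigma in N(0, 0.2^2)
    "target_policy_noise_clip": 0.5,   # c = (-0.5, 0.5)
    "policy_update_frequency": 2,
    "mini_batch_size": 256,
    "target_update_frequency": 250,
    "learning_rate": 3e-4,             # Adam, shared
}


def smoothed_target_action(target_actor, next_state, max_action=1.0):
    """Generic TD3-style target policy smoothing with the quoted settings."""
    action = target_actor(next_state)
    noise = torch.randn_like(action) * TD7_HPARAMS["target_policy_noise_std"]
    clip = TD7_HPARAMS["target_policy_noise_clip"]
    return (action + noise.clamp(-clip, clip)).clamp(-max_action, max_action)
```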
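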
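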