Spectral Decomposition Representation for Reinforcement Learning

Authors: Tongzheng Ren, Tianjun Zhang, Lisa Lee, Joseph E. Gonzalez, Dale Schuurmans, Bo Dai

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In addition, an experimental investigation demonstrates superior performance over current state-of-the-art algorithms across several RL benchmarks. We evaluate SPEDER on the dense-reward MuJoCo tasks (Brockman et al., 2016) and sparse-reward DeepMind Control Suite tasks (Tassa et al., 2018). In MuJoCo tasks, we compare with model-based (e.g., PETS (Chua et al., 2018), ME-TRPO (Kurutach et al., 2018)) and model-free baselines (e.g., SAC (Haarnoja et al., 2018), PPO (Schulman et al., 2017)), showing strong performance compared to SoTA RL algorithms. In particular, we find that in the sparse-reward DeepMind Control tasks, the optimistic SPEDER significantly outperforms the SoTA model-free RL algorithms. We also evaluate the method on offline behavioral cloning tasks in the AntMaze environment using the D4RL benchmark (Fu et al., 2020), and show comparable results to state-of-the-art representation learning methods. Additional details about the experiment setup are described in Appendix F.
Researcher Affiliation | Collaboration | Tongzheng Ren (1,2), Tianjun Zhang (1,3), Lisa Lee (1), Joseph Gonzalez (3), Dale Schuurmans (1,4), Bo Dai (1,5). Affiliations: 1 Google Research, Brain Team; 2 UT Austin; 3 UC Berkeley; 4 University of Alberta; 5 Georgia Tech.
Pseudocode | Yes | Algorithm 1: Online Exploration with SPEDER; Algorithm 2: Offline Policy Optimization with SPEDER (a hypothetical skeleton of the online exploration loop is sketched after this table).
Open Source Code | No | The paper does not provide an explicit statement or link to open-source code for the described methodology.
Open Datasets | Yes | We evaluate SPEDER on the dense-reward MuJoCo tasks (Brockman et al., 2016) and sparse-reward DeepMind Control Suite tasks (Tassa et al., 2018). We also evaluate the method on offline behavioral cloning tasks in the AntMaze environment using the D4RL benchmark (Fu et al., 2020).
Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits or a detailed splitting methodology.
Hardware Specification | No | The paper does not explicitly describe the hardware (specific models or types) used to run its experiments.
Software Dependencies | No | The paper mentions using the Soft Actor-Critic (SAC) algorithm (Haarnoja et al., 2018) for policy training but does not provide specific version numbers for software dependencies.
Experiment Setup | Yes | F.1 ONLINE SETTING: "We list all the hyperparameters and network architectures we use for our experiments. For online MuJoCo and DM Control tasks, the hyperparameters can be found in Table 4." Table 4 (hyperparameters used for SPEDER in all MuJoCo and DM Control Suite environments):
  C: 1.0
  Regularization coef: 1.0
  Bonus coefficient (MuJoCo): 0.0
  Bonus coefficient (DM Control): 5.0
  Actor lr: 0.0003
  Model lr: 0.0003
  Actor network size (MuJoCo): (256, 256)
  Actor network size (DM Control): (1024, 1024)
  SVD embedding network size (MuJoCo): (1024, 1024, 1024)
  SVD embedding network size (DM Control): (1024, 1024, 1024)
  Critic network size (MuJoCo): (1024, 1)
  Critic network size (DM Control): (1024, 1)
  Discount: 0.99
  Target update tau: 0.005
  Model update tau: 0.005
  Batch size: 256
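For convenience, the Table 4 values can be bundled into a single Python dictionary. The sketch below is illustrative only: the `make_speder_config` helper and its key names are hypothetical (the paper does not define a configuration format); the numeric values are those reported in Table 4.

```python
# Hypothetical helper collecting the SPEDER hyperparameters reported in
# Appendix F.1 (Table 4). Structure and key names are illustrative only.

def make_speder_config(suite: str) -> dict:
    """Return the reported hyperparameters for 'mujoco' or 'dm_control'."""
    if suite not in ("mujoco", "dm_control"):
        raise ValueError("suite must be 'mujoco' or 'dm_control'")

    shared = {
        "C": 1.0,
        "regularization_coef": 1.0,
        "actor_lr": 3e-4,
        "model_lr": 3e-4,
        "svd_embedding_hidden": (1024, 1024, 1024),
        "critic_hidden": (1024, 1),
        "discount": 0.99,
        "target_update_tau": 0.005,
        "model_update_tau": 0.005,
        "batch_size": 256,
    }
    # Only the exploration bonus coefficient and actor width differ per suite.
    per_suite = {
        "mujoco": {"bonus_coef": 0.0, "actor_hidden": (256, 256)},
        "dm_control": {"bonus_coef": 5.0, "actor_hidden": (1024, 1024)},
    }
    return {**shared, **per_suite[suite]}


if __name__ == "__main__":
    print(make_speder_config("dm_control"))
```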
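The paper gives Algorithm 1 (Online Exploration with SPEDER) only as pseudocode. As a rough orientation, a loop of this shape typically alternates between fitting the spectral (SVD-style) representation, augmenting rewards with an optimism bonus, and updating a SAC-style policy, consistent with the bonus coefficient in Table 4 and the SAC training noted above. The skeleton below is a hypothetical sketch under those assumptions; `collect_rollout`, `update_representation`, `compute_bonus`, and `sac_update` are placeholder callables supplied by the caller, not the paper's API, and the exact objectives and bonus form are not reproduced.

```python
# Hypothetical skeleton in the spirit of Algorithm 1 (Online Exploration with
# SPEDER). Every callable below is a user-supplied placeholder.
from typing import Callable, List, Tuple

# A transition is (state, action, reward, next_state).
Transition = Tuple[object, object, float, object]


def online_exploration_loop(
    collect_rollout: Callable[[], List[Transition]],
    update_representation: Callable[[List[Transition]], None],
    compute_bonus: Callable[[object, object], float],
    sac_update: Callable[[object, object, float, object], None],
    num_iterations: int,
    bonus_coef: float = 5.0,  # DM Control bonus coefficient from Table 4
    batch_size: int = 256,    # batch size from Table 4
) -> List[Transition]:
    """Alternate data collection, representation fitting, and policy updates."""
    buffer: List[Transition] = []
    for _ in range(num_iterations):
        # 1. Collect transitions with the current (exploratory) policy.
        buffer.extend(collect_rollout())
        # 2. Fit the spectral embedding of the transition dynamics on the data.
        update_representation(buffer)
        # 3. Optimism: add a representation-based bonus to the rewards, then
        #    take SAC-style actor/critic steps on the augmented rewards.
        for state, action, reward, next_state in buffer[-batch_size:]:
            optimistic_reward = reward + bonus_coef * compute_bonus(state, action)
            sac_update(state, action, optimistic_reward, next_state)
    return buffer
```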