CAQL: Continuous Action Q-Learning
Authors: Moonkyung Ryu, Yinlam Chow, Ross Anderson, Christian Tjandraatmadja, Craig Boutilier
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We compare CAQL with state-of-the-art RL algorithms on benchmark continuous-control problems that have different degrees of action constraints and show that CAQL outperforms policy-based methods in heavily constrained environments, often dramatically. ... We evaluate CAQL on one classical control benchmark (Pendulum) and five MuJoCo benchmarks (Hopper, Walker2D, HalfCheetah, Ant, Humanoid). |
| Researcher Affiliation | Industry | Moonkyung Ryu*, Yinlam Chow*, Ross Anderson, Christian Tjandraatmadja, Craig Boutilier Google Research {mkryu,yinlamchow,rander,ctjandra,cboutilier}@google.com |
| Pseudocode | Yes | Algorithm 1 Continuous Action Q-learning (CAQL) |
| Open Source Code | No | The paper does not provide a direct link or explicit statement about the open-sourcing of the CAQL implementation code developed in this paper. |
| Open Datasets | Yes | We evaluate CAQL on one classical control benchmark (Pendulum) and five MuJoCo benchmarks (Hopper, Walker2D, HalfCheetah, Ant, Humanoid). ... Table 6: Benchmark Environments. |
| Dataset Splits | No | The paper discusses training on data collected from simulation environments and stored in a replay buffer, sampling mini-batches for training. It does not specify fixed train/validation/test dataset splits in the conventional sense for static datasets. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for experiments, such as GPU/CPU models or cloud instance types. |
| Software Dependencies | Yes | We use SCIP 6.0.0 (Gleixner et al., 2018) for the MIP solver. |
| Experiment Setup | Yes | Details on network architectures and hyperparameters are described in Appendix D. ... We use a two-hidden-layer neural network with ReLU activation (32 units in the first layer and 16 units in the second layer) for both the Q-function and the action function. ... A time limit of 60 seconds and an optimality gap limit of 10⁻⁴ are used for all experiments. For GA and CEM, a maximum of 20 iterations and a convergence threshold of 10⁻⁶ are used for all experiments if not stated otherwise. Table 7: Hyperparameter settings for CAQL and NAF. Table 8: Hyperparameter settings for DDPG, TD3, SAC. |
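To make the reported setup concrete, the sketch below shows (a) a Q-network matching the stated architecture (two hidden ReLU layers of 32 and 16 units, taking a state-action pair and returning a scalar), and (b) a cross-entropy-method (CEM) maximizer over actions using the reported limits (at most 20 iterations, 10⁻⁶ convergence threshold). This is a minimal stdlib-Python illustration under those assumptions, not the authors' implementation: CAQL's main max-Q solver is a MIP (SCIP 6.0.0), with GA and CEM as approximate alternatives, and all function names, sample counts, and weight initializations here are hypothetical.

```python
import random

def relu(x):
    return [max(0.0, v) for v in x]

def linear(x, W, b):
    # y = W x + b, with W stored as a list of rows
    return [sum(w * xi for w, xi in zip(row, x)) + bj
            for row, bj in zip(W, b)]

def make_layer(n_in, n_out, rng):
    # Hypothetical uniform init; the paper does not specify one here.
    W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    b = [0.0] * n_out
    return W, b

def q_network(state, action, params):
    # Two hidden ReLU layers (32 then 16 units), scalar output,
    # matching the architecture quoted in the row above.
    (W1, b1), (W2, b2), (W3, b3) = params
    x = list(state) + list(action)          # Q takes (s, a) concatenated
    h1 = relu(linear(x, W1, b1))
    h2 = relu(linear(h1, W2, b2))
    return linear(h2, W3, b3)[0]            # scalar Q-value

def cem_max_q(q, state, dim, low, high, rng,
              n_samples=64, n_elite=6, max_iters=20, tol=1e-6):
    # Approximate max_a Q(s, a) by CEM, using the reported limits:
    # at most 20 iterations, convergence threshold 1e-6.
    mu = [(low + high) / 2.0] * dim
    sigma = [(high - low) / 2.0] * dim
    for _ in range(max_iters):
        samples = [[min(high, max(low, rng.gauss(m, s)))
                    for m, s in zip(mu, sigma)] for _ in range(n_samples)]
        elite = sorted(samples, key=lambda a: q(state, a), reverse=True)[:n_elite]
        new_mu = [sum(e[i] for e in elite) / n_elite for i in range(dim)]
        new_sigma = [(sum((e[i] - new_mu[i]) ** 2 for e in elite) / n_elite) ** 0.5
                     for i in range(dim)]
        if max(abs(a - b) for a, b in zip(mu, new_mu)) < tol:
            mu = new_mu
            break
        mu, sigma = new_mu, new_sigma
    return mu
```

For example, on Pendulum (3-dimensional state, 1-dimensional action in [-1, 1] after scaling) the input layer would have 4 units, and `cem_max_q` would search the one-dimensional action interval.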