Robust Reinforcement Learning via Progressive Task Sequence
Authors: Yike Li, Yunzhe Tian, Endong Tong, Wenjia Niu, Jiqiang Liu
IJCAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, extensive experiments demonstrate that the proposed method exhibits significant performance on the unmanned Car Racing game and multiple high-dimensional MuJoCo environments. |
| Researcher Affiliation | Academia | Yike Li, Yunzhe Tian, Endong Tong, Wenjia Niu and Jiqiang Liu, Beijing Key Laboratory of Security and Privacy in Intelligent Transportation, Beijing Jiaotong University, {yikeli, tianyunzhe, edtong, niuwj, jqliu}@bjtu.edu.cn |
| Pseudocode | Yes | Algorithm 1 Iterative training of DRRL |
| Open Source Code | Yes | The code is available at https://github.com/li-yike/DRRL. |
| Open Datasets | Yes | We implement the Hopper, HalfCheetah, and Walker2d benchmarks using OpenAI Gym [Brockman et al., 2016] with the MuJoCo simulator [Todorov et al., 2012]. ... The ablation study is implemented in a new benchmark Car Racing, a top-down car racing from pixels environment with a three-dimensional continuous action space (i.e., steer, gas, brake). (See the environment sketch below the table.) |
| Dataset Splits | No | The paper does not explicitly provide details about training, validation, and test splits (e.g., percentages or sample counts) for the datasets used. While hyperparameters are mentioned as being tuned by grid search, a specific validation set or splitting methodology for validation is not described. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU, GPU models, memory, or specific cloud instances) used for running the experiments. It mentions the use of the 'MuJoCo simulator' and a 'neural network' but no hardware specifications. |
| Software Dependencies | No | The paper mentions software components like 'OpenAI Gym' and the 'MuJoCo simulator', and algorithms such as 'TRPO', 'RARL', 'RAP', 'DR', and 'PPO'. However, it does not specify version numbers for any of these software dependencies, which is required for reproducibility. |
| Experiment Setup | Yes | We set the learning rate as 0.01 and other hyper-parameters (e.g., batchsize, discount factor) are tuned by grid search. For the GA implementation, the population size NP, the parent population size K, and the mutation rate Pm are set as 250, 50 and 0.9, respectively. ... In our experiments, we use TRPO as the policy optimizer implemented by a neural network consisting of three hidden layers with 100 neurons each. (See the configuration sketch below the table.) |
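
As a hedged illustration of the Open Datasets row, the minimal sketch below instantiates the benchmarks the paper names using OpenAI Gym. The environment IDs and version suffixes are assumptions, since the paper does not pin Gym or MuJoCo versions (see the Software Dependencies row); this is not the authors' code.

```python
# Minimal sketch: creating the benchmarks named in the paper with OpenAI Gym.
# Environment IDs and version suffixes (e.g., "-v3", "-v2") are assumptions;
# the MuJoCo environments require the MuJoCo simulator, and CarRacing requires Box2D.
import gym

MUJOCO_ENVS = ["Hopper-v3", "HalfCheetah-v3", "Walker2d-v3"]  # assumed version suffixes
CAR_RACING_ENV = "CarRacing-v2"  # assumed ID for the pixel-based Car Racing ablation


def make_envs():
    """Create the MuJoCo benchmarks plus the Car Racing ablation environment."""
    envs = {name: gym.make(name) for name in MUJOCO_ENVS}
    envs["CarRacing"] = gym.make(CAR_RACING_ENV)
    return envs


if __name__ == "__main__":
    for name, env in make_envs().items():
        print(name, env.observation_space, env.action_space)
        env.close()
```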
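The Experiment Setup row reports concrete hyper-parameter values; the sketch below collects them into a configuration object and builds the three-hidden-layer, 100-neuron policy network described for the TRPO optimizer. The PyTorch framing, field names, and the tanh activation are assumptions; only the numeric values come from the paper.

```python
# Sketch of the reported experiment setup. Field names, the PyTorch implementation,
# and the tanh activation are assumptions -- only the numbers are from the paper.
from dataclasses import dataclass

import torch
import torch.nn as nn


@dataclass
class DRRLConfig:
    learning_rate: float = 0.01    # reported learning rate
    ga_population_size: int = 250  # NP
    ga_parent_size: int = 50       # K
    ga_mutation_rate: float = 0.9  # Pm
    # Batch size and discount factor were tuned by grid search; values are not reported.


class TRPOPolicy(nn.Module):
    """Policy network with three hidden layers of 100 neurons each,
    as described for the TRPO optimizer in the paper."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)
```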