Robust Reinforcement Learning via Progressive Task Sequence

Authors: Yike Li, Yunzhe Tian, Endong Tong, Wenjia Niu, Jiqiang Liu

IJCAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Finally, extensive experiments demonstrate that the proposed method exhibits significant performance on the unmanned Car Racing game and multiple high-dimensional MuJoCo environments.
Researcher Affiliation | Academia | Yike Li, Yunzhe Tian, Endong Tong, Wenjia Niu and Jiqiang Liu, Beijing Key Laboratory of Security and Privacy in Intelligent Transportation, Beijing Jiaotong University, {yikeli, tianyunzhe, edtong, niuwj, jqliu}@bjtu.edu.cn
Pseudocode | Yes | Algorithm 1: Iterative training of DRRL
Open Source Code | Yes | The code is available at https://github.com/li-yike/DRRL.
Open Datasets | Yes | We implement the Hopper, HalfCheetah, and Walker2d benchmarks using OpenAI Gym [Brockman et al., 2016] with the MuJoCo simulator [Todorov et al., 2012]. ... The ablation study is implemented in a new benchmark, Car Racing, a top-down car racing from pixels environment with a three-dimensional continuous action space (i.e., steer, gas, brake). (See the environment sketch below the table.)
Dataset Splits | No | The paper does not explicitly provide training, validation, and test split details (e.g., percentages or sample counts) for the datasets used. While hyperparameters are mentioned as being tuned by grid search, no specific validation set or splitting methodology is described.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU, GPU models, memory, or specific cloud instances) used for running the experiments. It mentions the use of the 'MuJoCo simulator' and a 'neural network' but no hardware specifications.
Software Dependencies | No | The paper mentions software components like 'OpenAI Gym' and the 'MuJoCo simulator', and algorithms such as 'TRPO', 'RARL', 'RAP', 'DR', and 'PPO'. However, it does not specify version numbers for any of these software dependencies, which are required for reproducibility.
Experiment Setup | Yes | We set the learning rate as 0.01 and other hyper-parameters (e.g., batch size, discount factor) are tuned by grid search. For the GA implementation, the population size NP, the parent population size K, and the mutation rate Pm are set as 250, 50 and 0.9, respectively. ... In our experiments, we use TRPO as the policy optimizer implemented by a neural network consisting of three hidden layers with 100 neurons each. (See the configuration sketch below the table.)
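
To make the benchmark setup in the Open Datasets row concrete, here is a minimal sketch of instantiating the named environments through OpenAI Gym. The environment IDs ("-v2", "CarRacing-v0"), the Gym version, and the reset/step conventions are assumptions for illustration; the paper does not report them.

import gym

# MuJoCo locomotion benchmarks named in the paper: Hopper, HalfCheetah, Walker2d.
# The "-v2" IDs are an assumption; the paper does not give environment versions.
mujoco_envs = [gym.make(name) for name in ("Hopper-v2", "HalfCheetah-v2", "Walker2d-v2")]

# Car Racing ablation benchmark: top-down driving from pixels with a
# three-dimensional continuous action space (steer, gas, brake).
car_racing = gym.make("CarRacing-v0")

for env in mujoco_envs + [car_racing]:
    obs = env.reset()
    print(env.spec.id, env.observation_space.shape, env.action_space.shape)
    env.close()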
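
The reported experiment setup can also be summarized as a small configuration sketch. Only the quoted numbers (learning rate 0.01; GA population size 250, parent population 50, mutation rate 0.9; a policy with three hidden layers of 100 neurons each) come from the paper; the PyTorch framework, tanh activations, action-mean head, and all names below are assumptions.

import torch.nn as nn

class TRPOPolicy(nn.Module):
    """Policy network for the TRPO optimizer: three hidden layers of 100 neurons each."""
    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 100), nn.Tanh(),  # hidden layer 1
            nn.Linear(100, 100), nn.Tanh(),      # hidden layer 2
            nn.Linear(100, 100), nn.Tanh(),      # hidden layer 3
            nn.Linear(100, act_dim),             # action-mean head (assumed, not stated in the paper)
        )

    def forward(self, obs):
        return self.net(obs)

# Hyper-parameters quoted in the paper; batch size and discount factor were
# tuned by grid search and are not reported, so they are left as placeholders.
config = {
    "learning_rate": 0.01,
    "ga_population_size_NP": 250,
    "ga_parent_population_K": 50,
    "ga_mutation_rate_Pm": 0.9,
    "batch_size": None,       # tuned by grid search (value not reported)
    "discount_factor": None,  # tuned by grid search (value not reported)
}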