Master-Slave Curriculum Design for Reinforcement Learning

Authors: Yuechen Wu, Wei Zhang, Ke Song

IJCAI 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive evaluation on the ViZDoom platform demonstrates that the joint learning of the master agent and slave agents mutually benefits both. Significant improvement over A3C is obtained in terms of learning speed and performance.
Researcher Affiliation | Academia | Yuechen Wu, Wei Zhang, Ke Song, School of Control Science and Engineering, Shandong University, {wuyuechen, songke vsislab}@mail.sdu.edu.cn, davidzhangsdu@gmail.com
Pseudocode | Yes | Algorithm 1: Master-Slave Curriculum Learning
Open Source Code | No | The paper does not provide an explicit statement or a link to open-source code for the described methodology.
Open Datasets | No | The paper states, "Evaluation is conducted on the Viz Doom platform" and describes several scenarios. While ViZDoom is a known platform, the paper does not specify a publicly available dataset with concrete access information (link, or citation with authors/year) used for training in their specific experiments.
Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits with specific percentages, sample counts, or references to predefined splits needed for reproduction. It refers to n-step returns and t_max-step rollouts, which concern the update process rather than data splitting.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory specifications) used for running the experiments.
Software Dependencies | No | The paper mentions that "RMSProp was performed to optimize the network in TensorFlow," but does not provide specific version numbers for TensorFlow or any other software dependencies.
Experiment Setup | Yes | For all experiments, we set the discount factor γ = 0.99, the RMSProp decay factor α = 0.99, the exploration rate ϵ = 0.1, and the entropy regularization term β = 0.01. ... In the experiment, we used 16 threads and performed updates after every 80 actions (i.e., t_max = 20 and m = 4).
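
For readers attempting a reproduction, the hyperparameters quoted above can be collected into a single configuration. The following is a minimal sketch, assuming a standard A3C-style asynchronous setup; the variable names and structure are ours, not the authors', and the reading that the 80-action update interval equals t_max × m is an assumption consistent with the quoted numbers.

```python
# Hyperparameters quoted from the paper's experiment-setup description.
# The dictionary keys and the t_max * m interpretation are illustrative assumptions,
# not taken from the authors' (unreleased) code.
A3C_CONFIG = {
    "discount_gamma": 0.99,       # discount factor γ
    "rmsprop_decay_alpha": 0.99,  # RMSProp decay factor α
    "exploration_epsilon": 0.1,   # exploration rate ε
    "entropy_beta": 0.01,         # entropy regularization term β
    "num_threads": 16,            # asynchronous learner threads
    "t_max": 20,                  # steps per rollout before an update
    "m": 4,                       # rollouts per update (assumed meaning of m)
}

# Updates are performed every 80 actions, which matches t_max * m = 20 * 4.
update_interval = A3C_CONFIG["t_max"] * A3C_CONFIG["m"]
assert update_interval == 80
```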
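Since the Dataset Splits entry notes that training relies on n-step returns over t_max-step rollouts rather than on a fixed data split, a short sketch of how such returns are typically computed may be useful. This is the standard A3C-style backward recursion with the quoted discount factor γ = 0.99, not the authors' implementation.

```python
def n_step_returns(rewards, bootstrap_value, gamma=0.99):
    """Compute discounted n-step returns for one rollout of at most t_max rewards.

    Standard A3C-style bootstrapping: R_t = r_t + gamma * R_{t+1}, seeded with the
    critic's value estimate for the state reached when the rollout is cut off.
    """
    returns = []
    R = bootstrap_value
    for r in reversed(rewards):
        R = r + gamma * R
        returns.append(R)
    returns.reverse()
    return returns

# Example: a 4-step rollout with a bootstrapped value of 0.5 at the cut-off state.
print(n_step_returns([1.0, 0.0, 0.0, 1.0], bootstrap_value=0.5, gamma=0.99))
```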