Episodic Exploration for Deep Deterministic Policies: An Application to StarCraft Micromanagement Tasks

Authors: Nicolas Usunier, Gabriel Synnaeve, Zeming Lin, Soumith Chintala

ICLR 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that this algorithm allows to successfully learn non-trivial strategies for scenarios with armies of up to 15 agents, where both Q-learning and REINFORCE struggle.
Researcher Affiliation | Industry | Nicolas Usunier, Gabriel Synnaeve, Zeming Lin, Soumith Chintala, Facebook AI Research, {usunier,gab,zlin,soumith}@fb.com
Pseudocode | Yes | Algorithm 1: Zero-order (ZO) backpropagation algorithm (a generic sketch of the zero-order idea appears after this table).
Open Source Code | No | The paper uses Torch7 and refers to TorchCraft (Synnaeve et al., 2016), which is a library, but it does not state that the source code for its own method described in the paper is open-source or publicly released.
Open Datasets | No | The paper uses scenarios from the real-time strategy game StarCraft as benchmarks but does not provide concrete access information (link, DOI, repository, or formal citation with authors/year) for a publicly available dataset. Instead, it describes interaction with the game engine as a simulation environment.
Dataset Splits | No | The paper refers to 'training scenarios' and 'out-of-training-domain maps' but does not provide specific dataset split information (exact percentages, sample counts, or citations to predefined splits) needed to reproduce the data partitioning. It mentions 'validation' as a key in the output schema, but no text explicitly covers it.
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, or memory amounts) used for running its experiments.
Software Dependencies | No | The paper mentions Torch7 and specific optimizers (RMSProp, Adagrad) but does not provide specific version numbers for these or other ancillary software components needed to replicate the experiment.
Experiment Setup | Yes | We ran all the following experiments with a skip_frames of 9 (meaning that we take about 2.6 actions per unit per second). We optimize all the models after each battle (episode) with RMSProp (momentum 0.99 or 0.95), except for zero-order, for which we optimized with Adagrad (Adagrad did not seem to work better for DQN nor REINFORCE). In any case, the learning rate was chosen among {10^-2, 10^-3, 10^-4}. For Q-learning (DQN), we tried two annealing schemes for epsilon-greedy: ϵ = ϵ0 / (1 + ϵa·ϵ0·t), with t the optimization batch, and ϵ = max(0.01, ϵ0 − ϵa·t), both with ϵ0 ∈ {0.1, 1}, and respectively ϵa ∈ {0, ϵ0} and ϵa ∈ {10^-5, 10^-4, 10^-3}. For REINFORCE we searched over τ ∈ {0.1, 0.5, 1, 10}. For zero-order, we tried δ ∈ {0.1, 0.01, 0.001}.
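
As a quick illustration of the two epsilon-greedy annealing schemes quoted in the Experiment Setup row, here is a minimal Python sketch. The function names and the batch counter t are illustrative assumptions, not code from the paper.

```python
# Minimal sketch of the two epsilon annealing schedules quoted above.
# Function names and the batch counter `t` are illustrative, not from the paper.

def epsilon_hyperbolic(t, eps0, eps_a):
    """First scheme: eps = eps0 / (1 + eps_a * eps0 * t), with eps0 in {0.1, 1}
    and eps_a in {0, eps0} (eps_a = 0 keeps epsilon constant)."""
    return eps0 / (1.0 + eps_a * eps0 * t)

def epsilon_linear(t, eps0, eps_a, eps_min=0.01):
    """Second scheme: eps = max(0.01, eps0 - eps_a * t), with
    eps_a in {1e-5, 1e-4, 1e-3}."""
    return max(eps_min, eps0 - eps_a * t)

if __name__ == "__main__":
    # Both schedules decay as the optimization batch index t grows.
    for t in (0, 1_000, 10_000, 100_000):
        print(t, epsilon_hyperbolic(t, eps0=1.0, eps_a=1.0),
              epsilon_linear(t, eps0=1.0, eps_a=1e-4))
```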
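
For context on the "Zero-order (ZO) backpropagation" pseudocode noted in the Pseudocode row, the following is a generic one-point zero-order gradient-ascent step: a hedged sketch of the general idea, not the paper's Algorithm 1 (which perturbs only part of the network and combines the estimate with backpropagation). The run_episode callback, the learning rate, and all other names are assumptions introduced for illustration.

```python
import numpy as np

def zero_order_step(theta, run_episode, delta=0.01, lr=1e-3, rng=None):
    """Generic one-point zero-order update: sample a random direction u,
    run one whole episode with parameters perturbed by delta * u, and use
    the episode return R to form the estimate (R / delta) * u.
    This illustrates the zero-order idea, not the paper's Algorithm 1."""
    rng = rng or np.random.default_rng()
    u = rng.standard_normal(theta.shape)
    u /= np.linalg.norm(u)                  # random unit direction
    ret = run_episode(theta + delta * u)    # cumulative reward of one episode
    grad_estimate = (ret / delta) * u       # one-point zero-order estimate
    return theta + lr * grad_estimate       # gradient-ascent step
```

Under these assumptions, a driver would call zero_order_step once per battle (episode), matching the per-episode update schedule described in the Experiment Setup row, with delta searched over {0.1, 0.01, 0.001}.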