Episodic Exploration for Deep Deterministic Policies for StarCraft Micromanagement
Authors: Nicolas Usunier, Gabriel Synnaeve, Zeming Lin, Soumith Chintala
ICLR 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that this algorithm makes it possible to successfully learn non-trivial strategies for scenarios with armies of up to 15 agents, where both Q-learning and REINFORCE struggle. |
| Researcher Affiliation | Industry | Nicolas Usunier, Gabriel Synnaeve, Zeming Lin, Soumith Chintala. Facebook AI Research. {usunier,gab,zlin,soumith}@fb.com |
| Pseudocode | Yes | Algorithm 1: Zero-order (ZO) backpropagation algorithm |
| Open Source Code | No | The paper uses Torch7 and refers to TorchCraft (Synnaeve et al., 2016), which is a library, but does not explicitly state that the source code for the methodology described in the paper is open-source or publicly released. |
| Open Datasets | No | The paper uses scenarios from the real-time strategy game StarCraft as benchmarks but does not provide concrete access information (link, DOI, repository, or formal citation with authors/year) for a publicly available or open dataset. Instead, it describes interaction with the game engine as a simulation environment. |
| Dataset Splits | No | The paper refers to 'training scenarios' and 'out-of-training-domain maps' but does not provide specific dataset split information (exact percentages, sample counts, or citations to predefined splits) needed to reproduce the data partitioning, and no validation split is explicitly described. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, or memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions 'Torch7' and specific optimizers (RMSProp, Adagrad) but does not provide specific version numbers for these or other ancillary software components needed to replicate the experiment. |
| Experiment Setup | Yes | We ran all the following experiments with a skip_frames of 9 (meaning that we take about 2.6 actions per unit per second). We optimize all the models after each battle (episode), with RMSProp (momentum 0.99 or 0.95), except for zero-order, which we optimized with Adagrad (Adagrad did not seem to work better for DQN nor REINFORCE). In any case, the learning rate was chosen among {10⁻², 10⁻³, 10⁻⁴}. For Q-learning (DQN), we tried two schemes of annealing for epsilon greedy, ε = ε₀ / (1 + εₐ·ε₀·t) with t the optimization batch, and ε = max(0.01, ε₀ − εₐ·t), both with ε₀ ∈ {0.1, 1}, and respectively εₐ ∈ {0, ε₀} and εₐ ∈ {10⁻⁵, 10⁻⁴, 10⁻³}. For REINFORCE we searched over τ ∈ {0.1, 0.5, 1, 10}. For zero-order, we tried δ ∈ {0.1, 0.01, 0.001}. |
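
The two ε-greedy annealing schemes quoted in the Experiment Setup row are simple enough to restate in code. The sketch below is a minimal illustration in plain Python (not the authors' Torch7 code, which is not publicly released); the formulas and hyperparameter grids come from the row above, while the function names and the `eps_min` floor argument are assumptions.

```python
# Minimal sketch of the two epsilon-greedy annealing schedules described above.
# Formulas and grids follow the paper; names are illustrative.

def epsilon_hyperbolic(t, eps0, eps_a):
    """Scheme 1: eps = eps0 / (1 + eps_a * eps0 * t), with t the optimization batch."""
    return eps0 / (1.0 + eps_a * eps0 * t)

def epsilon_linear(t, eps0, eps_a, eps_min=0.01):
    """Scheme 2: eps = max(0.01, eps0 - eps_a * t)."""
    return max(eps_min, eps0 - eps_a * t)

# Grids searched in the paper.
LEARNING_RATES = [1e-2, 1e-3, 1e-4]
EPS0_VALUES = [0.1, 1.0]
EPS_A_LINEAR = [1e-5, 1e-4, 1e-3]   # scheme 2; scheme 1 uses eps_a in {0, eps0}

if __name__ == "__main__":
    t = 10_000  # example optimization batch index
    print(epsilon_hyperbolic(t, eps0=1.0, eps_a=1.0))  # ~1e-4
    print(epsilon_linear(t, eps0=1.0, eps_a=1e-4))     # hits the 0.01 floor
```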
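
The Pseudocode row names Algorithm 1, the zero-order (ZO) backpropagation algorithm, and the Experiment Setup row lists the perturbation sizes δ ∈ {0.1, 0.01, 0.001} searched for it. The snippet below is a generic two-point zero-order gradient estimator, shown only to illustrate the family of techniques; it is not the paper's Algorithm 1, which, as described in the paper, estimates update directions from random perturbations of the policy network evaluated on episode returns. All names here are illustrative.

```python
import numpy as np

def zero_order_gradient(f, theta, delta=0.01, rng=None):
    """Two-point zero-order gradient estimate of a scalar objective f at theta.

    A standard finite-difference estimate along one random unit direction, scaled
    by the dimension so that averaging many samples approximates the gradient of
    a smoothed version of f. Generic illustration only, not the paper's Algorithm 1.
    """
    rng = rng or np.random.default_rng()
    u = rng.standard_normal(theta.shape)
    u /= np.linalg.norm(u)                      # random direction on the unit sphere
    fd = (f(theta + delta * u) - f(theta - delta * u)) / (2.0 * delta)
    return theta.size * fd * u

if __name__ == "__main__":
    # Toy check on a quadratic whose true gradient at theta is 2 * theta.
    f = lambda th: float(np.sum(th ** 2))
    theta = np.array([1.0, -2.0, 0.5])
    for delta in (0.1, 0.01, 0.001):            # perturbation sizes searched in the paper
        grads = [zero_order_gradient(f, theta, delta) for _ in range(20_000)]
        print(delta, np.mean(grads, axis=0))    # each approaches [2., -4., 1.]
```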