Exponentially Weighted Imitation Learning for Batched Historical Data
Authors: Qing Wang, Jiechao Xiong, Lei Han, Peng Sun, Han Liu, Tong Zhang
NeurIPS 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Under mild conditions, our algorithm, though surprisingly simple, has a policy improvement bound and outperforms most competing methods empirically. Thorough numerical results are also provided to demonstrate the efficacy of the proposed methodology. |
| Researcher Affiliation | Collaboration | ¹Tencent AI Lab, ²Northwestern University; {drwang, jcxiong, lxhan, pythonsun}@tencent.com, hanliu@northwestern.edu, tongzhang@tongzhang-ml.org |
| Pseudocode | Yes | Algorithm 1 Monotonic Advantage Re-Weighted Imitation Learning (MARWIL) (a minimal sketch of the loss it optimizes is given after the table) |
| Open Source Code | No | The paper does not provide any concrete access information (e.g., repository link, explicit statement of code release) for the source code of the described methodology. |
| Open Datasets | No | The paper mentions using environments like HFO and TORCS, and collecting 'human replay files' for King of Glory, but it does not provide concrete access information (links, citations with author/year, or specific repository names) to any publicly available or open datasets used for training. |
| Dataset Splits | No | The paper does not provide specific dataset split information (percentages, sample counts, or citations to predefined splits) for training, validation, and testing. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. It only mentions 'A DNN based function approximator is adopted' for King of Glory. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x) that are needed to replicate the experiment. |
| Experiment Setup | Yes | For the HFO game, we model the 3 discrete actions with multinomial probabilities and the 2 continuous parameters for each action with normal distributions of known σ = 0.2 but unknown µ... Then we optimize the policy loss Lp and the value loss Lv simultaneously, with a mixture coefficient cv as a hyper-parameter (by default cv = 1). Each experiment is repeated 3 times and the average of scores is reported in Figure 1. Additional details of the algorithm settings are given in Appendix B.2. (A hedged sketch of this joint objective also follows the table.) |
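
As a reading aid, here is a minimal sketch of the exponentially weighted imitation loss that Algorithm 1 (MARWIL) optimizes: the behaviour-cloning log-likelihood on the batched historical data is re-weighted by exp(β·Â(s, a)). The function name and the optional advantage normaliser `c` are illustrative assumptions; only the weighting scheme itself comes from the paper, and β = 0 recovers plain imitation learning.

```python
import torch


def marwil_policy_loss(log_probs, advantages, beta=1.0, c=None):
    """Sketch of the MARWIL exponentially weighted imitation loss.

    log_probs:  log pi_theta(a_t | s_t) for the batched (s_t, a_t) pairs.
    advantages: advantage estimates A_hat(s_t, a_t) from the behaviour data.
    beta:       temperature of the exponential weighting (beta = 0 gives
                plain behaviour cloning).
    c:          optional advantage normaliser (hypothetical; a practical
                implementation may rescale advantages before weighting).
    """
    if c is not None:
        advantages = advantages / c
    # Exponential advantage weights, detached so gradients flow only
    # through the imitation (log-likelihood) term.
    weights = torch.exp(beta * advantages).detach()
    return -(weights * log_probs).mean()
```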
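
Similarly, a hedged sketch of the HFO experiment setup quoted above: a hybrid action distribution (multinomial over the 3 discrete actions, Gaussians with known σ = 0.2 and learned µ for the 2 continuous parameters of the chosen action) and the joint objective Lp + cv·Lv with cv = 1 by default. The tensor shapes, the use of empirical returns as value targets, the β value, and the function name are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F
from torch.distributions import Categorical, Normal

SIGMA = 0.2   # known std of the continuous action parameters (from the paper)
CV = 1.0      # mixture coefficient for the value loss (paper default)
BETA = 1.0    # assumed exponential-weighting temperature


def hfo_joint_loss(action_logits, param_means, values,
                   actions, params, returns, advantages):
    """Sketch of the joint objective L_p + c_v * L_v for the HFO setup.

    action_logits: (B, 3) logits over the 3 discrete actions.
    param_means:   (B, 3, 2) predicted means of the 2 continuous parameters
                   of each discrete action (std fixed at SIGMA).
    values:        (B,) state-value predictions V_theta(s).
    actions:       (B,) discrete actions taken in the batched data.
    params:        (B, 2) continuous parameters of the chosen action.
    returns:       (B,) return estimates used as value targets (assumption).
    advantages:    (B,) advantage estimates from the behaviour data.
    """
    # Hybrid log-likelihood: multinomial over discrete actions plus
    # Gaussians (known sigma, unknown mu) for the chosen action's parameters.
    log_p_discrete = Categorical(logits=action_logits).log_prob(actions)
    chosen_means = param_means[torch.arange(actions.shape[0]), actions]
    log_p_params = Normal(chosen_means, SIGMA).log_prob(params).sum(dim=-1)
    log_probs = log_p_discrete + log_p_params

    # Exponentially weighted imitation term (policy loss L_p).
    weights = torch.exp(BETA * advantages).detach()
    policy_loss = -(weights * log_probs).mean()
    # Value regression term (value loss L_v).
    value_loss = F.mse_loss(values, returns)
    return policy_loss + CV * value_loss
```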