Independence-aware Advantage Estimation
Authors: Pushi Zhang, Li Zhao, Guoqing Liu, Jiang Bian, Minlie Huang, Tao Qin, Tie-Yan Liu
IJCAI 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments demonstrate that our method achieves higher sample efficiency compared with existing advantage estimation methods in complex environments. Empirically, we show that our estimated advantage function is closer to the ground-truth advantage function A^π than existing advantage estimation methods such as Monte-Carlo and Generalized Advantage Estimation [Schulman et al., 2015b]. We also test IAE advantage estimation in policy optimization settings on environments with high-dimensional observations, showing that our method outperforms other advantage estimation methods in sample efficiency. Results of our experiments are reported in Section 7. |
| Researcher Affiliation | Collaboration | Tsinghua University; Microsoft Research Asia; University of Science and Technology of China |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper does not provide any statement or link indicating that the source code for the methodology is openly available. |
| Open Datasets | No | The paper mentions 'Finite-state MDPs' and a 'Pixel Grid World' environment, which are custom environments built by the authors, but it does not provide concrete access information (link, DOI, specific repository, or formal citation to an established public dataset) for these environments. |
| Dataset Splits | No | The paper describes 'Finite-state MDP settings' and 'Pixel Grid World environment' but does not specify any training, validation, or test dataset splits in terms of percentages, sample counts, or references to predefined standard splits. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., CPU, GPU models, memory) used to run the experiments. |
| Software Dependencies | No | The paper does not mention any specific software dependencies or their version numbers (e.g., Python, PyTorch, TensorFlow versions, or other libraries). |
| Experiment Setup | Yes | For GAE, we use λ = 0.95. We train tabular reward decomposition for 10000 episodes. In the per-step punishment setting, the agent gets r = -0.03 reward in every step before reaching its goal, r = 1 reward when reaching its goal for the first time, and r = 0 reward for every step after reaching its goal. In the no-punishment setting, the agent gets r = 1 reward when reaching its goal for the first time, and gets r = 0 reward otherwise. (A sketch of this setup appears after the table.) |
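To make the quoted setup concrete, here is a minimal sketch of how the compared baseline estimators would be computed under the per-step punishment reward scheme. It assumes a discount factor γ = 0.99 and placeholder value estimates, neither of which appears in the quoted setup; only λ = 0.95 and the reward values are taken from the paper.

```python
import numpy as np

# Assumed discount factor (not stated in the quoted setup); lambda comes from the paper.
GAMMA = 0.99
LAM = 0.95

def per_step_punishment_reward(reaches_goal_now, reached_goal_before):
    """Reward scheme quoted in the setup: -0.03 per step before the goal,
    +1 on the step the goal is first reached, 0 on every step afterwards."""
    if reached_goal_before:
        return 0.0
    return 1.0 if reaches_goal_now else -0.03

def monte_carlo_advantages(rewards, values, gamma=GAMMA):
    """Monte-Carlo baseline: discounted return-to-go minus the value estimate."""
    advantages = np.zeros(len(rewards))
    ret = 0.0
    for t in reversed(range(len(rewards))):
        ret = rewards[t] + gamma * ret
        advantages[t] = ret - values[t]
    return advantages

def gae_advantages(rewards, values, gamma=GAMMA, lam=LAM):
    """Generalized Advantage Estimation [Schulman et al., 2015b];
    `values` holds len(rewards) + 1 entries (bootstrap value for the final state)."""
    advantages = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

# Toy 5-step episode in which the goal is reached on the final step.
rewards = [per_step_punishment_reward(t == 4, False) for t in range(5)]
values = np.zeros(6)  # placeholder value estimates
print(monte_carlo_advantages(rewards, values[:-1]))
print(gae_advantages(rewards, values))
```

With λ = 0.95, GAE interpolates between the Monte-Carlo estimate (λ = 1) and the one-step TD error (λ = 0), which is the trade-off the paper's comparison against these baselines targets.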