Hindsight Value Function for Variance Reduction in Stochastic Dynamic Environment
Authors: Jiaming Guo, Rui Zhang, Xishan Zhang, Shaohui Peng, Qi Yi, Zidong Du, Xing Hu, Qi Guo, Yunji Chen
IJCAI 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments, we apply the proposed hindsight value function in stochastic dynamic environments, including discrete-action environments and continuous-action environments. |
| Researcher Affiliation | Collaboration | SKL of Computer Architecture, Institute of Computing Technology, CAS, Beijing, China; Cambricon Technologies; University of Chinese Academy of Sciences, China; University of Science and Technology of China; CAS Center for Excellence in Brain Science and Intelligence Technology, CEBSIT |
| Pseudocode | Yes | Algorithm 1 Learning hindsight vector |
| Open Source Code | No | The paper does not provide any concrete access information (e.g., specific repository link, explicit statement of code release) for the source code of the methodology. |
| Open Datasets | Yes | We evaluate the hindsight value function for the continuous-action environment using the MuJoCo robotic simulations in OpenAI Gym [Brockman et al., 2016]. We start from two 8×8 versions of the grid-world environment. |
| Dataset Splits | No | The paper mentions general experimental settings such as episode length and discount factor, but does not provide specific training/validation/test dataset splits (e.g., percentages, sample counts, or citations to predefined splits). |
| Hardware Specification | No | The paper does not provide any specific hardware details (e.g., exact GPU/CPU models, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions software components such as A2C, PPO, LSTM, MuJoCo, and OpenAI Gym, but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | Settings. We set the max length of an episode as 20 for both environments. The discount factor γ is set as 0.99. And we perform the advantage actor-critic (A2C) algorithm on these environments. The actor and critic are implemented as two independent neural networks that are composed of several fully-connected layers. For the hindsight value function, we directly replace the critic with the architecture in Figure 1. Note that for estimating a single value, the new critic maintains the same network architecture but gets an additional input of the LSTM hidden state; thus, we exclude the influence of the capacity of the neural networks. We run all these trials with three random seeds. (A minimal critic sketch follows the table.) |
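
The "Experiment Setup" row states that the hindsight critic keeps a plain fully-connected value network but receives the hidden state of an LSTM as an extra input. Below is a minimal sketch of that idea, not the authors' code: the layer widths, the LSTM input (future states of the trajectory), the `hindsight_dim` size, and the PyTorch framing are all assumptions made for illustration.

```python
# Sketch (assumptions, not the paper's implementation): an A2C-style critic
# whose fully-connected value head takes the state concatenated with an
# LSTM hidden state acting as the hindsight vector.
import torch
import torch.nn as nn


class HindsightCritic(nn.Module):
    """Estimates V(s_t) conditioned on a hindsight vector from an LSTM."""

    def __init__(self, state_dim: int, hindsight_dim: int = 64, hidden_dim: int = 128):
        super().__init__()
        # LSTM that summarizes a trajectory segment into a hindsight vector
        # (hypothetical choice: it reads future states of the episode).
        self.lstm = nn.LSTM(input_size=state_dim, hidden_size=hindsight_dim,
                            batch_first=True)
        # Same kind of fully-connected value head as a plain critic, but its
        # input is the state concatenated with the LSTM hidden state.
        self.value_head = nn.Sequential(
            nn.Linear(state_dim + hindsight_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, state: torch.Tensor, future_states: torch.Tensor) -> torch.Tensor:
        # future_states: (batch, T, state_dim) segment observed in hindsight.
        _, (h_n, _) = self.lstm(future_states)
        hindsight = h_n[-1]  # (batch, hindsight_dim)
        return self.value_head(torch.cat([state, hindsight], dim=-1))


# Usage sketch: value targets would be regressed as in standard A2C
# (discount factor 0.99, episodes truncated at length 20 per the setup row),
# with the hindsight vector available only during training because it uses
# future information.
state = torch.randn(4, 10)
future = torch.randn(4, 20, 10)
values = HindsightCritic(state_dim=10)(state, future)  # shape (4, 1)
```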