Hindsight Value Function for Variance Reduction in Stochastic Dynamic Environment
Authors: Jiaming Guo, Rui Zhang, Xishan Zhang, Shaohui Peng, Qi Yi, Zidong Du, Xing Hu, Qi Guo, Yunji Chen
IJCAI 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments, we apply the proposed hindsight value function in stochastic dynamic environments, including discrete-action environments and continuous-action environments. |
| Researcher Affiliation | Collaboration | SKL of Computer Architecture, Institute of Computing Technology, CAS, Beijing, China; Cambricon Technologies; University of Chinese Academy of Sciences, China; University of Science and Technology of China; CAS Center for Excellence in Brain Science and Intelligence Technology, CEBSIT |
| Pseudocode | Yes | Algorithm 1 Learning hindsight vector |
| Open Source Code | No | The paper does not provide any concrete access information (e.g., specific repository link, explicit statement of code release) for the source code of the methodology. |
| Open Datasets | Yes | We evaluate the hindsight value function for the continuous-action environment using the MuJoCo robotic simulations in OpenAI Gym [Brockman et al., 2016]. We start from two 8×8 versions of the grid-world environment. |
| Dataset Splits | No | The paper mentions general experimental settings such as episode length and discount factor, but does not provide specific training/validation/test dataset splits (e.g., percentages, sample counts, or citations to predefined splits). |
| Hardware Specification | No | The paper does not provide any specific hardware details (e.g., exact GPU/CPU models, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions software components such as A2C, PPO, LSTM, MuJoCo, and OpenAI Gym, but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | Settings. We set the max length of an episode as 20 for both environments. The discount factor γ is set as 0.99. And we perform the advantage actor-critic (A2C) algorithm on these environments. The actor and critic are implemented as two independent neural networks that are composed of several fully-connected layers. For the hindsight value function, we directly replace the critic with the architecture in Figure 1. Note that for estimating a single value, the new critic maintains the same network architecture but gets an additional input of the LSTM hidden state; thus, we exclude the influence of the capacity of the neural networks. We run all these trials with three random seeds. (A minimal critic sketch follows the table.) |
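
The "Experiment Setup" row states that the hindsight critic keeps a plain fully-connected value network but receives the hidden state of an LSTM as an extra input. Below is a minimal sketch of that idea, not the authors' code: the layer widths, the LSTM input (future states of the trajectory), the `hindsight_dim` size, and the PyTorch framing are all assumptions made for illustration.

```python
# Sketch (assumptions, not the paper's implementation): an A2C-style critic
# whose fully-connected value head takes the state concatenated with an
# LSTM hidden state acting as the hindsight vector.
import torch
import torch.nn as nn


class HindsightCritic(nn.Module):
    """Estimates V(s_t) conditioned on a hindsight vector from an LSTM."""

    def __init__(self, state_dim: int, hindsight_dim: int = 64, hidden_dim: int = 128):
        super().__init__()
        # LSTM that summarizes a trajectory segment into a hindsight vector
        # (hypothetical choice: it reads future states of the episode).
        self.lstm = nn.LSTM(input_size=state_dim, hidden_size=hindsight_dim,
                            batch_first=True)
        # Same kind of fully-connected value head as a plain critic, but its
        # input is the state concatenated with the LSTM hidden state.
        self.value_head = nn.Sequential(
            nn.Linear(state_dim + hindsight_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, state: torch.Tensor, future_states: torch.Tensor) -> torch.Tensor:
        # future_states: (batch, T, state_dim) segment observed in hindsight.
        _, (h_n, _) = self.lstm(future_states)
        hindsight = h_n[-1]  # (batch, hindsight_dim)
        return self.value_head(torch.cat([state, hindsight], dim=-1))


# Usage sketch: value targets would be regressed as in standard A2C
# (discount factor 0.99, episodes truncated at length 20 per the setup row),
# with the hindsight vector available only during training because it uses
# future information.
state = torch.randn(4, 10)
future = torch.randn(4, 20, 10)
values = HindsightCritic(state_dim=10)(state, future)  # shape (4, 1)
```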