Belief Projection-Based Reinforcement Learning for Environments with Delayed Feedback
Authors: Jangwon Kim, Hangyeol Kim, Jiwook Kang, Jongchan Baek, Soohee Han
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We compared the performance of the BPQL algorithm with the following three baselines: ... Figure 2: Performance curves of each algorithm for the Walker2d-v3 task. ... Table 1: Results of MuJoCo benchmark tasks for one million interactions. Each task was evaluated in the delayed environment setting for 3, 6, and 9 delayed timesteps d. ... We evaluated BPQL and other baselines on the noisy version of the InvertedPendulum-v2 environment... |
| Researcher Affiliation | Collaboration | Jangwon Kim¹ (jangwonkim@postech.ac.kr), Hangyeol Kim² (hangyeol.kim@koreaaero.com), Jiwook Kang² (jiwook.kang@koreaaero.com), Jongchan Baek³ (jcbaek@etri.re.kr), Soohee Han¹ (soohee.han@postech.ac.kr). ¹Computational Control Engineering Lab., Pohang University of Science and Technology; ²Korea Aerospace Industries, Ltd.; ³Electronics and Telecommunications Research Institute. |
| Pseudocode | Yes | Algorithm 1 Belief-Projection-Based Q-learning (BPQL) |
| Open Source Code | No | The paper does not contain any explicit statements or links indicating that the source code for the proposed methodology (BPQL) is openly available. |
| Open Datasets | Yes | We tested the algorithms on several tasks using the MuJoCo benchmark [31] and evaluated their performances in environments with different numbers of delayed timesteps. Figure 2 shows that the augmented and model-based approaches are inappropriate for environments in which the delayed timestep is large, whereas the proposed BPQL algorithm exhibits significantly better performance in a long-delayed environment. ... We conducted additional experiments on the classical discrete control OpenAI gym [7] tasks: CartPole-v1 and LunarLander-v2. (A delayed-observation wrapper sketch for this setup follows the table.) |
| Dataset Splits | No | The paper describes training interaction steps (e.g., '1 million interactions') but does not specify explicit training/validation/test dataset splits with percentages, sample counts, or references to predefined splits. |
| Hardware Specification | No | The paper does not specify any particular hardware components such as GPU models, CPU models, or memory specifications used for running the experiments. It only implies that experiments were conducted. |
| Software Dependencies | No | The paper mentions several algorithms and tools like 'Adam optimizer' and 'OpenAI gym', but does not provide specific version numbers for any software libraries, frameworks, or dependencies used in the implementation or experiments. |
| Experiment Setup | Yes | Table 3: Hyperparameters for BPQL and the baselines. Critic network: 256, 256; Policy network: 256, 256; Discount factor: 0.99; Replay memory size: 1 M; Minibatch size: 256; Learning rate: 0.0003; Target entropy: -dim\|A\|; Target smoothing coefficient: 0.995; Optimizer: Adam. (A configuration sketch of these values follows the table.) |
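
The Open Datasets row notes that each MuJoCo task was evaluated with 3, 6, and 9 delayed observation timesteps. The paper excerpt does not include wrapper code, so the following is a minimal sketch of how an observation-delay environment is commonly built on the classic OpenAI Gym API; the class name, the buffer convention used before `delay` real observations are available, and the Walker2d-v3 usage comment are illustrative assumptions, not the authors' implementation.

```python
from collections import deque

import gym


class ObservationDelayWrapper(gym.Wrapper):
    """Return the observation from `delay` timesteps in the past.

    Before `delay` real observations have accumulated, the reset
    observation is repeated (an assumed convention; the paper may differ).
    """

    def __init__(self, env, delay):
        super().__init__(env)
        self.delay = delay
        self._buffer = deque()

    def reset(self, **kwargs):
        obs = self.env.reset(**kwargs)
        # Pre-fill so the first `delay` steps return the reset observation.
        self._buffer = deque([obs] * self.delay)
        return obs

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self._buffer.append(obs)              # newest observation enters the queue
        delayed_obs = self._buffer.popleft()  # observation from `delay` steps ago leaves
        return delayed_obs, reward, done, info


# Usage (illustrative): evaluate with d = 3, 6, or 9 delayed timesteps.
# env = ObservationDelayWrapper(gym.make("Walker2d-v3"), delay=9)
```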
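
The Experiment Setup row lists the hyperparameters quoted from Table 3. Below is a minimal sketch of how those values could be collected into a single configuration object; the dataclass, its field names, and the `target_entropy` helper are illustrative assumptions, with only the numerical values taken from the paper's table.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class BPQLConfig:
    # Values copied from the Table 3 excerpt; field names are illustrative.
    critic_hidden: Tuple[int, int] = (256, 256)   # Critic network layer sizes
    policy_hidden: Tuple[int, int] = (256, 256)   # Policy network layer sizes
    discount_factor: float = 0.99                 # Discount factor
    replay_memory_size: int = 1_000_000           # Replay memory size (1 M)
    minibatch_size: int = 256                     # Minibatch size
    learning_rate: float = 3e-4                   # Learning rate (Adam optimizer)
    target_smoothing_coef: float = 0.995          # Target smoothing coefficient

    @staticmethod
    def target_entropy(action_dim: int) -> float:
        # Target entropy of -dim|A|, as listed in the hyperparameter table.
        return -float(action_dim)
```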