Belief Projection-Based Reinforcement Learning for Environments with Delayed Feedback
Authors: Jangwon Kim, Hangyeol Kim, Jiwook Kang, Jongchan Baek, Soohee Han
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We compared the performance of the BPQL algorithm with the following three baselines: ... Figure 2: Performance curves of each algorithm for the Walker2d-v3 task. ... Table 1: Results of MuJoCo benchmark tasks for one million interactions. Each task was evaluated in the delayed environment setting for 3, 6, and 9 delayed timesteps d. ... We evaluated BPQL and other baselines on the noisy version of the InvertedPendulum-v2 environment... |
| Researcher Affiliation | Collaboration | Jangwon Kim¹ (jangwonkim@postech.ac.kr), Hangyeol Kim² (hangyeol.kim@koreaaero.com), Jiwook Kang² (jiwook.kang@koreaaero.com), Jongchan Baek³ (jcbaek@etri.re.kr), Soohee Han¹ (soohee.han@postech.ac.kr). ¹Computational Control Engineering Lab., Pohang University of Science and Technology; ²Korea Aerospace Industries, Ltd.; ³Electronics and Telecommunications Research Institute. |
| Pseudocode | Yes | Algorithm 1 Belief-Projection-Based Q-learning (BPQL) |
| Open Source Code | No | The paper does not contain any explicit statements or links indicating that the source code for the proposed methodology (BPQL) is openly available. |
| Open Datasets | Yes | We tested the algorithms on several tasks using the MuJoCo benchmark [31] and evaluated their performances in environments with different numbers of delayed timesteps. Figure 2 shows that the augmented and model-based approaches are inappropriate for environments in which the delayed timestep is large, whereas the proposed BPQL algorithm exhibits significantly better performance in a long-delayed environment. ... We conducted additional experiments on the classical discrete control OpenAI gym [7] tasks: CartPole-v1 and LunarLander-v2. (A delayed-observation wrapper sketch for this setup follows the table.) |
| Dataset Splits | No | The paper describes training interaction steps (e.g., '1 million interactions') but does not specify explicit training/validation/test dataset splits with percentages, sample counts, or references to predefined splits. |
| Hardware Specification | No | The paper does not specify any particular hardware components such as GPU models, CPU models, or memory specifications used for running the experiments. It only implies that experiments were conducted. |
| Software Dependencies | No | The paper mentions several algorithms and tools like 'Adam optimizer' and 'OpenAI gym', but does not provide specific version numbers for any software libraries, frameworks, or dependencies used in the implementation or experiments. |
| Experiment Setup | Yes | Table 3: Hyperparameters for BPQL and the baselines. Critic network: 256, 256; Policy network: 256, 256; Discount factor: 0.99; Replay memory size: 1 M; Minibatch size: 256; Learning rate: 0.0003; Target entropy: -dim\|A\|; Target smoothing coefficient: 0.995; Optimizer: Adam. (A configuration sketch of these values follows the table.) |
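
The Open Datasets row notes that each MuJoCo task was evaluated with 3, 6, and 9 delayed observation timesteps. The paper excerpt does not include wrapper code, so the following is a minimal sketch of how an observation-delay environment is commonly built on the classic OpenAI Gym API; the class name, the buffer convention used before `delay` real observations are available, and the Walker2d-v3 usage comment are illustrative assumptions, not the authors' implementation.

```python
from collections import deque

import gym


class ObservationDelayWrapper(gym.Wrapper):
    """Return the observation from `delay` timesteps in the past.

    Before `delay` real observations have accumulated, the reset
    observation is repeated (an assumed convention; the paper may differ).
    """

    def __init__(self, env, delay):
        super().__init__(env)
        self.delay = delay
        self._buffer = deque()

    def reset(self, **kwargs):
        obs = self.env.reset(**kwargs)
        # Pre-fill so the first `delay` steps return the reset observation.
        self._buffer = deque([obs] * self.delay)
        return obs

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self._buffer.append(obs)              # newest observation enters the queue
        delayed_obs = self._buffer.popleft()  # observation from `delay` steps ago leaves
        return delayed_obs, reward, done, info


# Usage (illustrative): evaluate with d = 3, 6, or 9 delayed timesteps.
# env = ObservationDelayWrapper(gym.make("Walker2d-v3"), delay=9)
```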
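
The Experiment Setup row lists the hyperparameters quoted from Table 3. Below is a minimal sketch of how those values could be collected into a single configuration object; the dataclass, its field names, and the `target_entropy` helper are illustrative assumptions, with only the numerical values taken from the paper's table.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class BPQLConfig:
    # Values copied from the Table 3 excerpt; field names are illustrative.
    critic_hidden: Tuple[int, int] = (256, 256)   # Critic network layer sizes
    policy_hidden: Tuple[int, int] = (256, 256)   # Policy network layer sizes
    discount_factor: float = 0.99                 # Discount factor
    replay_memory_size: int = 1_000_000           # Replay memory size (1 M)
    minibatch_size: int = 256                     # Minibatch size
    learning_rate: float = 3e-4                   # Learning rate (Adam optimizer)
    target_smoothing_coef: float = 0.995          # Target smoothing coefficient

    @staticmethod
    def target_entropy(action_dim: int) -> float:
        # Target entropy of -dim|A|, as listed in the hyperparameter table.
        return -float(action_dim)
```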