Dual Critic Reinforcement Learning under Partial Observability
Authors: Jinqiu Li, Enmin Zhao, Tong Wei, Junliang Xing, Shiming Xiang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experimental analyses across the Box2D and Box3D environments have verified DCRL's superior performance. |
| Researcher Affiliation | Academia | Jinqiu Li (1,2), Enmin Zhao (1,2), Tong Wei (3), Junliang Xing (3), Shiming Xiang (1,2). Affiliations: (1) Institute of Automation, Chinese Academy of Sciences; (2) School of Artificial Intelligence, University of Chinese Academy of Sciences; (3) Department of Computer Science and Technology, Tsinghua University |
| Pseudocode | No | The paper does not include a clearly labeled pseudocode or algorithm block. |
| Open Source Code | Yes | The source code is available in the supplementary material. |
| Open Datasets | Yes | Mini Grid [41] is a procedurally generated environment with goal-oriented tasks. ... Mini World [41] is a minimalistic Box3D interior environment simulator consisting of connected rooms with objects inside. (See the environment-setup sketch after the hyperparameter table below.) |
| Dataset Splits | No | The paper evaluates on the procedurally generated Mini Grid and Mini World environments, where fixed train/validation/test splits are not defined as percentages or counts; evaluation is based on average returns over training frames rather than on held-out data splits. |
| Hardware Specification | Yes | We run all experiments on a single server with 64 Intel(R) Xeon(R) Gold 5218 CPU processors @ 2.30GHz and 1 Tesla V100 GPU. |
| Software Dependencies | No | The paper mentions optimizers like RMSprop and Adam but does not provide specific version numbers for software libraries or frameworks used. |
| Experiment Setup | Yes | The hyperparameters for training each method are summarized in Table 2, reconstructed below; a config-style encoding of one column follows the table. |

Table 2: Hyperparameters for training each method.

| Hyperparameter | Mini Grid (A2C) | Mini Grid (PPO) | Mini World (PPO) |
|---|---|---|---|
| Seeds in experiments | 5 | 5 | 5 |
| Discount factor γ | 0.99 | 0.99 | 0.99 |
| λ for GAE | 1 | 0.95 | 0.95 |
| Rollout steps | 5 | 512 | 512 |
| Number of workers | 16 | 16 | 16 |
| Entropy loss coef | 0.01 | 0.01 | 0.01 |
| Optimizer | RMSprop | Adam | Adam |
| Learning rate | 1e-3 | 3e-4 | 3e-4 |
| Max grad norm | 0.5 | 0.5 | 0.5 |
| PPO clip range | n/a | 0.2 | 0.2 |
| PPO training epochs | n/a | 4 | 4 |
| PPO mini-batch size | n/a | 512 | 512 |
| Dual updates per iteration | 16 | 4 | 4 |
| Dual training epochs | 4 | 8 | 8 |
| Dual batch size | 640 | 2048 | 2048 |
| β | 0.5 | 0.5 | Best chosen from {0.1, 0.5} |
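To make the Mini World / PPO column of Table 2 concrete, here is a minimal sketch that encodes it as a plain Python config. The dictionary and its key names are illustrative assumptions for readability, not taken from the paper's released code, which may organize these settings differently.

```python
# Illustrative encoding of the Mini World / PPO column of Table 2.
# All key names are hypothetical; values come from the table above.
MINIWORLD_PPO_CONFIG = {
    "algorithm": "PPO",
    "seeds": 5,                    # independent runs per experiment
    "gamma": 0.99,                 # discount factor
    "gae_lambda": 0.95,            # λ for generalized advantage estimation
    "rollout_steps": 512,
    "num_workers": 16,
    "entropy_coef": 0.01,
    "optimizer": "Adam",
    "learning_rate": 3e-4,
    "max_grad_norm": 0.5,
    "ppo_clip_range": 0.2,
    "ppo_epochs": 4,
    "ppo_minibatch_size": 512,
    # DCRL-specific dual-critic settings
    "dual_updates_per_iter": 4,
    "dual_epochs": 8,
    "dual_batch_size": 2048,
    "beta": 0.5,                   # paper reports best of {0.1, 0.5} for Mini World
}
```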
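For the Open Datasets row, the sketch below shows how the cited benchmark suites can be instantiated through the Farama Gymnasium registry. The specific environment IDs are illustrative defaults from the `minigrid` and `miniworld` packages, not necessarily the exact tasks used in the paper.

```python
# Minimal sketch: create one MiniGrid and one MiniWorld environment.
# Environment IDs are illustrative package defaults, not the paper's task list.
import gymnasium as gym
import minigrid    # noqa: F401  (registers MiniGrid-* envs on import)
import miniworld   # noqa: F401  (registers MiniWorld-* envs on import)

grid_env = gym.make("MiniGrid-Empty-5x5-v0")
obs, info = grid_env.reset(seed=0)
print(grid_env.observation_space)   # Dict with 'direction', 'image', 'mission'
grid_env.close()

world_env = gym.make("MiniWorld-Hallway-v0")
obs, info = world_env.reset(seed=0)
print(world_env.observation_space)  # Box RGB image observation
world_env.close()
```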