Dual Critic Reinforcement Learning under Partial Observability

Authors: Jinqiu Li, Enmin Zhao, Tong Wei, Junliang Xing, Shiming Xiang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experimental analyses across the Box2D and Box3D environments have verified DCRL's superior performance.
Researcher Affiliation | Academia | Jinqiu Li (1,2), Enmin Zhao (1,2), Tong Wei (3), Junliang Xing (3), Shiming Xiang (1,2); (1) Institute of Automation, Chinese Academy of Sciences; (2) School of Artificial Intelligence, University of Chinese Academy of Sciences; (3) Department of Computer Science and Technology, Tsinghua University
Pseudocode | No | The paper does not include a clearly labeled pseudocode or algorithm block.
Open Source Code | Yes | The source code is available in the supplementary material.
Open Datasets | Yes | Mini Grid [41] is a procedurally generated environment with goal-oriented tasks. ... Mini World [41] is a minimalistic Box3D interior environment simulator consisting of connected rooms with objects inside. (An illustrative environment-setup sketch appears after the table below.)
Dataset Splits | No | The paper evaluates on procedurally generated environments (Mini Grid, Mini World), for which fixed train/validation/test splits are not defined as percentages or counts. Evaluation is based on average returns over training frames rather than held-out data splits.
Hardware Specification | Yes | We run all experiments on a single server with 64 Intel(R) Xeon(R) Gold 5218 CPU processors @ 2.30GHz and 1 Tesla V100 GPU.
Software Dependencies | No | The paper mentions optimizers like RMSprop and Adam but does not provide specific version numbers for software libraries or frameworks used.
Experiment Setup | Yes | The hyperparameters for training each method are summarized in Table 2.

Hyperparameter | Mini Grid (A2C) | Mini Grid (PPO) | Mini World (PPO)
Seeds in experiments | 5 | 5 | 5
Discount factor γ | 0.99 | 0.99 | 0.99
λ for GAE | 1 | 0.95 | 0.95
Rollout steps | 5 | 512 | 512
Number of workers | 16 | 16 | 16
Entropy loss coef | 0.01 | 0.01 | 0.01
Optimizer | RMSprop | Adam | Adam
Learning rate | 1e-3 | 3e-4 | 3e-4
Max grad norm | 0.5 | 0.5 | 0.5
PPO clip range | n/a | 0.2 | 0.2
PPO training epochs | n/a | 4 | 4
PPO mini-batch size | n/a | 512 | 512
Dual updates per iteration | 16 | 4 | 4
Dual training epochs | 4 | 8 | 8
Dual batch size | 640 | 2048 | 2048
β | 0.5 | 0.5 | best chosen from {0.1, 0.5}
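
The Open Datasets row describes the MiniGrid (Box2D-style) and MiniWorld (Box3D) benchmarks. Below is a minimal sketch of how such environments are typically instantiated through Gymnasium; the specific task IDs and the `minigrid`/`miniworld` package choices are assumptions for illustration, not the paper's own setup or task list.

```python
# Illustrative only: instantiating MiniGrid / MiniWorld benchmarks via Gymnasium.
import gymnasium as gym
import minigrid   # noqa: F401  -- importing registers MiniGrid-* task IDs
import miniworld  # noqa: F401  -- importing registers MiniWorld-* task IDs

grid_env = gym.make("MiniGrid-DoorKey-8x8-v0")   # hypothetical partially observable gridworld task
world_env = gym.make("MiniWorld-OneRoom-v0")     # hypothetical Box3D interior task

obs, info = grid_env.reset(seed=0)
for _ in range(5):
    action = grid_env.action_space.sample()
    obs, reward, terminated, truncated, info = grid_env.step(action)
    if terminated or truncated:
        obs, info = grid_env.reset()
```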
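
The Table 2 settings can also be collected into plain configuration dictionaries, as in the sketch below. The key names and structure are my own illustrative choices and do not reflect how the authors' supplementary code organizes its configuration.

```python
# A minimal sketch of the Table 2 hyperparameters as config dicts (key names assumed).
PPO_MINIGRID = {
    "algorithm": "PPO",
    "seeds": 5,
    "gamma": 0.99,                    # discount factor
    "gae_lambda": 0.95,               # λ for GAE
    "rollout_steps": 512,
    "num_workers": 16,
    "entropy_coef": 0.01,
    "optimizer": "Adam",
    "learning_rate": 3e-4,
    "max_grad_norm": 0.5,
    "ppo_clip_range": 0.2,
    "ppo_training_epochs": 4,
    "ppo_minibatch_size": 512,
    "dual_updates_per_iteration": 4,
    "dual_training_epochs": 8,
    "dual_batch_size": 2048,
    "beta": 0.5,
}

A2C_MINIGRID = {
    "algorithm": "A2C",
    "seeds": 5,
    "gamma": 0.99,
    "gae_lambda": 1.0,
    "rollout_steps": 5,
    "num_workers": 16,
    "entropy_coef": 0.01,
    "optimizer": "RMSprop",
    "learning_rate": 1e-3,
    "max_grad_norm": 0.5,
    "dual_updates_per_iteration": 16,
    "dual_training_epochs": 4,
    "dual_batch_size": 640,
    "beta": 0.5,
}

# MiniWorld reuses the PPO settings, with beta chosen as the best of {0.1, 0.5}.
PPO_MINIWORLD = {**PPO_MINIGRID, "beta": "best of {0.1, 0.5}"}
```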