Dual Critic Reinforcement Learning under Partial Observability

Authors: Jinqiu Li, Enmin Zhao, Tong Wei, Junliang Xing, Shiming Xiang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experimental analyses across the Box2D and Box3D environments have verified DCRL's superior performance.
Researcher Affiliation | Academia | Jinqiu Li (1,2), Enmin Zhao (1,2), Tong Wei (3), Junliang Xing (3), Shiming Xiang (1,2); (1) Institute of Automation, Chinese Academy of Sciences; (2) School of Artificial Intelligence, University of Chinese Academy of Sciences; (3) Department of Computer Science and Technology, Tsinghua University
Pseudocode | No | The paper does not include a clearly labeled pseudocode or algorithm block.
Open Source Code | Yes | The source code is available in the supplementary material.
Open Datasets | Yes | Mini Grid [41] is a procedurally generated environment with goal-oriented tasks. ... Mini World [41] is a minimalistic Box3D interior environment simulator consisting of connected rooms with objects inside. (An illustrative environment-setup sketch appears after the table below.)
Dataset Splits | No | The paper evaluates on procedurally generated environments (Mini Grid, Mini World), for which fixed train/validation/test splits are not defined as percentages or counts. Evaluation is based on average returns over training frames rather than held-out data splits.
Hardware Specification | Yes | We run all experiments on a single server with 64 Intel(R) Xeon(R) Gold 5218 CPU processors @ 2.30GHz and 1 Tesla V100 GPU.
Software Dependencies | No | The paper mentions optimizers like RMSprop and Adam but does not provide specific version numbers for software libraries or frameworks used.
Experiment Setup | Yes | The hyperparameters for training each method are summarized in Table 2.

Hyperparameter | Mini Grid (A2C) | Mini Grid (PPO) | Mini World (PPO)
Seeds in experiments | 5 | 5 | 5
Discount factor γ | 0.99 | 0.99 | 0.99
λ for GAE | 1 | 0.95 | 0.95
Rollout steps | 5 | 512 | 512
Number of workers | 16 | 16 | 16
Entropy loss coef | 0.01 | 0.01 | 0.01
Optimizer | RMSprop | Adam | Adam
Learning rate | 1e-3 | 3e-4 | 3e-4
Max grad norm | 0.5 | 0.5 | 0.5
PPO clip range | n/a | 0.2 | 0.2
PPO training epochs | n/a | 4 | 4
PPO mini-batch size | n/a | 512 | 512
Dual updates per iteration | 16 | 4 | 4
Dual training epochs | 4 | 8 | 8
Dual batch size | 640 | 2048 | 2048
β | 0.5 | 0.5 | best chosen from {0.1, 0.5}
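
The Open Datasets row describes the MiniGrid (Box2D-style) and MiniWorld (Box3D) benchmarks. Below is a minimal sketch of how such environments are typically instantiated through Gymnasium; the specific task IDs and the `minigrid`/`miniworld` package choices are assumptions for illustration, not the paper's own setup or task list.

```python
# Illustrative only: instantiating MiniGrid / MiniWorld benchmarks via Gymnasium.
import gymnasium as gym
import minigrid   # noqa: F401  -- importing registers MiniGrid-* task IDs
import miniworld  # noqa: F401  -- importing registers MiniWorld-* task IDs

grid_env = gym.make("MiniGrid-DoorKey-8x8-v0")   # hypothetical partially observable gridworld task
world_env = gym.make("MiniWorld-OneRoom-v0")     # hypothetical Box3D interior task

obs, info = grid_env.reset(seed=0)
for _ in range(5):
    action = grid_env.action_space.sample()
    obs, reward, terminated, truncated, info = grid_env.step(action)
    if terminated or truncated:
        obs, info = grid_env.reset()
```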
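
The Table 2 settings can also be collected into plain configuration dictionaries, as in the sketch below. The key names and structure are my own illustrative choices and do not reflect how the authors' supplementary code organizes its configuration.

```python
# A minimal sketch of the Table 2 hyperparameters as config dicts (key names assumed).
PPO_MINIGRID = {
    "algorithm": "PPO",
    "seeds": 5,
    "gamma": 0.99,                    # discount factor
    "gae_lambda": 0.95,               # λ for GAE
    "rollout_steps": 512,
    "num_workers": 16,
    "entropy_coef": 0.01,
    "optimizer": "Adam",
    "learning_rate": 3e-4,
    "max_grad_norm": 0.5,
    "ppo_clip_range": 0.2,
    "ppo_training_epochs": 4,
    "ppo_minibatch_size": 512,
    "dual_updates_per_iteration": 4,
    "dual_training_epochs": 8,
    "dual_batch_size": 2048,
    "beta": 0.5,
}

A2C_MINIGRID = {
    "algorithm": "A2C",
    "seeds": 5,
    "gamma": 0.99,
    "gae_lambda": 1.0,
    "rollout_steps": 5,
    "num_workers": 16,
    "entropy_coef": 0.01,
    "optimizer": "RMSprop",
    "learning_rate": 1e-3,
    "max_grad_norm": 0.5,
    "dual_updates_per_iteration": 16,
    "dual_training_epochs": 4,
    "dual_batch_size": 640,
    "beta": 0.5,
}

# MiniWorld reuses the PPO settings, with beta chosen as the best of {0.1, 0.5}.
PPO_MINIWORLD = {**PPO_MINIGRID, "beta": "best of {0.1, 0.5}"}
```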