Natural Actor-Critic for Robust Reinforcement Learning with Function Approximation

Authors: Ruida Zhou, Tao Liu, Min Cheng, Dileep Kalathil, P. R. Kumar, Chao Tian

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We implement the proposed RNAC in multiple MuJoCo environments (Hopper-v3, Walker2d-v3, and HalfCheetah-v3), and demonstrate that RNAC with the proposed uncertainty sets results in robust behavior, while canonical policy-based approaches suffer significant performance degradation. We also test RNAC on TurtleBot [5], a real-world mobile robot, performing a navigation task.
Researcher Affiliation | Academia | Ruida Zhou (Texas A&M University, ruida@tamu.edu); Tao Liu (Texas A&M University, tliu@tamu.edu); Min Cheng (Texas A&M University, minrara0404@tamu.edu); Dileep Kalathil (Texas A&M University, dileep.kalathil@tamu.edu); P. R. Kumar (Texas A&M University, prk@tamu.edu); Chao Tian (Texas A&M University, chao.tian@tamu.edu)
Pseudocode | Yes | Algorithm 1: Robust Natural Actor-Critic; Algorithm 2: Robust Linear Temporal Difference (RLTD); Algorithm 3: Robust Q-Natural Policy Gradient (RQNPG)
Open Source Code | Yes | A video of the demonstration on TurtleBot is available at [Video Link], and the RNAC code is provided in the supplementary material.
Open Datasets | No | The paper mentions training in "multiple MuJoCo environments (Hopper-v3, Walker2d-v3, and HalfCheetah-v3)" and on a "real-world TurtleBot navigation task." These are simulation environments and a physical robot, not static, publicly available datasets with explicit access information such as a link, DOI, or formal citation for data files.
Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits. It describes its training data as "on-policy trajectory data" or a "batch of on-policy data" and reports parameters such as "maximum training steps" and "batch size", but gives no percentages or counts for data partitioning.
Hardware Specification | Yes | All experimental results are carried out on a Linux server with 48-core RTX 6000 GPUs, 48-core Intel Xeon 6248R CPUs, and 384 GB DDR4 RAM.
Software Dependencies | No | The paper mentions using ADAM [24] as an optimizer and a neural Gaussian policy [48] but does not provide specific version numbers for these or other key software components (e.g., Python, PyTorch, TensorFlow).
Experiment Setup | Yes | We use the same hyperparameters across different MuJoCo environments. Specifically, we select γ = 0.99 for the discount factor, ηₜ = αₜ = 3 × 10⁻⁴ for the learning rates of both the actor and critic updates implemented with ADAM [24], T = 3 × 10⁶ for the maximum number of training steps, and B = 2048 for the batch size.
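
To make the reported setup concrete, the following is a minimal sketch, not the authors' released code, of how the stated hyperparameters (γ = 0.99, learning rates 3 × 10⁻⁴, T = 3 × 10⁶ steps, batch size 2048) could be wired into an on-policy actor-critic loop in PyTorch with Gym's MuJoCo environments. The network sizes, rollout handling, and the plain TD/policy-gradient losses are assumptions for illustration only; the robust parts of RNAC (the RLTD critic target over the uncertainty set and the RQNPG natural-gradient actor step) are only marked by comments where they would replace the placeholders.

```python
# Minimal sketch (assumption, not the authors' released code) of an on-policy
# actor-critic loop using the hyperparameters reported in the paper. The
# robust RNAC updates (RLTD critic target, RQNPG actor step) are NOT
# reproduced; standard placeholders mark where they would go.
import gym
import torch
import torch.nn as nn

GAMMA = 0.99            # discount factor gamma
LR_ACTOR = 3e-4         # eta_t, actor step size (ADAM)
LR_CRITIC = 3e-4        # alpha_t, critic step size (ADAM)
MAX_STEPS = 3_000_000   # T, maximum training steps
BATCH_SIZE = 2048       # B, on-policy batch size

env = gym.make("Hopper-v3")  # same settings reused for Walker2d-v3 / HalfCheetah-v3
obs_dim = env.observation_space.shape[0]
act_dim = env.action_space.shape[0]

# Neural Gaussian policy (mean network + learned log-std); sizes are assumed.
policy_mean = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
log_std = nn.Parameter(torch.zeros(act_dim))
critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))

actor_opt = torch.optim.Adam(list(policy_mean.parameters()) + [log_std], lr=LR_ACTOR)
critic_opt = torch.optim.Adam(critic.parameters(), lr=LR_CRITIC)

def sample_action(obs):
    """Sample an action and its log-probability from the Gaussian policy."""
    dist = torch.distributions.Normal(policy_mean(obs), log_std.exp())
    action = dist.sample()
    return action, dist.log_prob(action).sum(-1)

steps = 0
obs = env.reset()
while steps < MAX_STEPS:
    # Collect one on-policy batch of B transitions (terminal masking omitted for brevity).
    obs_buf, rew_buf, next_obs_buf, logp_buf = [], [], [], []
    for _ in range(BATCH_SIZE):
        obs_t = torch.as_tensor(obs, dtype=torch.float32)
        action, logp = sample_action(obs_t)
        next_obs, reward, done, _ = env.step(action.numpy())
        obs_buf.append(obs_t)
        rew_buf.append(reward)
        next_obs_buf.append(torch.as_tensor(next_obs, dtype=torch.float32))
        logp_buf.append(logp)
        obs = env.reset() if done else next_obs
        steps += 1

    obs_b = torch.stack(obs_buf)
    next_b = torch.stack(next_obs_buf)
    rew_b = torch.tensor(rew_buf, dtype=torch.float32)
    logp_b = torch.stack(logp_buf)

    # Critic step: placeholder one-step TD target; RLTD (Algorithm 2) would
    # instead use the worst-case Bellman target over the chosen uncertainty set.
    with torch.no_grad():
        target = rew_b + GAMMA * critic(next_b).squeeze(-1)
    critic_loss = ((critic(obs_b).squeeze(-1) - target) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor step: placeholder vanilla policy-gradient surrogate; RQNPG
    # (Algorithm 3) would instead take a natural-gradient step using the
    # robust Q/advantage estimates.
    with torch.no_grad():
        adv = target - critic(obs_b).squeeze(-1)
    actor_loss = -(logp_b * adv).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```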