FlowPG: Action-constrained Policy Gradient with Normalizing Flows

Authors: Janaka Brahmanage, Jiajing Ling, Akshat Kumar

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, our approach results in significantly fewer constraint violations (up to an order of magnitude for several instances) and is multiple times faster on a variety of continuous control tasks.
Researcher Affiliation | Academia | Janaka Chathuranga Brahmanage, Jiajing Ling, Akshat Kumar; School of Computing and Information Systems, Singapore Management University
Pseudocode | Yes | Algorithm 1: FlowPG Algorithm
Open Source Code | Yes | The source code of our implementation is publicly available at https://github.com/rlr-smu/flow-pg
Open Datasets | Yes | We evaluate our proposed approach on four continuous control RL tasks in the MuJoCo environment [31] and one resource allocation problem that has been used in previous works [4, 18].
Dataset Splits | No | The paper describes experiments in reinforcement learning environments and does not mention explicit training/validation/test dataset splits in terms of percentages or sample counts.
Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments (e.g., GPU models, CPU types).
Software Dependencies | No | The paper mentions software components such as the Adam optimizer, PyTorch, and TensorFlow but does not provide specific version numbers for these dependencies.
Experiment Setup | Yes | We apply batch gradient descent to train the conditional flow with the Adam optimizer and a batch size of 5000. For the Reacher, HalfCheetah, Hopper, and Walker2d environments, we train the model for 5K epochs. For the BSS environment, we train for 20K epochs. Further details about training the flow, such as learning rates and the neural network architecture of the model, are provided in Appendix B... The Adam [14] optimizer was used to train both Actor and Critic networks, with learning rates of 10^-4 and 10^-3 respectively. For the soft target update, we used τ = 0.001. We used mini-batches of size 64 to train both Actor and Critic networks. We used Gaussian noise with µ = 0 and σ = 0.1 as action noise for exploration. The replay buffer size was 10^6.
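The quoted setup combines a pre-training phase for the conditional flow with a standard off-policy actor-critic (DDPG-style) configuration. The sketch below collects those hyperparameters in PyTorch as a reading aid; the network architectures, observation/action dimensions, and helper names are illustrative assumptions (the paper defers the actual architectures and flow learning rates to its Appendix B), not the authors' implementation.

```python
import copy

import torch
import torch.nn as nn

# Flow pre-training settings quoted above (epoch count depends on the environment).
FLOW_BATCH_SIZE = 5000
FLOW_EPOCHS = {"Reacher": 5_000, "HalfCheetah": 5_000, "Hopper": 5_000,
               "Walker2d": 5_000, "BSS": 20_000}

# Placeholder actor/critic networks; the paper's actual architectures are given in
# its Appendix B, and the dimensions below are illustrative only.
OBS_DIM, ACT_DIM = 17, 6
actor = nn.Sequential(nn.Linear(OBS_DIM, 256), nn.ReLU(),
                      nn.Linear(256, ACT_DIM), nn.Tanh())
critic = nn.Sequential(nn.Linear(OBS_DIM + ACT_DIM, 256), nn.ReLU(),
                       nn.Linear(256, 1))
actor_target, critic_target = copy.deepcopy(actor), copy.deepcopy(critic)

# Adam optimizers with the quoted learning rates (1e-4 actor, 1e-3 critic).
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

TAU = 0.001           # soft target update coefficient
BATCH_SIZE = 64       # mini-batch size for both actor and critic updates
BUFFER_SIZE = 10**6   # replay buffer capacity
NOISE_STD = 0.1       # zero-mean Gaussian exploration noise (mu = 0, sigma = 0.1)


def soft_update(target: nn.Module, source: nn.Module, tau: float = TAU) -> None:
    """Polyak-average the source parameters into the target network."""
    with torch.no_grad():
        for t, s in zip(target.parameters(), source.parameters()):
            t.mul_(1.0 - tau).add_(tau * s)


def add_exploration_noise(action: torch.Tensor) -> torch.Tensor:
    """Perturb a deterministic action with N(0, 0.1^2) noise for exploration."""
    return action + NOISE_STD * torch.randn_like(action)
```

In a training loop following this setup, `soft_update(actor_target, actor)` and `soft_update(critic_target, critic)` would be applied after each gradient step on mini-batches of 64 transitions sampled from the 10^6-capacity replay buffer.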