FlowPG: Action-constrained Policy Gradient with Normalizing Flows
Authors: Janaka Brahmanage, Jiajing Ling, Akshat Kumar
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, our approach results in significantly fewer constraint violations (up to an order of magnitude for several instances) and is multiple times faster on a variety of continuous control tasks. |
| Researcher Affiliation | Academia | Janaka Chathuranga Brahmanage, Jiajing Ling, Akshat Kumar School of Computing and Information Systems Singapore Management University |
| Pseudocode | Yes | Algorithm 1: FlowPG Algorithm |
| Open Source Code | Yes | The source code of our implementation is publicly available at https://github.com/rlr-smu/flow-pg. |
| Open Datasets | Yes | We evaluate our proposed approach on four continuous control RL tasks in the MuJoCo environment [31] and one resource allocation problem that has been used in previous works [4, 18]. |
| Dataset Splits | No | The paper describes experiments in reinforcement learning environments and does not mention explicit training/validation/test dataset splits in terms of percentages or sample counts. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments (e.g., GPU models, CPU types). |
| Software Dependencies | No | The paper mentions software components such as the Adam optimizer, PyTorch, and TensorFlow, but does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | We apply batch gradient descent to train the conditional flow with the Adam optimizer and a batch size of 5000. For the Reacher, HalfCheetah, Hopper and Walker2d environments, we train the model for 5K epochs. For the BSS environment, we train for 20K epochs. Further details about training the flow, such as learning rates and the neural network architecture of the model, are provided in Appendix B... The Adam [14] optimizer was used to train both Actor and Critic networks with learning rates of 10^-4 and 10^-3, respectively. For the soft target update, we used τ = 0.001. We used mini-batches of size 64 to train both Actor and Critic networks. We used Gaussian noise with µ = 0 and σ = 0.1 as action noise for exploration. The replay buffer size was 10^6. |
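
For convenience, the quoted experiment setup can be collected into a small configuration sketch. The class and field names below are illustrative assumptions (they do not come from the FlowPG codebase); only the numeric values are taken from the Experiment Setup row above.

```python
from dataclasses import dataclass

# Minimal sketch of the reported training settings; names are hypothetical.

@dataclass
class FlowTrainConfig:
    """Conditional normalizing-flow pre-training settings (full details in Appendix B of the paper)."""
    optimizer: str = "adam"
    batch_size: int = 5_000
    epochs_mujoco: int = 5_000     # Reacher, HalfCheetah, Hopper, Walker2d
    epochs_bss: int = 20_000       # BSS resource-allocation task

@dataclass
class ActorCriticConfig:
    """Actor-critic (DDPG-style) settings quoted in the Experiment Setup row."""
    actor_lr: float = 1e-4
    critic_lr: float = 1e-3
    tau: float = 1e-3              # soft target-network update rate
    minibatch_size: int = 64
    exploration_noise_mu: float = 0.0
    exploration_noise_sigma: float = 0.1   # Gaussian action noise
    replay_buffer_size: int = 1_000_000

if __name__ == "__main__":
    # Print the collected hyperparameters as a quick sanity check.
    print(FlowTrainConfig())
    print(ActorCriticConfig())
```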