f-Policy Gradients: A General Framework for Goal-Conditioned RL using f-Divergences
Authors: Siddhant Agarwal, Ishan Durugkar, Peter Stone, Amy Zhang
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We find that f-PG has better performance compared to standard policy gradient methods on a challenging gridworld as well as the Point Maze and Fetch Reach environments. |
| Researcher Affiliation | Collaboration | Siddhant Agarwal, The University of Texas at Austin (siddhant@cs.utexas.edu); Ishan Durugkar, Sony AI (ishan.durugkar@sony.com); Peter Stone, The University of Texas at Austin and Sony AI (pstone@cs.utexas.edu); Amy Zhang, The University of Texas at Austin (amy.zhang@austin.utexas.edu) |
| Pseudocode | Yes | Algorithm 1 f-PG |
| Open Source Code | No | More information on our website https://agarwalsiddhant10.github.io/projects/fpg.html. The linked page is a project overview that lists the code as "coming soon"; the code is not currently available. |
| Open Datasets | Yes | We use the Point Maze environments (Fu et al., 2020), which are a set of offline RL environments, and modify them to support our online algorithms. |
| Dataset Splits | No | The paper describes the environments and reports success rates, but does not provide specific details on how the data was split into training, validation, and testing sets. |
| Hardware Specification | No | The paper does not mention any specific hardware used for running the experiments, such as GPU models, CPU types, or memory. |
| Software Dependencies | No | The paper mentions software like Soft Q Learning and PPO implementations but does not specify version numbers for these or any other software dependencies. |
| Experiment Setup | No | The paper describes the general algorithm and environment characteristics (e.g., goal distribution standard deviation for Reacher, initial state sampling for Point Maze) but does not provide specific hyperparameters such as learning rate, batch size, or optimizer settings used for training the models in the experiments. |
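
The Pseudocode row above points to Algorithm 1 (f-PG) in the paper, which is not reproduced here. As a rough, illustrative sketch of the f-divergence machinery the title refers to, the snippet below computes a generic f-divergence D_f(P ‖ Q) = Σ_x Q(x) f(P(x)/Q(x)) between two discrete distributions, with forward KL (f(u) = u log u) as one instance. The function names, the gridworld framing, and the example numbers are assumptions made for illustration; this is not the authors' Algorithm 1 or their exact objective.

```python
import numpy as np

def f_divergence(p, q, f, eps=1e-12):
    """Generic f-divergence D_f(P || Q) = sum_x Q(x) * f(P(x) / Q(x))
    for two discrete distributions given as 1-D arrays summing to 1.

    `f` must be convex with f(1) = 0; `eps` guards against division by zero.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    ratio = (p + eps) / (q + eps)
    return float(np.sum(q * f(ratio)))

# Forward KL as one member of the f-divergence family: f(u) = u * log(u).
kl_generator = lambda u: u * np.log(u)

# Hypothetical example: an agent's state-visitation distribution on a tiny
# gridworld versus a goal distribution concentrated on one state.
# These numbers are invented purely for illustration.
visitation = np.array([0.40, 0.30, 0.20, 0.10])
goal_dist = np.array([0.05, 0.05, 0.10, 0.80])

print(f_divergence(goal_dist, visitation, kl_generator))  # D_KL(goal || visitation)
```

Such a quantity is only a building block: the paper's Algorithm 1 specifies how an f-divergence enters the policy gradient, and the Experiment Setup row above notes that the accompanying training hyperparameters (learning rate, batch size, optimizer settings) are not reported.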