Variance Reduction for Policy Gradient with Action-Dependent Factorized Baselines
Authors: Cathy Wu, Aravind Rajeswaran, Yan Duan, Vikash Kumar, Alexandre M Bayen, Sham Kakade, Igor Mordatch, Pieter Abbeel
ICLR 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate and quantify the benefit of the action-dependent baseline through both theoretical analysis as well as numerical results, including an analysis of the suboptimality of the optimal state-dependent baseline. Our experimental results indicate that action-dependent baselines allow for faster learning on standard reinforcement learning benchmarks and high-dimensional hand manipulation and synthetic tasks. |
| Researcher Affiliation | Collaboration | (1) Department of EECS, UC Berkeley; (2) Department of CSE, University of Washington; (3) OpenAI; (4) Institute for Transportation Studies, UC Berkeley |
| Pseudocode | Yes | Algorithm 1 Policy gradient for factorized policies using action-dependent baselines (a hedged sketch of this estimator appears below the table) |
| Open Source Code | No | The paper states that videos and additional results are available at https://sites.google.com/view/ad-baselines, but this project page does not provide access to the source code for the methodology. |
| Open Datasets | No | The paper uses simulated environments (MuJoCo and synthetic tasks) that generate data dynamically through interaction, rather than pre-existing, publicly available datasets for which concrete access information could be provided. The notion of a 'publicly available dataset' therefore does not directly apply here. |
| Dataset Splits | No | The paper does not provide specific dataset split information (e.g., percentages or counts for training, validation, or testing sets). The experiments are conducted in simulated environments where data is generated through interaction. |
| Hardware Specification | No | The paper mentions using the 'MuJoCo 1.5 simulator' but does not provide specific hardware details such as GPU or CPU models, processor types, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions the 'MuJoCo 1.5 simulator' but does not list other key software components (e.g., programming languages, deep learning frameworks, or libraries) with the specific version numbers required for reproducibility. |
| Experiment Setup | Yes | Parameters: unless otherwise stated, the following parameters are used in the experiments: γ = 0.995, λ_GAE = 0.97, KL_desired = 0.025. Policies: 2-layer fully connected networks with hidden sizes (32, 32). Initialization: Xavier initialization, except that the final-layer weights are scaled down by a factor of 100. Table 2 and Table 3 provide further per-experiment details, including trajectories, horizon, RBF features, and action dimensionality. A hedged configuration sketch appears below the table. |
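
For reference, the estimator behind Algorithm 1 subtracts, for each action factor i, a baseline b_i(s, a_{-i}) that may depend on the *other* action dimensions, and accumulates the per-factor score-function terms. Below is a minimal NumPy sketch of that computation; the function name, argument shapes, and averaging convention are illustrative assumptions, not the authors' (unreleased) code.

```python
import numpy as np

def factorized_pg_estimate(grad_log_pi, q_hat, baselines):
    """Sketch of a policy-gradient estimate with action-dependent
    factorized baselines (illustrative names, not the paper's code).

    grad_log_pi : (T, m, P) array
        Gradient of log pi_i(a_{i,t} | s_t) for each of the m action
        factors, w.r.t. the P policy parameters.
    q_hat : (T,) array
        Return / Q estimates Q_hat(s_t, a_t) (e.g., GAE-based).
    baselines : (T, m) array
        b_i(s_t, a_{-i,t}) -- each factor's baseline may condition on
        the other action dimensions, which makes it action-dependent.
    """
    # Per-factor advantage: Q_hat(s, a) - b_i(s, a_{-i})
    adv = q_hat[:, None] - baselines                      # (T, m)
    # Score-function estimator: sum over factors, average over timesteps
    return np.einsum("tmp,tm->p", grad_log_pi, adv) / len(q_hat)
```

Because each factor's baseline is independent of that factor's own action, subtracting it leaves the gradient estimate unbiased while (per the paper's analysis) reducing its variance relative to a purely state-dependent baseline.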
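The experiment-setup row can likewise be summarized as a small configuration sketch. The numeric values (γ = 0.995, λ_GAE = 0.97, KL_desired = 0.025, hidden sizes (32, 32), Xavier initialization with the final layer scaled down by 100x) are quoted from the paper; the dictionary keys, the `init_policy_params` helper, and the seed handling are hypothetical.

```python
import numpy as np

# Hyperparameters reported in the paper's experiment setup.
HPARAMS = {"gamma": 0.995, "gae_lambda": 0.97, "kl_desired": 0.025}

def xavier_uniform(fan_in, fan_out, rng):
    # Xavier/Glorot uniform initialization
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def init_policy_params(obs_dim, act_dim, hidden=(32, 32), seed=0):
    """Initialize a 2-layer fully connected policy with Xavier init,
    scaling the final-layer weights down by a factor of 100 as described
    in the paper. Helper and argument names are illustrative."""
    rng = np.random.default_rng(seed)
    sizes = (obs_dim, *hidden, act_dim)
    weights = [xavier_uniform(m, n, rng) for m, n in zip(sizes[:-1], sizes[1:])]
    biases = [np.zeros(n) for n in sizes[1:]]
    weights[-1] *= 0.01  # scale down final-layer weights by 100x
    return weights, biases
```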