Outcome-Driven Reinforcement Learning via Variational Inference
Authors: Tim G. J. Rudner, Vitchyr Pong, Rowan McAllister, Yarin Gal, Sergey Levine
NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically demonstrate that this method eliminates the need to handcraft reward functions for a suite of diverse manipulation and locomotion tasks and leads to effective goal-directed behaviors. Unlike prior works that proposed inference methods for finding policies that achieve desired outcomes [3, 11, 19, 48], the resulting algorithm, Outcome-Driven Actor-Critic (ODAC), is amenable to off-policy learning and applicable to complex, high-dimensional continuous control tasks over finite and infinite horizons. In high-dimensional and non-linear domains, the method can be combined with deep neural network function approximators to yield a deep reinforcement learning method that does not require manual specification of rewards and performs well on a range of benchmark tasks. We evaluate ODAC on a range of reinforcement learning tasks without manually specifying task-specific reward functions and find that it learns significantly faster than alternative approaches across a variety of robot manipulation and locomotion tasks. |
| Researcher Affiliation | Collaboration | Tim G. J. Rudner (University of Oxford); Vitchyr H. Pong (University of California, Berkeley); Rowan McAllister (University of California, Berkeley); Yarin Gal (University of Oxford); Sergey Levine (University of California, Berkeley) |
| Pseudocode | Yes | Algorithm 1 ODAC: Outcome-Driven Actor-Critic (a hedged, illustrative sketch of such a training loop is given after the table). |
| Open Source Code | No | The paper does not provide a direct link or explicit statement about the release of its own source code for the methodology described. |
| Open Datasets | Yes | Environments. We compare ODAC to prior work on a simple 2D navigation task, in which an agent must take non-greedy actions to move around a box, as well as the Ant, Sawyer Push, and Fetch Push simulated robot domains, which have each been studied in prior work on reinforcement learning for reaching goals [2, 28, 30, 34, 41]. For the Meta-World tasks, this baseline uses the benchmark reward for each task. For the remaining environments, this baseline uses the Euclidean distance between the agent's current state and the desired outcome as the reward. Baselines and Prior Work. We compare our method to hindsight experience replay (HER) [2] ... universal value density estimation (UVD) [41] ... DISCERN [51] ... Soft Actor-Critic (SAC) [12, 16, 17]. (An illustrative sketch of the hindsight goal relabeling used by the HER baseline is given after the table.) |
| Dataset Splits | No | The paper mentions training, but does not specify exact dataset split percentages or sample counts for training, validation, or test sets. |
| Hardware Specification | No | The paper does not specify any particular hardware (GPU, CPU models, etc.) used for running the experiments. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies, such as libraries or frameworks. |
| Experiment Setup | No | The paper provides some high-level details regarding the experimental setup (e.g., uniform action prior, geometric time prior), but lacks specific hyperparameter values (e.g., learning rate, batch size, number of epochs) or other detailed configuration settings typically found in an "Experimental Setup" section. |
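The pseudocode referenced above (Algorithm 1, ODAC) is not accompanied by released source code, so the following is only a minimal, illustrative sketch of the kind of loop the paper describes: an off-policy, goal-conditioned actor-critic in which the per-step reward is the log-likelihood of the desired outcome under a learned dynamics model rather than a hand-crafted distance. The toy point-mass environment, the linear actor/critic/model, and all hyperparameters below are assumptions made for illustration; this is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 2
gamma = 0.95
lr_model, lr_critic, lr_actor = 1e-2, 1e-3, 1e-3

def env_step(s, a):
    """Toy point mass: the next state is the current state plus a bounded move."""
    return np.clip(s + 0.1 * np.tanh(a), -1.0, 1.0)

# Learned dynamics model: next_state ~ N(s + W a, diag(exp(2 * log_sigma))).
W = np.zeros((dim, dim))
log_sigma = np.zeros(dim)

def outcome_log_likelihood(s, a, g):
    """Reward proxy: log-density of the desired outcome g under the learned model."""
    mu = s + W @ a
    var = np.exp(2.0 * log_sigma)
    return float(-0.5 * np.sum((g - mu) ** 2 / var + 2.0 * log_sigma + np.log(2.0 * np.pi)))

# Linear goal-conditioned actor and critic over features [s, g] and [s, g, a].
policy_K = np.zeros((dim, 2 * dim))
critic_w = np.zeros(3 * dim)

def act(s, g, noise=0.3):
    a = policy_K @ np.concatenate([s, g]) + noise * rng.normal(size=dim)
    return np.clip(a, -1.0, 1.0)

def q_value(s, g, a):
    return float(critic_w @ np.concatenate([s, g, a]))

replay = []
for episode in range(200):
    s = rng.uniform(-1.0, 1.0, dim)
    g = rng.uniform(-1.0, 1.0, dim)   # desired outcome; no reward function is specified
    for _ in range(30):
        a = act(s, g)
        s_next = env_step(s, a)
        replay.append((s, a, s_next, g))
        s = s_next

    for _ in range(64):               # off-policy updates from replayed transitions
        s_b, a_b, s_next_b, g_b = replay[rng.integers(len(replay))]

        # 1) Fit the dynamics model by gradient ascent on its log-likelihood.
        mu = s_b + W @ a_b
        var = np.exp(2.0 * log_sigma)
        err = (s_next_b - mu) / var
        W += lr_model * np.outer(err, a_b)
        log_sigma = np.clip(log_sigma + lr_model * ((s_next_b - mu) ** 2 / var - 1.0), -1.0, 2.0)

        # 2) Critic update with the outcome log-likelihood as the reward.
        r = outcome_log_likelihood(s_b, a_b, g_b)
        a_next = act(s_next_b, g_b, noise=0.0)
        feats = np.concatenate([s_b, g_b, a_b])
        td_target = r + gamma * q_value(s_next_b, g_b, a_next)
        critic_w += lr_critic * (td_target - critic_w @ feats) * feats

        # 3) Actor update: increase Q(s, g, pi(s, g)); for a linear critic,
        #    dQ/da is simply the action block of the critic weights.
        dq_da = critic_w[2 * dim:]
        policy_K += lr_actor * np.outer(dq_da, np.concatenate([s_b, g_b]))
```

The structural points carried over from the paper's description are the absence of a manually specified reward and the use of replayed (off-policy) transitions; every modelling choice beyond that is a simplification.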
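The main baseline listed above, hindsight experience replay (HER), is characterized by goal relabeling: transitions from a trajectory are reused with the commanded goal replaced by an outcome that was actually achieved later in that trajectory. The sketch below shows this "future" relabeling strategy in isolation; the function name, dictionary fields, and the choice of k are hypothetical and not taken from any specific codebase.

```python
import random

def relabel_with_hindsight(trajectory, k=4):
    """trajectory: list of dicts with keys 's', 'a', 's_next', 'goal'.

    Returns k additional copies of each transition whose goal has been
    replaced by a state actually reached later in the same trajectory."""
    relabeled = []
    for t, step in enumerate(trajectory):
        future = trajectory[t:]                          # outcomes reached from here on
        for _ in range(k):
            achieved = random.choice(future)["s_next"]   # treat it as if it were the goal
            relabeled.append({**step, "goal": achieved})
    return relabeled
```

Relabeled transitions are simply appended to the replay buffer alongside the originals, so any off-policy learner (such as the actor-critic sketch above) can consume them unchanged.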