Outcome-Driven Reinforcement Learning via Variational Inference

Authors: Tim G. J. Rudner, Vitchyr Pong, Rowan McAllister, Yarin Gal, Sergey Levine

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically demonstrate that this method eliminates the need to handcraft reward functions for a suite of diverse manipulation and locomotion tasks and leads to effective goal-directed behaviors. We demonstrate that, unlike prior works that proposed inference methods for finding policies that achieve desired outcomes [3, 11, 19, 48], the resulting algorithm, Outcome-Driven Actor-Critic (ODAC), is amenable to off-policy learning and applicable to complex, high-dimensional continuous control tasks over finite and infinite horizons. In high-dimensional and non-linear domains, our method can be combined with deep neural network function approximators to yield a deep reinforcement learning method that does not require manual specification of rewards and leads to good performance on a range of benchmark tasks. We evaluate this algorithm, Outcome-Driven Actor-Critic (ODAC), on a range of reinforcement learning tasks without having to manually specify task-specific reward functions. In our experiments, we find that our method results in significantly faster learning across a variety of robot manipulation and locomotion tasks than alternative approaches. (The generic variational bound behind this kind of formulation is sketched after the table.)
Researcher Affiliation | Collaboration | Tim G. J. Rudner, University of Oxford; Vitchyr H. Pong, University of California, Berkeley; Rowan McAllister, University of California, Berkeley; Yarin Gal, University of Oxford; Sergey Levine, University of California, Berkeley
Pseudocode | Yes | Algorithm 1 (ODAC: Outcome-Driven Actor-Critic). An illustrative, non-authoritative training-loop sketch follows the table.
Open Source Code | No | The paper does not provide a direct link or explicit statement about the release of its own source code for the methodology described.
Open Datasets | Yes | Environments. We compare ODAC to prior work on a simple 2D navigation task, in which an agent must take non-greedy actions to move around a box, as well as the Ant, Sawyer Push, and Fetch Push simulated robot domains, which have each been studied in prior work on reinforcement learning for reaching goals [2, 28, 30, 34, 41]. For the Meta-World tasks, this baseline uses the benchmark reward for each task. For the remaining environments, this baseline uses the Euclidean distance between the agent's current state and the desired outcome as the reward. Baselines and Prior Work. We compare our method to hindsight experience replay (HER) [2] ... universal value density estimation (UVD) [41] ... DISCERN [51] ... Soft Actor-Critic (SAC) [12, 16, 17]. (The distance-based baseline reward is sketched after the table.)
Dataset Splits | No | The paper mentions training, but does not specify exact dataset split percentages or sample counts for training, validation, or test sets.
Hardware Specification | No | The paper does not specify any particular hardware (GPU, CPU models, etc.) used for running the experiments.
Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies, such as libraries or frameworks.
Experiment Setup | No | The paper provides some high-level details of the experimental setup (e.g., a uniform action prior and a geometric time prior), but lacks specific hyperparameter values (e.g., learning rate, batch size, number of epochs) and other detailed configuration settings typically found in an "Experimental Setup" section. (A small illustration of the geometric time prior follows below.)
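The quotes in the "Research Type" row describe deriving goal-reaching behavior from a variational inference problem. As a point of reference only, this is the generic evidence lower bound that control-as-inference methods build on, not the paper's specific derivation, and the symbols e, tau, p, and q are our own notation:

```latex
\log p(e)
  \;=\; \log \int p(e \mid \tau)\, p(\tau)\, \mathrm{d}\tau
  \;\ge\; \mathbb{E}_{q(\tau)}\!\bigl[\log p(e \mid \tau)\bigr]
          \;-\; \mathrm{KL}\bigl(q(\tau) \,\|\, p(\tau)\bigr)
```

Here e is the desired outcome event, tau a trajectory, p(tau) the trajectory distribution under a prior policy, and q(tau) the variational trajectory distribution induced by the learned policy; maximizing the right-hand side over q is what lets an outcome log-likelihood term play the role of a reward.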
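The "Pseudocode" row refers to Algorithm 1 in the paper, which is not reproduced here. The following is only a minimal sketch of what an outcome-driven, off-policy actor-critic loop of this flavor can look like, written in PyTorch on a toy 2D point environment; the Gaussian surrogate used as the outcome log-likelihood reward, the toy dynamics, the network sizes, and all hyperparameters are assumptions, not values from the paper.

```python
import numpy as np
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=64):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

obs_dim, act_dim, goal_dim, gamma = 2, 2, 2, 0.98
actor = mlp(obs_dim + goal_dim, act_dim)            # outcome-conditioned policy
critic = mlp(obs_dim + goal_dim + act_dim, 1)       # outcome-conditioned Q-function
actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=3e-4)
buffer = []                                          # off-policy replay buffer

def log_outcome_likelihood(next_obs, goal, sigma=0.1):
    # Assumed surrogate reward: log of a Gaussian "outcome achieved" likelihood
    # centered on the desired outcome (up to an additive constant).
    return -((next_obs - goal) ** 2).sum(-1) / (2 * sigma ** 2)

for step in range(2000):
    # Collect one transition with the current policy plus exploration noise.
    obs = np.random.uniform(-1, 1, obs_dim).astype(np.float32)
    goal = np.random.uniform(-1, 1, goal_dim).astype(np.float32)
    with torch.no_grad():
        act = actor(torch.from_numpy(np.concatenate([obs, goal]))).numpy()
    act = np.clip(act + 0.1 * np.random.randn(act_dim), -1.0, 1.0).astype(np.float32)
    next_obs = np.clip(obs + 0.1 * act, -1.0, 1.0).astype(np.float32)  # toy dynamics
    buffer.append((obs, act, next_obs, goal))

    # Off-policy update on a minibatch sampled from the replay buffer.
    idx = np.random.randint(len(buffer), size=min(64, len(buffer)))
    o, a, o2, g = (torch.from_numpy(np.stack(x)) for x in zip(*[buffer[i] for i in idx]))
    r = log_outcome_likelihood(o2, g)                # reward without manual shaping
    with torch.no_grad():
        a2 = actor(torch.cat([o2, g], dim=-1))
        target = r + gamma * critic(torch.cat([o2, g, a2], dim=-1)).squeeze(-1)
    q = critic(torch.cat([o, g, a], dim=-1)).squeeze(-1)
    critic_loss = ((q - target) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    actor_loss = -critic(torch.cat([o, g, actor(torch.cat([o, g], dim=-1))], dim=-1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```

The sketch keeps only the structural points emphasized in the quoted text: transitions are stored in a replay buffer and reused off-policy, the policy and Q-function are conditioned on the desired outcome, and the reward is an outcome log-likelihood rather than a hand-designed task reward.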
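The "Open Datasets" row mentions a baseline that scores transitions by the Euclidean distance between the agent's current state and the desired outcome. A minimal sketch of such a hand-specified reward follows; the negated-distance form is an assumption.

```python
import numpy as np

def euclidean_distance_reward(achieved, desired):
    # Hand-specified baseline reward: negative Euclidean distance between the
    # currently achieved outcome and the desired one (assumed sign convention).
    achieved, desired = np.asarray(achieved), np.asarray(desired)
    return -np.linalg.norm(achieved - desired, axis=-1)

# Example: reward is 0 exactly at the desired outcome and decreases with distance.
print(euclidean_distance_reward([0.0, 0.0], [3.0, 4.0]))  # -5.0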
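The "Experiment Setup" row notes that the paper reports a uniform action prior and a geometric time prior but few concrete hyperparameters. Below is a small illustration of a geometric prior over the time at which the outcome is realised and its familiar link to discounting; the gamma value and horizon are arbitrary choices, not taken from the paper.

```python
import numpy as np

gamma, horizon = 0.98, 500
t = np.arange(horizon)

# Geometric time prior: P(T = t) = (1 - gamma) * gamma**t, i.e. the usual
# gamma-discount weights rescaled by the normalizer (1 - gamma).
prior = (1 - gamma) * gamma ** t
print(prior.sum())  # ~1.0 (mass beyond the horizon is negligible)

# Sampling view: numpy's geometric distribution has support {1, 2, ...},
# so shifting by 1 gives the same prior over {0, 1, ...}.
samples = np.random.geometric(p=1 - gamma, size=100_000) - 1
print(samples.mean(), gamma / (1 - gamma))  # empirical vs. analytical mean (~49)
```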