Outcome-Driven Reinforcement Learning via Variational Inference

Authors: Tim G. J. Rudner, Vitchyr Pong, Rowan McAllister, Yarin Gal, Sergey Levine

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically demonstrate that this method eliminates the need to handcraft reward functions for a suite of diverse manipulation and locomotion tasks and leads to effective goal-directed behaviors. We demonstrate that, unlike prior works that proposed inference methods for finding policies that achieve desired outcomes [3, 11, 19, 48], the resulting algorithm, Outcome-Driven Actor-Critic (ODAC), is amenable to off-policy learning and applicable to complex, high-dimensional continuous control tasks over finite and infinite horizons. In high-dimensional and non-linear domains, our method can be combined with deep neural network function approximators to yield a deep reinforcement learning method that does not require manual specification of rewards and leads to good performance on a range of benchmark tasks. We evaluate this algorithm, Outcome-Driven Actor-Critic (ODAC), on a range of reinforcement learning tasks without having to manually specify task-specific reward functions. In our experiments, we find that our method results in significantly faster learning across a variety of robot manipulation and locomotion tasks than alternative approaches. (The generic variational bound behind this kind of formulation is sketched after the table.)
Researcher Affiliation | Collaboration | Tim G. J. Rudner, University of Oxford; Vitchyr H. Pong, University of California, Berkeley; Rowan McAllister, University of California, Berkeley; Yarin Gal, University of Oxford; Sergey Levine, University of California, Berkeley
Pseudocode | Yes | Algorithm 1 (ODAC: Outcome-Driven Actor-Critic). An illustrative, non-authoritative training-loop sketch follows the table.
Open Source Code | No | The paper does not provide a direct link or explicit statement about the release of its own source code for the methodology described.
Open Datasets | Yes | Environments. We compare ODAC to prior work on a simple 2D navigation task, in which an agent must take non-greedy actions to move around a box, as well as the Ant, Sawyer Push, and Fetch Push simulated robot domains, which have each been studied in prior work on reinforcement learning for reaching goals [2, 28, 30, 34, 41]. For the Meta-World tasks, this baseline uses the benchmark reward for each task. For the remaining environments, this baseline uses the Euclidean distance between the agent's current state and the desired outcome as the reward. Baselines and Prior Work. We compare our method to hindsight experience replay (HER) [2] ... universal value density estimation (UVD) [41] ... DISCERN [51] ... Soft Actor-Critic (SAC) [12, 16, 17]. (The distance-based baseline reward is sketched after the table.)
Dataset Splits | No | The paper mentions training, but does not specify exact dataset split percentages or sample counts for training, validation, or test sets.
Hardware Specification | No | The paper does not specify any particular hardware (GPU, CPU models, etc.) used for running the experiments.
Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies, such as libraries or frameworks.
Experiment Setup | No | The paper provides some high-level details of the experimental setup (e.g., a uniform action prior and a geometric time prior), but lacks specific hyperparameter values (e.g., learning rate, batch size, number of epochs) and other detailed configuration settings typically found in an "Experimental Setup" section. (A small illustration of the geometric time prior follows below.)
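The quotes in the "Research Type" row describe deriving goal-reaching behavior from a variational inference problem. As a point of reference only, this is the generic evidence lower bound that control-as-inference methods build on, not the paper's specific derivation, and the symbols e, tau, p, and q are our own notation:

```latex
\log p(e)
  \;=\; \log \int p(e \mid \tau)\, p(\tau)\, \mathrm{d}\tau
  \;\ge\; \mathbb{E}_{q(\tau)}\!\bigl[\log p(e \mid \tau)\bigr]
          \;-\; \mathrm{KL}\bigl(q(\tau) \,\|\, p(\tau)\bigr)
```

Here e is the desired outcome event, tau a trajectory, p(tau) the trajectory distribution under a prior policy, and q(tau) the variational trajectory distribution induced by the learned policy; maximizing the right-hand side over q is what lets an outcome log-likelihood term play the role of a reward.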
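The "Pseudocode" row refers to Algorithm 1 in the paper, which is not reproduced here. The following is only a minimal sketch of what an outcome-driven, off-policy actor-critic loop of this flavor can look like, written in PyTorch on a toy 2D point environment; the Gaussian surrogate used as the outcome log-likelihood reward, the toy dynamics, the network sizes, and all hyperparameters are assumptions, not values from the paper.

```python
import numpy as np
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=64):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

obs_dim, act_dim, goal_dim, gamma = 2, 2, 2, 0.98
actor = mlp(obs_dim + goal_dim, act_dim)            # outcome-conditioned policy
critic = mlp(obs_dim + goal_dim + act_dim, 1)       # outcome-conditioned Q-function
actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=3e-4)
buffer = []                                          # off-policy replay buffer

def log_outcome_likelihood(next_obs, goal, sigma=0.1):
    # Assumed surrogate reward: log of a Gaussian "outcome achieved" likelihood
    # centered on the desired outcome (up to an additive constant).
    return -((next_obs - goal) ** 2).sum(-1) / (2 * sigma ** 2)

for step in range(2000):
    # Collect one transition with the current policy plus exploration noise.
    obs = np.random.uniform(-1, 1, obs_dim).astype(np.float32)
    goal = np.random.uniform(-1, 1, goal_dim).astype(np.float32)
    with torch.no_grad():
        act = actor(torch.from_numpy(np.concatenate([obs, goal]))).numpy()
    act = np.clip(act + 0.1 * np.random.randn(act_dim), -1.0, 1.0).astype(np.float32)
    next_obs = np.clip(obs + 0.1 * act, -1.0, 1.0).astype(np.float32)  # toy dynamics
    buffer.append((obs, act, next_obs, goal))

    # Off-policy update on a minibatch sampled from the replay buffer.
    idx = np.random.randint(len(buffer), size=min(64, len(buffer)))
    o, a, o2, g = (torch.from_numpy(np.stack(x)) for x in zip(*[buffer[i] for i in idx]))
    r = log_outcome_likelihood(o2, g)                # reward without manual shaping
    with torch.no_grad():
        a2 = actor(torch.cat([o2, g], dim=-1))
        target = r + gamma * critic(torch.cat([o2, g, a2], dim=-1)).squeeze(-1)
    q = critic(torch.cat([o, g, a], dim=-1)).squeeze(-1)
    critic_loss = ((q - target) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    actor_loss = -critic(torch.cat([o, g, actor(torch.cat([o, g], dim=-1))], dim=-1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```

The sketch keeps only the structural points emphasized in the quoted text: transitions are stored in a replay buffer and reused off-policy, the policy and Q-function are conditioned on the desired outcome, and the reward is an outcome log-likelihood rather than a hand-designed task reward.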
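The "Open Datasets" row mentions a baseline that scores transitions by the Euclidean distance between the agent's current state and the desired outcome. A minimal sketch of such a hand-specified reward follows; the negated-distance form is an assumption.

```python
import numpy as np

def euclidean_distance_reward(achieved, desired):
    # Hand-specified baseline reward: negative Euclidean distance between the
    # currently achieved outcome and the desired one (assumed sign convention).
    achieved, desired = np.asarray(achieved), np.asarray(desired)
    return -np.linalg.norm(achieved - desired, axis=-1)

# Example: reward is 0 exactly at the desired outcome and decreases with distance.
print(euclidean_distance_reward([0.0, 0.0], [3.0, 4.0]))  # -5.0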
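The "Experiment Setup" row notes that the paper reports a uniform action prior and a geometric time prior but few concrete hyperparameters. Below is a small illustration of a geometric prior over the time at which the outcome is realised and its familiar link to discounting; the gamma value and horizon are arbitrary choices, not taken from the paper.

```python
import numpy as np

gamma, horizon = 0.98, 500
t = np.arange(horizon)

# Geometric time prior: P(T = t) = (1 - gamma) * gamma**t, i.e. the usual
# gamma-discount weights rescaled by the normalizer (1 - gamma).
prior = (1 - gamma) * gamma ** t
print(prior.sum())  # ~1.0 (mass beyond the horizon is negligible)

# Sampling view: numpy's geometric distribution has support {1, 2, ...},
# so shifting by 1 gives the same prior over {0, 1, ...}.
samples = np.random.geometric(p=1 - gamma, size=100_000) - 1
print(samples.mean(), gamma / (1 - gamma))  # empirical vs. analytical mean (~49)
```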