Predictive Information Accelerates Learning in RL

Authors: Kuang-Huei Lee, Ian Fischer, Anthony Liu, Yijie Guo, Honglak Lee, John Canny, Sergio Guadarrama

NeurIPS 2020

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that PI-SAC agents can substantially improve sample efficiency over challenging baselines on tasks from the DM Control suite of continuous control environments. We evaluate PI-SAC agents by comparing against uncompressed PI-SAC agents, other compressed and uncompressed agents, and SAC agents directly trained from pixels. |
| Researcher Affiliation | Collaboration | Kuang-Huei Lee (Google Research, leekh@google.com); Ian Fischer (Google Research, iansf@google.com); Anthony Z. Liu (University of Michigan, anthliu@umich.edu); Yijie Guo (University of Michigan, guoyijie@umich.edu); Honglak Lee (Google Research, honglak@google.com); John Canny (Google Research, canny@google.com); Sergio Guadarrama (Google Research, sguada@google.com) |
| Pseudocode | Yes | Algorithm 1: Training Algorithm for PI-SAC |
| Open Source Code | Yes | Our implementation is given on GitHub: https://github.com/google-research/pisac |
| Open Datasets | Yes | We evaluate PI-SAC on the DeepMind Control suite [42] and compare with leading model-free and model-based approaches for continuous control from pixels. |
| Dataset Splits | No | The paper uses continuous control environments and does not specify traditional train/validation/test dataset splits with percentages or counts, as is common in supervised learning. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions using standard hyperparameters and architectures similar to other works, but does not list specific software dependencies with version numbers (e.g., Python, TensorFlow, PyTorch versions). |
| Experiment Setup | Yes | Throughout these experiments we mostly use the standard SAC hyperparameters [16], including the sizes of the actor and critic networks, learning rates, and target critic update rate. Unless otherwise specified, we set CEB β = 0.01. We report our results with the best number of gradient updates per environment step in Section 4.1, and use one gradient update per environment step for the rest of the experiments. Full details of hyperparameters are listed in Section A.2. (A configuration sketch based on these settings follows the table.) |
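
The sketch below collects the experiment-setup values quoted above into a single Python configuration object, for readers who want to mirror the reported setup in their own training script. It is a minimal illustration, not the authors' actual configuration: the class and field names (`PISACConfig`, `ceb_beta`, `grad_updates_per_env_step`, etc.) are hypothetical, and the values marked as SAC defaults are commonly published ones rather than numbers taken from this report or the paper's Section A.2.

```python
from dataclasses import dataclass


@dataclass
class PISACConfig:
    """Hypothetical PI-SAC training configuration.

    Only `ceb_beta` and `grad_updates_per_env_step` come from the
    experiment-setup text quoted in the table; every other value is a
    commonly used SAC default and may differ from the paper's Section A.2.
    """
    # Quoted in the report: CEB compression strength and update ratio.
    ceb_beta: float = 0.01
    grad_updates_per_env_step: int = 1

    # Standard SAC-style settings (assumed defaults, not from this report).
    actor_learning_rate: float = 3e-4
    critic_learning_rate: float = 3e-4
    target_critic_update_rate: float = 0.005  # Polyak averaging coefficient
    discount: float = 0.99
    batch_size: int = 256


def training_loop_skeleton(config: PISACConfig, num_env_steps: int = 1000) -> None:
    """Shows how the update ratio is applied: after each environment step,
    run `grad_updates_per_env_step` gradient updates."""
    for _step in range(num_env_steps):
        # ... collect one environment transition into the replay buffer ...
        for _ in range(config.grad_updates_per_env_step):
            pass  # ... one SAC + CEB gradient update would go here ...


if __name__ == "__main__":
    print(PISACConfig())
```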