Efficient Empowerment Estimation for Unsupervised Stabilization

Authors: Ruihan Zhao, Kevin Lu, Pieter Abbeel, Stas Tiomkin

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate our solution for sample-based unsupervised stabilization on different dynamical control systems and show the advantages of our method by comparing it to the existing VLB approaches. Specifically, we show that our method has a lower sample complexity, is more stable in training, possesses the essential properties of the empowerment function, and allows estimation of empowerment from images.
Researcher Affiliation | Academia | Ruihan Zhao, Kevin Lu, Pieter Abbeel, Stas Tiomkin; Berkeley Artificial Intelligence Research Lab, Electrical Engineering and Computer Sciences, University of California, Berkeley, CA, USA
Pseudocode | Yes | Algorithm 1: Latent Gaussian Channel Empowerment Maximization
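The paper's Algorithm 1 itself is not reproduced here. Its name suggests that the core quantity is the capacity of a latent Gaussian channel, so the following is a minimal sketch, under that assumption, of computing the empowerment of a linear Gaussian channel z' = B a + noise via water-filling over the singular values of B. The function name, the unit-noise model, and the power constraint are illustrative choices, not taken from the paper; in the paper, B would correspond to the learned latent action-to-transition matrix, whereas here it is just a fixed example.

```python
# Minimal sketch (not the authors' Algorithm 1): empowerment of a linear Gaussian
# channel z' = B a + N(0, I) under the power constraint ||a||^2 <= power is the
# channel capacity, obtained by water-filling over the singular values of B.
import numpy as np

def gaussian_channel_empowerment(B, power=1.0):
    """Capacity in nats of the channel z' = B a + N(0, I) with ||a||^2 <= power."""
    sigma = np.linalg.svd(B, compute_uv=False)
    sigma = sigma[sigma > 1e-8]            # drop numerically zero modes
    if sigma.size == 0:
        return 0.0
    inv_gain = np.sort(1.0 / sigma**2)     # noise-to-gain ratios, strongest mode first
    # Water-filling: find the largest k for which the water level exceeds 1/sigma_k^2.
    for k in range(inv_gain.size, 0, -1):
        level = (power + inv_gain[:k].sum()) / k
        if level > inv_gain[k - 1]:
            break
    p = np.maximum(level - inv_gain, 0.0)  # optimal power allocation per mode
    return 0.5 * np.sum(np.log(1.0 + p / inv_gain))

# Example: a 2-dimensional latent control matrix
B = np.array([[1.0, 0.2],
              [0.0, 0.5]])
print(gaussian_channel_empowerment(B, power=1.0))
```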
Open Source Code | Yes | Project page: https://sites.google.com/view/latent-gce
Open Datasets | Yes | We demonstrate the advantages of our method through comparisons to the existing state-of-the-art empowerment estimators in different dynamical systems from the OpenAI Gym simulator (Brockman et al. (2016)).
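A minimal sketch of loading and stepping one OpenAI Gym classic-control task follows; the specific environment ID (Pendulum-v0) and the pre-0.26 reset/step API are assumptions consistent with the paper's publication date, not details stated in the excerpt above.

```python
# Minimal sketch: random-action rollout in an OpenAI Gym control environment.
# The environment ID below is an assumption; the paper only names the simulator.
import gym

env = gym.make("Pendulum-v0")           # classic pendulum stabilization task
obs = env.reset()
for _ in range(200):
    action = env.action_space.sample()  # random action, just to exercise the API
    obs, reward, done, info = env.step(action)
    if done:
        obs = env.reset()
env.close()
```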
Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits with specific percentages or counts. It mentions training and evaluating policies but not how the data for these processes is formally split.
Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments (e.g., GPU models, CPU types, or cloud instance specifications).
Software Dependencies | No | The paper mentions 'standard RL algorithms such as Proximal Policy Optimization (Schulman et al. (2017)) and Soft Actor-Critic (Haarnoja et al. (2018))' and 'the PPO algorithm (Schulman et al., 2017) from OpenAI Baselines (Dhariwal et al., 2017)', but does not specify version numbers for these software components.
Experiment Setup | Yes | The policy is updated every 1000 steps in the environment, with a learning rate of 1e-4 and γ = 0.5. ... For our experiments using DADS, we use a policy and discriminator architecture of two hidden layers of size 256 with ReLU activations and a latent dimension of 2. Our discount factor is 0.99. As per suggestions from the original paper, we use 512 prior samples to approximate the intrinsic reward, and run 32 discriminator updates and 128 policy updates per 1000 timesteps; learning rates of 3 × 10^-4 and a batch size of 256 are used throughout. We use a replay buffer of size 20000 to help stabilize training, used only for policy updates. ... After a parameter search, we used a discount factor of γ = 0.95 over a total of 10^6 steps. The reward function that we choose is: R(s, a) = 1_goal + β · Emp(s)
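As a hedged illustration of the reward R(s, a) = 1_goal + β · Emp(s) quoted above, the sketch below wraps a Gym environment and replaces its reward with a goal indicator plus a weighted empowerment bonus. `empowerment_fn` and `is_goal` are hypothetical callables standing in for the paper's learned empowerment estimator and goal test; they are not part of the released code.

```python
# Hedged sketch of R(s, a) = 1_goal + beta * Emp(s); `empowerment_fn` and `is_goal`
# are hypothetical stand-ins for the learned empowerment estimator and goal test.
import gym

class GoalPlusEmpowermentReward(gym.Wrapper):
    def __init__(self, env, empowerment_fn, is_goal, beta=1.0):
        super().__init__(env)
        self.empowerment_fn = empowerment_fn  # maps an observation to a scalar empowerment estimate
        self.is_goal = is_goal                # maps an observation to True/False
        self.beta = beta                      # weight of the empowerment bonus

    def step(self, action):
        obs, _, done, info = self.env.step(action)  # discard the environment's own reward
        reward = float(self.is_goal(obs)) + self.beta * float(self.empowerment_fn(obs))
        return obs, reward, done, info
```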