Efficient Empowerment Estimation for Unsupervised Stabilization

Authors: Ruihan Zhao, Kevin Lu, Pieter Abbeel, Stas Tiomkin

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate our solution for sample-based unsupervised stabilization on different dynamical control systems and show the advantages of our method by comparing it to the existing VLB approaches. Specifically, we show that our method has a lower sample complexity, is more stable in training, possesses the essential properties of the empowerment function, and allows estimation of empowerment from images.
Researcher Affiliation | Academia | Ruihan Zhao, Kevin Lu, Pieter Abbeel, Stas Tiomkin; Berkeley Artificial Intelligence Research Lab, Electrical Engineering and Computer Sciences, University of California, Berkeley, CA, USA
Pseudocode | Yes | Algorithm 1: Latent Gaussian Channel Empowerment Maximization
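The paper's Algorithm 1 itself is not reproduced here. Its name suggests that the core quantity is the capacity of a latent Gaussian channel, so the following is a minimal sketch, under that assumption, of computing the empowerment of a linear Gaussian channel z' = B a + noise via water-filling over the singular values of B. The function name, the unit-noise model, and the power constraint are illustrative choices, not taken from the paper; in the paper, B would correspond to the learned latent action-to-transition matrix, whereas here it is just a fixed example.

```python
# Minimal sketch (not the authors' Algorithm 1): empowerment of a linear Gaussian
# channel z' = B a + N(0, I) under the power constraint ||a||^2 <= power is the
# channel capacity, obtained by water-filling over the singular values of B.
import numpy as np

def gaussian_channel_empowerment(B, power=1.0):
    """Capacity in nats of the channel z' = B a + N(0, I) with ||a||^2 <= power."""
    sigma = np.linalg.svd(B, compute_uv=False)
    sigma = sigma[sigma > 1e-8]            # drop numerically zero modes
    if sigma.size == 0:
        return 0.0
    inv_gain = np.sort(1.0 / sigma**2)     # noise-to-gain ratios, strongest mode first
    # Water-filling: find the largest k for which the water level exceeds 1/sigma_k^2.
    for k in range(inv_gain.size, 0, -1):
        level = (power + inv_gain[:k].sum()) / k
        if level > inv_gain[k - 1]:
            break
    p = np.maximum(level - inv_gain, 0.0)  # optimal power allocation per mode
    return 0.5 * np.sum(np.log(1.0 + p / inv_gain))

# Example: a 2-dimensional latent control matrix
B = np.array([[1.0, 0.2],
              [0.0, 0.5]])
print(gaussian_channel_empowerment(B, power=1.0))
```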
Open Source Code | Yes | Project page: https://sites.google.com/view/latent-gce
Open Datasets | Yes | We demonstrate the advantages of our method through comparisons to the existing state-of-the-art empowerment estimators in different dynamical systems from the OpenAI Gym simulator (Brockman et al. (2016)).
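A minimal sketch of loading and stepping one OpenAI Gym classic-control task follows; the specific environment ID (Pendulum-v0) and the pre-0.26 reset/step API are assumptions consistent with the paper's publication date, not details stated in the excerpt above.

```python
# Minimal sketch: random-action rollout in an OpenAI Gym control environment.
# The environment ID below is an assumption; the paper only names the simulator.
import gym

env = gym.make("Pendulum-v0")           # classic pendulum stabilization task
obs = env.reset()
for _ in range(200):
    action = env.action_space.sample()  # random action, just to exercise the API
    obs, reward, done, info = env.step(action)
    if done:
        obs = env.reset()
env.close()
```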
Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits with specific percentages or counts. It mentions training and evaluating policies but not how the data for these processes is formally split.
Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments (e.g., GPU models, CPU types, or cloud instance specifications).
Software Dependencies | No | The paper mentions 'standard RL algorithms such as Proximal Policy Optimization (Schulman et al. (2017)) and Soft Actor-Critic (Haarnoja et al. (2018))' and 'the PPO algorithm (Schulman et al., 2017) from OpenAI Baselines (Dhariwal et al., 2017)', but does not specify version numbers for these software components.
Experiment Setup | Yes | The policy is updated every 1000 steps in the environment, with a learning rate of 1e-4 and γ = 0.5. ... For our experiments using DADS, we use a policy and discriminator architecture of two hidden layers of size 256 with ReLU activations and a latent dimension of 2. Our discount factor is 0.99. As per suggestions from the original paper, we use 512 prior samples to approximate the intrinsic reward, and run 32 discriminator updates and 128 policy updates per 1000 timesteps; learning rates of 3 × 10^-4 and a batch size of 256 are used throughout. We use a replay buffer of size 20000 to help stabilize training, used only for policy updates. ... After a parameter search, we used a discount factor of γ = 0.95 over a total of 10^6 steps. The reward function that we choose is: R(s, a) = 1_goal + β · Emp(s)
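As a hedged illustration of the reward R(s, a) = 1_goal + β · Emp(s) quoted above, the sketch below wraps a Gym environment and replaces its reward with a goal indicator plus a weighted empowerment bonus. `empowerment_fn` and `is_goal` are hypothetical callables standing in for the paper's learned empowerment estimator and goal test; they are not part of the released code.

```python
# Hedged sketch of R(s, a) = 1_goal + beta * Emp(s); `empowerment_fn` and `is_goal`
# are hypothetical stand-ins for the learned empowerment estimator and goal test.
import gym

class GoalPlusEmpowermentReward(gym.Wrapper):
    def __init__(self, env, empowerment_fn, is_goal, beta=1.0):
        super().__init__(env)
        self.empowerment_fn = empowerment_fn  # maps an observation to a scalar empowerment estimate
        self.is_goal = is_goal                # maps an observation to True/False
        self.beta = beta                      # weight of the empowerment bonus

    def step(self, action):
        obs, _, done, info = self.env.step(action)  # discard the environment's own reward
        reward = float(self.is_goal(obs)) + self.beta * float(self.empowerment_fn(obs))
        return obs, reward, done, info
```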