Mutual Information State Intrinsic Control

Authors: Rui Zhao, Yang Gao, Pieter Abbeel, Volker Tresp, Wei Xu

ICLR 2021

Reproducibility Variable Result LLM Response
Research Type Experimental Our contributions are three-fold. First, we propose a novel intrinsic motivation (MUSIC) that encourages the agent to have maximum control on its surrounding, based on the natural agent-surrounding separation assumption. Secondly, we propose scalable objectives that make the MUSIC intrinsic reward easy to optimize. Last but not least, we show MUSIC's superior performance, by comparing it with other competitive intrinsic rewards on multiple environments.
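For context on how such a reward could be computed: under the agent-surrounding separation, the observed state is split into an agent state S^a and a surrounding state S^s, and the intrinsic reward is the mutual information between them. The snippet below is a minimal MINE-style (Donsker-Varadhan) sketch of an MI lower bound over a batch of states; it is an illustration under that assumption, not the paper's scalable objective, and the network sizes and function names are hypothetical.

```python
import math
import torch
import torch.nn as nn

class MIEstimator(nn.Module):
    """Statistics network T_phi(s_surr, s_agent) -> scalar score.

    A minimal MINE-style sketch (hypothetical architecture); the paper derives
    its own scalable MI objectives, so this is illustrative only.
    """
    def __init__(self, surr_dim, agent_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(surr_dim + agent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s_surr, s_agent):
        return self.net(torch.cat([s_surr, s_agent], dim=-1))

def mi_lower_bound(estimator, s_surr, s_agent):
    """Donsker-Varadhan lower bound on I(S^s; S^a) over a batch.

    Joint samples pair the surrounding and agent states from the same
    transition; the product of marginals is approximated by shuffling the
    agent states within the batch.
    """
    joint = estimator(s_surr, s_agent).mean()
    shuffled = s_agent[torch.randperm(s_agent.size(0))]
    scores = estimator(s_surr, shuffled)                       # (N, 1)
    log_mean_exp = torch.logsumexp(scores, dim=0) - math.log(scores.size(0))
    return joint - log_mean_exp.squeeze()
```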
Researcher Affiliation Collaboration Rui Zhao (1,2), Yang Gao (3), Pieter Abbeel (4), Volker Tresp (1,2), Wei Xu (5). Affiliations: 1 Ludwig Maximilian University of Munich; 2 Siemens AG; 3 Tsinghua University; 4 University of California, Berkeley; 5 Horizon Robotics.
Pseudocode Yes
  Algorithm 1: MUSIC
  while not converged do
      Sample an initial state s0 ∼ p(s0).
      for t = 1 to steps-per-episode do
          Sample action at ∼ πθ(at | st).
          Step environment st+1 ∼ p(st+1 | st, at).
          Sample transitions T from the buffer.
          Set intrinsic reward r = Iφ(S^s; S^a | T).
          Update the policy (θ) via DDPG or SAC.
          Update the MI estimator (φ) with SGD.
  Figure 1: MUSIC Algorithm: We update the estimator to better predict the MI, and update the agent to control the surrounding state to have higher MI with the agent state.
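Read as a training loop, Algorithm 1 alternates policy updates (rewarded by the current MI estimate) with MI-estimator updates. The loop below is a hedged sketch of that structure, reusing the `mi_lower_bound` sketch above; `agent`, `buffer`, and `buffer.sample_split` are hypothetical interfaces standing in for the DDPG/SAC machinery in the released code.

```python
import torch

def train_music(env, agent, mi_estimator, mi_optimizer, buffer,
                num_episodes=1000, steps_per_episode=50, reward_scale=5000.0):
    """Hypothetical sketch of Algorithm 1; interfaces are assumptions."""
    for _ in range(num_episodes):
        state = env.reset()                                # s_0 ~ p(s_0)
        for _ in range(steps_per_episode):
            action = agent.act(state)                      # a_t ~ pi_theta(a_t | s_t)
            next_state, _, done, _ = env.step(action)      # s_{t+1} ~ p(. | s_t, a_t)
            buffer.add(state, action, next_state)
            state = next_state

            # Sample transitions T and split them into surrounding/agent states
            # (sample_split is a hypothetical helper).
            s_surr, s_agent, batch = buffer.sample_split(batch_size=256)

            # Intrinsic reward r = I_phi(S^s; S^a | T); a single batch-level
            # estimate is used here for brevity, whereas per-transition
            # estimates would be relabelled into the batch in practice.
            with torch.no_grad():
                reward = reward_scale * mi_lower_bound(mi_estimator, s_surr, s_agent)
            agent.update(batch, rewards=reward)            # DDPG or SAC update of theta

            # Update the MI estimator (phi) with SGD on the same bound.
            loss = -mi_lower_bound(mi_estimator, s_surr, s_agent)
            mi_optimizer.zero_grad()
            loss.backward()
            mi_optimizer.step()

            if done:
                break
```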
Open Source Code Yes Our code is available at https://github.com/ruizhaogit/music and https://github.com/ruizhaogit/alf.
Open Datasets Yes To evaluate the proposed methods, we used the robotic manipulation tasks and a navigation task, see Figure 2 (Brockman et al., 2016; Plappert et al., 2018). The navigation task is based on the Gazebo simulator.
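The manipulation environments cited (Brockman et al., 2016; Plappert et al., 2018) are the OpenAI Gym Fetch robotics tasks, which expose goal-conditioned dict observations. Below is a minimal usage sketch, assuming the classic Gym API and the standard `FetchPush-v1` task ID (the exact task IDs used in the paper are an assumption here); the Gazebo navigation task is not reproduced in this snippet.

```python
import gym

# Instantiate one of the standard Gym Fetch manipulation tasks
# (requires the MuJoCo-based gym robotics environments).
env = gym.make("FetchPush-v1")

obs = env.reset()
# Goal-conditioned observations are dicts with 'observation',
# 'achieved_goal', and 'desired_goal' arrays.
print(obs["observation"].shape, obs["desired_goal"].shape)

action = env.action_space.sample()
next_obs, reward, done, info = env.step(action)
```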
Dataset Splits No The paper describes experimental setups and training procedures but does not explicitly provide a validation dataset split in terms of specific percentages or sample counts, as is common in traditional supervised learning.
Hardware Specification No The paper does not explicitly describe the specific hardware used (e.g., GPU/CPU models, memory details) to run its experiments.
Software Dependencies No The paper mentions software components like the 'Adam optimizer', 'DDPG', 'SAC', 'OpenAI Gym', and the 'Gazebo simulator', but does not provide specific version numbers for any of them.
Experiment Setup Yes We ran all the methods in each environment with 5 different random seeds and report the mean success rate and the standard deviation. The experiments of the robotic manipulation tasks in this paper use the following hyper-parameters:
  Actor and critic networks: 3 layers with 256 units each and ReLU non-linearities
  Adam optimizer (Kingma & Ba, 2014) with learning rate 1 × 10⁻³ for training both actor and critic
  Buffer size: 10⁶ transitions
  Polyak-averaging coefficient: 0.95
  Action L2 norm coefficient: 1.0
  Observation clipping: [−200, 200]
  Batch size: 256
  Rollouts per MPI worker: 2
  Number of MPI workers: 16
  Cycles per epoch: 50
  Batches per cycle: 40
  Test rollouts per epoch: 10
  Probability of random actions: 0.3
  Scale of additive Gaussian noise: 0.2
  Scale of the mutual information reward: 5000
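Collected as a configuration, the reported manipulation hyper-parameters could look like the dict below; the key names are hypothetical and chosen for readability, while the values follow the list above.

```python
# Reported hyper-parameters for the robotic manipulation experiments;
# key names are hypothetical, values follow the list above.
MANIPULATION_HPARAMS = {
    "actor_critic_hidden_layers": [256, 256, 256],  # ReLU non-linearities
    "optimizer": "adam",
    "learning_rate": 1e-3,              # actor and critic
    "buffer_size": int(1e6),            # transitions
    "polyak": 0.95,
    "action_l2_coeff": 1.0,
    "obs_clip_range": (-200.0, 200.0),
    "batch_size": 256,
    "rollouts_per_mpi_worker": 2,
    "num_mpi_workers": 16,
    "cycles_per_epoch": 50,
    "batches_per_cycle": 40,
    "test_rollouts_per_epoch": 10,
    "random_action_prob": 0.3,
    "gaussian_noise_scale": 0.2,
    "mi_reward_scale": 5000.0,
    "num_random_seeds": 5,
}
```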