Provably Efficient Maximum Entropy Exploration
Authors: Elad Hazan, Sham Kakade, Karan Singh, Abby Van Soest
ICML 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "As a proof of concept, we implement the proposed method and demonstrate experiments over several mainstream RL tasks in Section 5." "Section 5. Proof of Concept Experiments. We report the results from a preliminary set of experiments." |
| Researcher Affiliation | Collaboration | 1Department of Computer Science, Princeton University 2Google AI Princeton 3Allen School of Computer Science and Engineering, University of Washington 4Department of Statistics, University of Washington. |
| Pseudocode | Yes | Algorithm 1 Maximum-entropy policy computation. Algorithm 2 Sample-based planning for an unknown MDP. Algorithm 3 Sample-based estimate of the state distribution. |
| Open Source Code | Yes | The open-source implementations may be found at https://github.com/abbyvansoest/maxent. |
| Open Datasets | Yes | Pendulum. The 2-dimensional state space for Pendulum (from (Brockman et al., 2016)) was discretized evenly to a grid of dimension 8 × 8. Ant. The 29-dimensional state space for Ant (with a MuJoCo engine). Humanoid. The 376-dimensional state space for the MuJoCo Humanoid environment. |
| Dataset Splits | No | The paper describes how training was conducted (e.g., 'trained on 200 episodes'), but does not specify explicit train/validation/test dataset splits with percentages or sample counts. |
| Hardware Specification | No | The paper does not specify any particular hardware components such as GPU models, CPU models, or memory specifications used for the experiments. |
| Software Dependencies | No | The paper mentions 'scikit-learn (Pedregosa et al., 2011)' but does not provide a specific version number for scikit-learn or any other software dependency. |
| Experiment Setup | Yes | Reward functional. Each planning agent was trained to maximize a smooth variant of the KL divergence objective. The smoothing parameter was chosen to be σ = |S|⁻¹. Pendulum. The planning oracle is a REINFORCE (Sutton et al., 2000) agent, where the output policy from the previous iteration is used as the initial policy for the next iteration. The policy class is a neural net with a single hidden layer consisting of 128 units. The agent is trained on 200 episodes every epoch. Ant. The planning oracle is a Soft Actor-Critic (Haarnoja et al., 2018) agent. The policy class is a neural net with 2 hidden layers composed of 300 units and the ReLU activation function. The agent is trained for 30 episodes, each of which consists of a roll-out of 5000 steps. The mixed policy is executed over 10 trials of T = 10000 steps at the end of each epoch. |
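
The Pseudocode row lists Algorithm 1 (maximum-entropy policy computation), Algorithm 2 (sample-based planning), and Algorithm 3 (sample-based estimation of the state distribution). As a reading aid, below is a minimal Python sketch of the mixture-policy loop in the spirit of Algorithm 1: the state distribution of the current policy mixture defines an intrinsic reward, a planning oracle returns a new policy against that reward, and the mixture weights are updated in Frank-Wolfe fashion. The oracle names `plan_oracle` and `estimate_distribution`, the step size `eta`, and the reward form `-log(d(s) + sigma)` are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def max_entropy_exploration(env, plan_oracle, estimate_distribution,
                            n_iters=20, eta=0.1, sigma=1e-3):
    """Frank-Wolfe-style mixture loop in the spirit of Algorithm 1.

    plan_oracle(env, reward_fn) -> policy                (e.g. REINFORCE or SAC)
    estimate_distribution(env, policies, weights) -> {state: probability}
    Both oracles, and all default constants, are illustrative placeholders.
    """
    # Start the mixture from an arbitrary policy (zero reward everywhere).
    policies = [plan_oracle(env, lambda s: 0.0)]
    weights = [1.0]

    for _ in range(n_iters):
        # 1. Estimate the state distribution induced by the current mixture.
        d = estimate_distribution(env, policies, weights)

        # 2. The intrinsic reward is the gradient of the smoothed entropy:
        #    rarely visited states receive a larger reward.
        def reward_fn(s, d=d):
            return -np.log(d.get(s, 0.0) + sigma)

        # 3. The planning oracle returns a policy that approximately
        #    maximizes expected reward under reward_fn.
        policies.append(plan_oracle(env, reward_fn))

        # 4. Frank-Wolfe update: shrink the old weights and give weight eta
        #    to the new policy (the weights continue to sum to one).
        weights = [(1.0 - eta) * w for w in weights] + [eta]

    return policies, weights
```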
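
The Open Datasets row notes that Pendulum's 2-dimensional state space was discretized evenly into an 8 × 8 grid. A sample-based estimate of the discretized state distribution, in the spirit of Algorithm 3, could look like the sketch below; it assumes the classic gym `reset`/`step` API (Brockman et al., 2016) and a caller-supplied `sample_action` helper, both of which are stand-ins rather than the released code.

```python
import numpy as np

def discretize(state, low, high, bins=8):
    """Map a continuous state onto an evenly spaced grid cell index."""
    low, high = np.asarray(low, float), np.asarray(high, float)
    ratios = (np.asarray(state, float) - low) / (high - low)
    return tuple(np.clip((ratios * bins).astype(int), 0, bins - 1))

def estimate_state_distribution(env, sample_action, low, high,
                                bins=8, n_rollouts=10, horizon=200):
    """Monte-Carlo estimate of the discretized state distribution induced by a
    (mixture) policy, in the spirit of Algorithm 3.
    Assumes the classic gym API: reset() -> obs, step() -> (obs, r, done, info)."""
    counts, total = {}, 0
    for _ in range(n_rollouts):
        state = env.reset()
        for _ in range(horizon):
            cell = discretize(state, low, high, bins)
            counts[cell] = counts.get(cell, 0) + 1
            total += 1
            state, _, done, _ = env.step(sample_action(state))
            if done:
                break
    return {cell: c / total for cell, c in counts.items()}
```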
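
The Experiment Setup row mentions a smooth variant of the KL divergence objective with smoothing parameter σ, and execution of the mixed policy over 10 trials of T = 10000 steps. The sketch below shows one plausible reading: a smoothed KL-to-uniform functional over the discretized state distribution, and per-trial sampling from the policy mixture. The exact functional form, the per-trial (rather than per-step) mixing, and all helper names are assumptions.

```python
import numpy as np

def smoothed_kl_to_uniform(d, n_states, sigma):
    """Smooth variant of KL(d || uniform) over the discretized state space.
    sigma keeps the log argument bounded away from zero; the exact
    functional used by the authors may differ."""
    return sum(p * (np.log(p + sigma) + np.log(n_states)) for p in d.values())

def run_mixed_policy(env, policies, weights, n_trials=10, horizon=10000, seed=0):
    """Execute the mixed policy: at the start of each trial, draw one policy
    with the mixture weights and roll it out (classic gym API assumed)."""
    rng = np.random.default_rng(seed)
    visited = []
    for _ in range(n_trials):
        pi = policies[rng.choice(len(policies), p=np.asarray(weights))]
        state = env.reset()
        for _ in range(horizon):
            state, _, done, _ = env.step(pi(state))
            visited.append(state)
            if done:
                state = env.reset()
    return visited
```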