CEM: Constrained Entropy Maximization for Task-Agnostic Safe Exploration
Authors: Qisong Yang, Matthijs T.J. Spaan
AAAI 2023 | Conference PDF | Archive PDF
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical Analysis: We evaluate our method based on a wide variety of TASE benchmarks. |
| Researcher Affiliation | Academia | Qisong Yang, Matthijs T. J. Spaan; Delft University of Technology, The Netherlands; {q.yang, m.t.j.spaan}@tudelft.nl |
| Pseudocode | Yes | Algorithm 1: Constrained Entropy Maximization |
| Open Source Code | No | The paper does not contain an explicit statement about releasing the source code for the described methodology, nor does it provide a link to a code repository. |
| Open Datasets | Yes | Finally, we test our method in a set of continuous control, high-dimensional environments from the Safety Gym suite (Todorov, Erez, and Tassa 2012; Ray, Achiam, and Amodei 2019): Point Goal (36D, Figure 1(d)), Car Button (56D, Figure 1(e)). |
| Dataset Splits | No | The paper does not specify exact split percentages or absolute sample counts for training, validation, and test datasets, nor does it reference predefined splits with citations for reproducibility. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types, or memory amounts) used for running the experiments. |
| Software Dependencies | No | The paper mentions 'OpenAI Gym' and the 'Safety Gym suite' but does not provide specific version numbers for these or any other ancillary software components required for replication. |
| Experiment Setup | Yes | We initialize the policy network as a multi-layer perceptron (MLP) with two hidden layers, each containing 256 units with ReLU activations. The value and cost networks are also MLPs with two hidden layers of 256 units. Training runs for 1000 epochs, each containing 2000 environment steps. The learning rate for the policy and value networks is 3e-4, and the learning rate for the Lagrangian multiplier is 5e-4. The discount factor γ is 0.99. The policy is updated 20 times per epoch and the value networks 5 times, with a mini-batch size of 256. The trust region threshold δ is set to 0.01. The number of trajectories is N = 20 for Basic Nav and N = 10 for the other environments, and k = 50 for the k-NN entropy estimator. (Hedged sketches of this configuration and the entropy estimator follow the table.) |
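
The reported setup can be collected into a single configuration object. The following is a minimal sketch, assuming a SAC-style actor-critic with a Lagrangian multiplier for the safety constraint; the `CEMConfig` name and all field names are ours, while the values are those reported in the Experiment Setup row above.

```python
# A hedged sketch of the reported experimental configuration; the dataclass
# and field names are assumptions, the values come from the table above.
from dataclasses import dataclass, field

@dataclass
class CEMConfig:
    # Network architecture: policy, value, and cost networks are MLPs
    # with two hidden layers of 256 units and ReLU activations.
    hidden_sizes: tuple = (256, 256)
    activation: str = "relu"
    # Optimization.
    policy_value_lr: float = 3e-4      # policy and value networks
    lagrangian_lr: float = 5e-4        # Lagrangian multiplier
    gamma: float = 0.99                # discount factor
    batch_size: int = 256              # mini-batch size
    # Schedule: 1000 epochs of 2000 environment steps each.
    epochs: int = 1000
    steps_per_epoch: int = 2000
    policy_updates_per_epoch: int = 20
    value_updates_per_epoch: int = 5
    # Constraint- and entropy-specific settings.
    trust_region_delta: float = 0.01   # trust region threshold
    knn_k: int = 50                    # k for the k-NN entropy estimator
    # Trajectories per epoch: N = 20 for Basic Nav, N = 10 elsewhere.
    num_trajectories: dict = field(
        default_factory=lambda: {"basic_nav": 20, "default": 10})
```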
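
The same row mentions a k-NN estimator with k = 50. Below is a minimal sketch of a particle-based k-nearest-neighbour state-entropy reward of the kind commonly used in state-entropy-maximization methods; the paper's exact form is not reproduced here, so the `knn_entropy_reward` function, its signature, and the log(1 + d) shaping are assumptions.

```python
# A hedged sketch of a particle-based k-NN entropy reward; names and the
# exact reward shaping are assumptions, k = 50 matches the reported setup.
import numpy as np

def knn_entropy_reward(states: np.ndarray, k: int = 50) -> np.ndarray:
    """Per-state intrinsic reward proportional to the log distance to the
    k-th nearest neighbour within the batch (additive constants dropped).

    states: array of shape (N, d), a batch of visited states (requires N > k).
    k:      number of neighbours (k = 50 in the reported setup).
    """
    # Pairwise Euclidean distances between all states in the batch.
    diffs = states[:, None, :] - states[None, :, :]   # (N, N, d)
    dists = np.linalg.norm(diffs, axis=-1)            # (N, N)
    # Distance to the k-th nearest neighbour; index k skips the zero
    # self-distance that sorts to the front of each row.
    knn_dist = np.sort(dists, axis=-1)[:, k]
    # log(1 + d) keeps the reward finite when neighbours coincide.
    return np.log(1.0 + knn_dist)

# Usage: intrinsic_rewards = knn_entropy_reward(batch_of_states, k=50)
```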