CEM: Constrained Entropy Maximization for Task-Agnostic Safe Exploration

Authors: Qisong Yang, Matthijs T.J. Spaan

AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical Analysis: We evaluate our method based on a wide variety of TASE benchmarks.
Researcher Affiliation | Academia | Qisong Yang, Matthijs T. J. Spaan; Delft University of Technology, The Netherlands; {q.yang, m.t.j.spaan}@tudelft.nl
Pseudocode | Yes | Algorithm 1: Constrained Entropy Maximization
Open Source Code | No | The paper does not contain an explicit statement about releasing the source code for the described methodology, nor does it provide a link to a code repository.
Open Datasets | Yes | Finally, we test our method in a set of continuous-control, high-dimensional environments from the Safety Gym suite (Todorov, Erez, and Tassa 2012; Ray, Achiam, and Amodei 2019): Point Goal (36D, Figure 1(d)), Car Button (56D, Figure 1(e)).
Dataset Splits | No | The paper does not specify exact split percentages or absolute sample counts for training, validation, and test datasets, nor does it reference predefined splits with citations for reproducibility.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types, or memory amounts) used for running the experiments.
Software Dependencies | No | The paper mentions 'OpenAI Gym' and the 'Safety Gym suite' but does not provide specific version numbers for these or any other ancillary software components required for replication.
Experiment Setup | Yes | We initialize the policy network as a multi-layer perceptron (MLP) with two hidden layers, each containing 256 units with ReLU activations. The value and cost networks are also MLPs with two hidden layers of 256 units. Training is conducted over 1000 epochs, and each epoch contains 2000 environment steps. The learning rate for the policy and value networks is 3e-4, and the learning rate for the Lagrangian multiplier is 5e-4. The discount factor γ is set to 0.99. The policy is updated 20 times per epoch, and the value networks 5 times. The mini-batch size is 256. We set the trust region threshold δ to 0.01. The number of trajectories is N = 20 for Basic Nav and N = 10 for the other environments. We set k = 50 for the k-NN estimator.
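
To make the Experiment Setup row easier to act on, here is a minimal PyTorch sketch that collects the quoted hyperparameters and builds the described two-hidden-layer MLPs. The names (`HYPERPARAMS`, `build_mlp`) and the 2-dimensional action space in the example are illustrative assumptions, not code or details taken from the paper.

```python
# Minimal sketch of the reported setup; names and shapes flagged below are assumptions.
import torch.nn as nn

HYPERPARAMS = dict(
    epochs=1000,                  # training epochs
    steps_per_epoch=2000,         # environment steps per epoch
    policy_value_lr=3e-4,         # learning rate for policy and value networks
    lagrangian_lr=5e-4,           # learning rate for the Lagrangian multiplier
    gamma=0.99,                   # discount factor
    policy_updates_per_epoch=20,
    value_updates_per_epoch=5,
    minibatch_size=256,
    trust_region_delta=0.01,      # trust region threshold delta
    knn_k=50,                     # k for the k-NN entropy estimator
)

def build_mlp(in_dim: int, out_dim: int, hidden: int = 256) -> nn.Sequential:
    """Two hidden layers of 256 ReLU units, as reported in the paper."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

# Example shapes for Point Goal: the 36-D observation space is reported in the paper;
# the 2-D action space here is an assumption for illustration only.
obs_dim, act_dim = 36, 2
policy_net = build_mlp(obs_dim, act_dim)  # policy head (action-distribution details omitted)
value_net = build_mlp(obs_dim, 1)         # reward value network
cost_net = build_mlp(obs_dim, 1)          # cost value network
```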
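The Pseudocode row names "Algorithm 1: Constrained Entropy Maximization" but the table does not reproduce it. As a rough illustration of the two ingredients the table does mention, a k-NN state-entropy estimate (k = 50) and a Lagrangian multiplier with learning rate 5e-4, the sketch below shows only the generic recipe; it is not a reimplementation of the authors' Algorithm 1, and the function names, the `cost_limit` parameter, and the schematic objective in the trailing comment are assumptions.

```python
# Generic constrained-entropy-maximization ingredients; not the paper's Algorithm 1.
import torch

def knn_entropy_reward(states: torch.Tensor, k: int = 50) -> torch.Tensor:
    """Particle-based entropy estimate: reward each state by the log distance to its
    k-th nearest neighbor in the batch (assumes batch size > k)."""
    dists = torch.cdist(states, states)                    # pairwise Euclidean distances
    kth = dists.topk(k + 1, largest=False).values[:, -1]   # k-th neighbor, skipping self
    return torch.log(kth + 1.0)                            # +1 keeps the reward bounded below

def update_lagrangian(lmbda: torch.Tensor, episode_cost: float,
                      cost_limit: float, lr: float = 5e-4) -> torch.Tensor:
    """Dual ascent on the multiplier: increase it when the cost constraint is violated."""
    return torch.clamp(lmbda + lr * (episode_cost - cost_limit), min=0.0)

# Schematic use inside a training loop (assumption, shown for orientation only):
#   r_ent = knn_entropy_reward(batch_states, k=50)
#   actor objective ~ maximize E[r_ent] - lmbda * E[cost]
#   lmbda = update_lagrangian(lmbda, measured_episode_cost, cost_limit)
```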
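For the environments in the Open Datasets row, the standard way to obtain them is through OpenAI's safety-gym package, which registers its tasks on import. The snippet below is a sketch under the assumption that the paper uses the stock level-1 variants ('Safexp-PointGoal1-v0', 'Safexp-CarButton1-v0'); the quoted text only names "Point Goal" and "Car Button" and their observation dimensionalities.

```python
# Sketch: instantiating the named Safety Gym tasks (requires safety-gym and MuJoCo).
import gym
import safety_gym  # noqa: F401 -- importing registers the Safexp-* environments

point_goal = gym.make('Safexp-PointGoal1-v0')  # "Point Goal" task
car_button = gym.make('Safexp-CarButton1-v0')  # "Car Button" task

obs = point_goal.reset()
obs, reward, done, info = point_goal.step(point_goal.action_space.sample())
cost = info.get('cost', 0.0)  # Safety Gym exposes the per-step safety cost via info['cost']
```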