CEM: Constrained Entropy Maximization for Task-Agnostic Safe Exploration

Authors: Qisong Yang, Matthijs T.J. Spaan

AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical Analysis: We evaluate our method based on a wide variety of TASE benchmarks.
Researcher Affiliation | Academia | Qisong Yang, Matthijs T. J. Spaan; Delft University of Technology, The Netherlands; {q.yang, m.t.j.spaan}@tudelft.nl
Pseudocode | Yes | Algorithm 1: Constrained Entropy Maximization
Open Source Code | No | The paper does not contain an explicit statement about releasing the source code for the described methodology, nor does it provide a link to a code repository.
Open Datasets | Yes | Finally, we test our method in a set of continuous-control, high-dimensional environments from the Safety Gym suite (Todorov, Erez, and Tassa 2012; Ray, Achiam, and Amodei 2019): Point Goal (36D, Figure 1(d)), Car Button (56D, Figure 1(e)).
Dataset Splits | No | The paper does not specify exact split percentages or absolute sample counts for training, validation, and test datasets, nor does it reference predefined splits with citations for reproducibility.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types, or memory amounts) used for running the experiments.
Software Dependencies | No | The paper mentions 'OpenAI Gym' and the 'Safety Gym suite' but does not provide specific version numbers for these or any other ancillary software components required for replication.
Experiment Setup | Yes | We initialize the policy network as a multi-layer perceptron (MLP) with two hidden layers, each containing 256 units with ReLU activations. The value and cost networks are also MLPs with two hidden layers of 256 units. Training is conducted over 1000 epochs, and each epoch contains 2000 environment steps. The learning rate for the policy and value networks is 3e-4, and the learning rate for the Lagrangian multiplier is 5e-4. The discount factor γ is set to 0.99. The policy is updated 20 times per epoch, and the value networks 5 times. The mini-batch size is 256. We set the trust region threshold δ to 0.01. The number of trajectories is N = 20 for Basic Nav and N = 10 for the other environments. We set k = 50 for the k-NN estimator.
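
To make the Experiment Setup row easier to act on, here is a minimal PyTorch sketch that collects the quoted hyperparameters and builds the described two-hidden-layer MLPs. The names (`HYPERPARAMS`, `build_mlp`) and the 2-dimensional action space in the example are illustrative assumptions, not code or details taken from the paper.

```python
# Minimal sketch of the reported setup; names and shapes flagged below are assumptions.
import torch.nn as nn

HYPERPARAMS = dict(
    epochs=1000,                  # training epochs
    steps_per_epoch=2000,         # environment steps per epoch
    policy_value_lr=3e-4,         # learning rate for policy and value networks
    lagrangian_lr=5e-4,           # learning rate for the Lagrangian multiplier
    gamma=0.99,                   # discount factor
    policy_updates_per_epoch=20,
    value_updates_per_epoch=5,
    minibatch_size=256,
    trust_region_delta=0.01,      # trust region threshold delta
    knn_k=50,                     # k for the k-NN entropy estimator
)

def build_mlp(in_dim: int, out_dim: int, hidden: int = 256) -> nn.Sequential:
    """Two hidden layers of 256 ReLU units, as reported in the paper."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

# Example shapes for Point Goal: the 36-D observation space is reported in the paper;
# the 2-D action space here is an assumption for illustration only.
obs_dim, act_dim = 36, 2
policy_net = build_mlp(obs_dim, act_dim)  # policy head (action-distribution details omitted)
value_net = build_mlp(obs_dim, 1)         # reward value network
cost_net = build_mlp(obs_dim, 1)          # cost value network
```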
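The Pseudocode row names "Algorithm 1: Constrained Entropy Maximization" but the table does not reproduce it. As a rough illustration of the two ingredients the table does mention, a k-NN state-entropy estimate (k = 50) and a Lagrangian multiplier with learning rate 5e-4, the sketch below shows only the generic recipe; it is not a reimplementation of the authors' Algorithm 1, and the function names, the `cost_limit` parameter, and the schematic objective in the trailing comment are assumptions.

```python
# Generic constrained-entropy-maximization ingredients; not the paper's Algorithm 1.
import torch

def knn_entropy_reward(states: torch.Tensor, k: int = 50) -> torch.Tensor:
    """Particle-based entropy estimate: reward each state by the log distance to its
    k-th nearest neighbor in the batch (assumes batch size > k)."""
    dists = torch.cdist(states, states)                    # pairwise Euclidean distances
    kth = dists.topk(k + 1, largest=False).values[:, -1]   # k-th neighbor, skipping self
    return torch.log(kth + 1.0)                            # +1 keeps the reward bounded below

def update_lagrangian(lmbda: torch.Tensor, episode_cost: float,
                      cost_limit: float, lr: float = 5e-4) -> torch.Tensor:
    """Dual ascent on the multiplier: increase it when the cost constraint is violated."""
    return torch.clamp(lmbda + lr * (episode_cost - cost_limit), min=0.0)

# Schematic use inside a training loop (assumption, shown for orientation only):
#   r_ent = knn_entropy_reward(batch_states, k=50)
#   actor objective ~ maximize E[r_ent] - lmbda * E[cost]
#   lmbda = update_lagrangian(lmbda, measured_episode_cost, cost_limit)
```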
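For the environments in the Open Datasets row, the standard way to obtain them is through OpenAI's safety-gym package, which registers its tasks on import. The snippet below is a sketch under the assumption that the paper uses the stock level-1 variants ('Safexp-PointGoal1-v0', 'Safexp-CarButton1-v0'); the quoted text only names "Point Goal" and "Car Button" and their observation dimensionalities.

```python
# Sketch: instantiating the named Safety Gym tasks (requires safety-gym and MuJoCo).
import gym
import safety_gym  # noqa: F401 -- importing registers the Safexp-* environments

point_goal = gym.make('Safexp-PointGoal1-v0')  # "Point Goal" task
car_button = gym.make('Safexp-CarButton1-v0')  # "Car Button" task

obs = point_goal.reset()
obs, reward, done, info = point_goal.step(point_goal.action_space.sample())
cost = info.get('cost', 0.0)  # Safety Gym exposes the per-step safety cost via info['cost']
```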