Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

State Entropy Regularization for Robust Reinforcement Learning

Authors: Yonatan Ashlag, Uri Koren, Mirco Mutti, Esther Derman, Pierre-Luc Bacon, Shie Mannor

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In Section 5, we empirically evaluate the robustness properties of state entropy regularization across discrete and continuous control tasks. We show that it improves performance under spatially correlated perturbations, specifically obstacle placement, while not degrading performance under smaller, more uniform perturbations. We also demonstrate how the robustness benefits of state entropy depend on the rollout budget, with diminished gains in low-sample regimes.
Researcher Affiliation | Academia | Yonatan Ashlag (Technion); Uri Koren (Technion); Mirco Mutti (Technion); Esther Derman (MILA Institute); Pierre-Luc Bacon (MILA Institute); Shie Mannor (Technion, NVIDIA Research)
Pseudocode | No | The paper does not contain a clearly labeled 'Pseudocode' or 'Algorithm' block. The methodology is described in prose and mathematical formulations.
Open Source Code | Yes | Answer: [Yes]. Justification: We will share a link to a GitHub with the code used to run the experiments in the supplementary material.
Open Datasets | No | The paper mentions using 'MiniGrid' [7] and 'MuJoCo' [57] as environments for experiments. These are widely recognized simulation environments or platforms, not datasets in the traditional sense. While their underlying assets might be accessible, the paper does not specify a particular dataset that would require explicit access information or a citation for the data itself.
Dataset Splits | No | The paper describes experimental setups for environments like MiniGrid and MuJoCo and mentions evaluation over a fixed number of episodes (e.g., "evaluate each policy for 200 episodes"), but it does not provide training/validation/test splits with percentages or sample counts. These are simulation environments in which data is generated dynamically during training and evaluation, not pre-split datasets.
Hardware Specification | Yes | All experiments were conducted on a machine equipped with an NVIDIA RTX 4090 GPU.
Software Dependencies | No | The paper mentions using 'A2C' [36] for MiniGrid and 'PPO' [49] for MuJoCo, and states that the code for Pusher is based on the 'CleanRL codebase' [20]. However, it does not specify version numbers for these software components or for other key dependencies such as Python, PyTorch, or TensorFlow, which would be necessary for full reproducibility.
Experiment Setup | Yes | For all the types, we select the largest regularization coefficients that degrade nominal performance by less than 5%. To ensure a fair comparison, the methods are trained with the same base algorithm: A2C [36] for MiniGrid and PPO [49] for MuJoCo. To implement state entropy regularization, we largely follow [50] by using a k-nearest neighbor (k-NN) entropy estimator [52]. As state entropy regularization is incorporated as an intrinsic reward, it can be paired with any RL algorithm. Full implementation details are in Appx. B.1.
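
The Experiment Setup entry mentions two mechanisms: a k-NN entropy estimator used as an intrinsic reward, and a rule that picks the largest regularization coefficient costing less than 5% of nominal performance. The sketch below illustrates both in NumPy. It is not the authors' code. The function names, the log(1 + distance) form of the bonus, the Euclidean metric, and the treatment of the 5% threshold relative to the absolute baseline return are all assumptions made for illustration; the paper's Appx. B.1 is the authority on the actual implementation.

```python
import numpy as np


def knn_state_entropy_bonus(states, k=5):
    """Hypothetical helper: per-state intrinsic reward from a k-NN estimate.

    Assumes the common simplified form log(1 + d_k), where d_k is the
    Euclidean distance from a state to its k-th nearest neighbour in the
    same batch. States in sparsely visited regions get a larger bonus.
    """
    states = np.asarray(states, dtype=np.float64)
    n = states.shape[0]
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of states")
    # Pairwise Euclidean distances, shape (n, n). Fine for small batches.
    diffs = states[:, None, :] - states[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    # After sorting each row, column 0 is the zero self-distance,
    # so column k holds the distance to the k-th nearest neighbour.
    kth = np.sort(dists, axis=1)[:, k]
    return np.log1p(kth)


def largest_admissible_coefficient(nominal_returns, baseline_return, max_drop=0.05):
    """Hypothetical helper: largest coefficient with a drop below max_drop.

    nominal_returns maps each candidate coefficient to the mean nominal
    return obtained with it. The allowed drop is measured relative to
    |baseline_return| so that negative returns are handled (an assumption).
    """
    threshold = baseline_return - max_drop * abs(baseline_return)
    admissible = [c for c, r in nominal_returns.items() if r > threshold]
    return max(admissible) if admissible else None


# Usage: add the bonus to the extrinsic reward with coefficient beta.
batch = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0]])
extrinsic = np.zeros(len(batch))
beta = largest_admissible_coefficient(
    {0.01: 98.0, 0.1: 96.0, 1.0: 90.0}, baseline_return=100.0
)
shaped = extrinsic + beta * knn_state_entropy_bonus(batch, k=1)
```

Because the bonus only modifies the reward signal, the same shaped rewards could be passed to A2C, PPO, or another RL algorithm, which matches the entry's remark that the regularizer can be paired with any base method.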