Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
On Minimizing Adversarial Counterfactual Error in Adversarial Reinforcement Learning
Authors: Roman Belaire, Arunesh Sinha, Pradeep Varakantham
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical evaluations on standard benchmarks (MuJoCo, Atari, and Highway) demonstrate that our method significantly outperforms current state-of-the-art approaches for addressing adversarial RL challenges, offering a promising direction for improving robustness in DRL under adversarial conditions. |
| Researcher Affiliation | Academia | Roman Belaire (Singapore Management University, Singapore); Arunesh Sinha (Rutgers University, New Brunswick, NJ); Pradeep Varakantham (Singapore Management University, Singapore) |
| Pseudocode | Yes | Algorithm 1: δ-PPO |
| Open Source Code | Yes | Our code is available at https://github.com/romanbelaire/acoe-robust-rl. |
| Open Datasets | Yes | Our empirical evaluations on standard benchmarks (MuJoCo, Atari, and Highway) demonstrate that our method significantly outperforms current state-of-the-art approaches |
| Dataset Splits | No | We report the mean result over 5 policies initialized with random seeds, with 50 test episodes each. |
| Hardware Specification | Yes | We train our linear models on an NVIDIA Tesla V100 with 16 GB of memory, and LSTM models on an NVIDIA L40 32 GB GPU. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers. It mentions algorithms and frameworks like PPO, DQN, Adam, and LSTM, but not their specific software implementations or versions. |
| Experiment Setup | Yes | We train our methods for 900 episodes for all MuJoCo environments, using an annealed (Adam) learning rate of 0.005. The robustness hyperparameter λ is set to 0.2 for all of our models, which is the same as the robustness hyperparameters found in prior works Oikarinen et al. (2021); Liang et al. (2022); Belaire et al. (2024); Zhang et al. (2020). The attack neighborhood sample size is set to 10, and the training attack neighborhood radius is set to ϵ = 0.1, both tuned from sets in the range 100%. |
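The experiment setup above can be collected into a small configuration sketch. This is a hypothetical illustration, not the authors' code: the config field names and the linear annealing schedule for the Adam learning rate are assumptions (the paper states the rate is annealed but this report does not quote the exact schedule).

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    """Hyperparameters quoted in the Experiment Setup row (field names are assumed)."""
    episodes: int = 900            # MuJoCo training episodes
    lr_init: float = 0.005         # initial Adam learning rate, annealed during training
    robustness_lambda: float = 0.2 # robustness hyperparameter λ (matches prior works)
    attack_samples: int = 10       # attack neighborhood sample size
    attack_eps: float = 0.1        # training attack neighborhood radius ε

def annealed_lr(cfg: TrainConfig, episode: int) -> float:
    """Linearly anneal the learning rate to zero over training (assumed schedule)."""
    frac = 1.0 - episode / cfg.episodes
    return cfg.lr_init * max(frac, 0.0)
```

For example, under this assumed linear schedule the rate starts at 0.005, halves by episode 450, and reaches zero at episode 900.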