Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Multi-Objective Reinforcement Learning with Max-Min Criterion: A Game-Theoretic Approach

Authors: Woohyeon Byeon, Giseung Park, Jongseong Chae, Amir Leshem, Youngchul Sung

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Our experiments demonstrate the convergence behavior of the proposed algorithm in tabular settings, and our implementation for deep reinforcement learning significantly outperforms previous baselines in many MORL environments.
Researcher Affiliation Academia (1) School of Electrical Engineering, KAIST, Republic of Korea; (2) University of Toronto Robotics Institute, Canada; (3) Faculty of Engineering, Bar-Ilan University, Israel. *Correspondence to: Youngchul Sung <EMAIL>.
Pseudocode Yes Summarizing the above, the pseudo-codes of the proposed algorithms are in Appendix E. Our source code is provided at https://github.com/whbyeon/ERAM-ARAM. Appendix E (Pseudo Codes): Algorithm 1: Adversary with regularizer for max-min MORL employing exact policy evaluation; Algorithm 2: ERAM employing approximate policy evaluation; Algorithm 3: ARAM with PPO for the Learner Update.
Open Source Code Yes Summarizing the above, the pseudo-codes of the proposed algorithms are in Appendix E. Our source code is provided at https://github.com/whbyeon/ERAM-ARAM.
Open Datasets Yes To evaluate the effectiveness of our algorithm in real-world multi-objective problems, we conducted experiments in the traffic signal control simulation environment [4]. At a four-road intersection, the agent controlled traffic signals based on traffic state information and received a reward vector composed of the negative total waiting times. Rewards were defined either per road (4 objectives) or per lane (16 objectives). More experimental results on other environments such as the species conservation environment [51], MO-Reacher environment [20], and Four-Room environment [20] are provided in Appendix N, showing the superior max-min performance of our algorithms.
Dataset Splits Yes The simulation includes 10,000 vehicles and is trained for 100,000 time steps. The scenario uses 4,000 vehicles and is also trained for 100,000 time steps. This scenario includes 4,000 vehicles and is trained for 200,000 time steps. We evaluated the max-min performance as the minimum of the empirical return vector, i.e., $\min_k \hat{R}_k$ with $\hat{R}_k = \frac{1}{N} \sum_{i=1}^{N} \sum_t \gamma^t r_k^i(s_t, a_t)$, averaged over N = 32 episodes and five random seeds.
Hardware Specification Yes All experiments were conducted independently on the same hardware to ensure a fair comparison. See Table 2 for a full summary. In addition, all experiments were conducted on a machine equipped with two Intel Xeon Gold 6238R CPUs.
Software Dependencies No We use the default network architecture and optimizer settings provided by Stable-Baselines3 [43]. For entries with multiple values, the best-performing one was selected based on validation performance.
Experiment Setup Yes The PPO hyperparameters are listed in the table below. We use the default network architecture and optimizer settings provided by Stable-Baselines3 [43]. For entries with multiple values, the best-performing one was selected based on validation performance. Table 6: PPO hyperparameters for ERAM and ARAM. Table 7: Selected hyperparameters for ERAM and ARAM in each traffic scenario.
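
The max-min evaluation metric quoted in the Dataset Splits row (the minimum over objectives of the episode-averaged discounted return) can be illustrated with a minimal sketch. This is not taken from the authors' repository: the function names `discounted_return` and `max_min_score`, the discount value, and the toy reward data are all assumptions made for illustration, and the outer average over random seeds is left out.

```python
from typing import List, Sequence

# An episode is a sequence of per-step reward vectors, one entry per objective.
Episode = Sequence[Sequence[float]]


def discounted_return(episode: Episode, gamma: float) -> List[float]:
    """Per-objective discounted return for one episode: sum_t gamma^t * r_k(s_t, a_t)."""
    num_objectives = len(episode[0])
    totals = [0.0] * num_objectives
    discount = 1.0
    for reward_vector in episode:
        for k, reward in enumerate(reward_vector):
            totals[k] += discount * reward
        discount *= gamma
    return totals


def max_min_score(episodes: Sequence[Episode], gamma: float) -> float:
    """Minimum over objectives of the discounted return averaged over episodes."""
    per_episode = [discounted_return(ep, gamma) for ep in episodes]
    num_objectives = len(per_episode[0])
    mean_returns = [
        sum(ret[k] for ret in per_episode) / len(per_episode)
        for k in range(num_objectives)
    ]
    return min(mean_returns)


# Hypothetical toy data: 2 episodes, 2 steps each, 2 objectives
# (e.g. negative waiting times on two roads).
toy_episodes = [
    [[-1.0, -2.0], [-1.0, -2.0]],
    [[-3.0, -2.0], [-1.0, -1.0]],
]
score = max_min_score(toy_episodes, gamma=0.5)
print(score)
```

In the reported setup, N = 32 episodes would be passed per seed, and the resulting scores would then be averaged over the five seeds.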