Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

LOPT: Learning Optimal Pigovian Tax in Sequential Social Dilemmas

Authors: Yun Hua, Shang Gao, Wenhao Li, Haosheng Chen, Bo Jin, Xiangfeng Wang, Jun Luo, Hongyuan Zha

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We support LOPT with theoretical analysis and validate it on standard MARL benchmarks, including Escape Room and Cleanup. Results show that by effectively internalizing externalities that quantify social dilemmas, LOPT aligns individual objectives with collective goals, significantly improving social welfare over state-of-the-art baselines. Experiments in the Escape Room and challenging Cleanup environments demonstrate the effectiveness of the proposed mechanism in alleviating social dilemmas in MARL.
Researcher Affiliation | Academia | 1 Antai College of Economics and Management, Shanghai Jiao Tong University; 2 School of Computer Science and Technology, East China Normal University; 3 School of Computer Science and Technology, Tongji University; 4 Key Laboratory of Mathematics and Engineering Applications (MoE); 5 Shanghai Institute of AI for Education, East China Normal University; 6 Shenzhen Loop Area Institute (SLAI); 7 School of Data Science, Chinese University of Hong Kong (Shenzhen)
Pseudocode | Yes | The typical learning process of LOPT is outlined in Algorithm 1 (Appendix), and its performance is demonstrated through experiments in the Escape Room and Cleanup environments.
Open Source Code | No | Due to limited time, the code has not been sorted out yet.
Open Datasets | Yes | We conduct experiments on both the ESCAPE ROOM [50] and the CLEANUP [15] environments, the details are summarized as follows:
Dataset Splits | No | For specific environments, we implemented various method combinations. In Escape Room, we compared LIO, LIO-dec, and Policy Gradient variants with discrete and continuous reward-giving actions (PG-d/c). The Cleanup (N = 2) evaluation included LIO, IA, MOA, SCM, and Actor-Critic variants (AC-d/c), while the more complex Cleanup (N = 5) scenario focused on MOA and SCM.
Hardware Specification | Yes | Training is conducted on a virtual machine hosted on a GPU server equipped with four NVIDIA GTX 2080 Ti GPUs, a 24-core CPU, and 32 GB of DRAM.
Software Dependencies | No | The paper does not provide specific software versions for key libraries or frameworks used in the implementation.
Experiment Setup | Yes | The settings of hyperparameters for baselines follow their previous work [15, 17, 50, 14]. For all experiments, the tuned hyperparameters of all baselines and LOPT are given in Tables 2-4 in Appendix D.2, where: α is the learning rate; α_schedule is a list that contains the step and weight pairs for the learning rate scheduler; η is the weight for the entropy f(πp); ϵ in [50] decays linearly from ϵ_start to ϵ_end by ϵ_div episodes; β is the coefficient for the entropy of the policy.
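The Experiment Setup entry describes ϵ as decaying linearly from a start value to an end value over a fixed number of episodes. A minimal sketch of one common reading of that schedule follows. The function name, parameter names, and the choice to hold ϵ at its end value after the decay window are assumptions made for illustration; they are not taken from the paper or from the code of [50].

```python
def linear_epsilon(episode: int, eps_start: float, eps_end: float, eps_div: int) -> float:
    """Linearly interpolate epsilon from eps_start to eps_end over eps_div episodes.

    Hypothetical helper: holding epsilon at eps_end once `episode` reaches
    eps_div is an assumption, not something the source states.
    """
    if eps_div <= 0:
        raise ValueError("eps_div must be a positive number of episodes")
    # Clamp the episode index into [0, eps_div] so the value never overshoots.
    progress = min(max(episode, 0), eps_div) / eps_div
    return eps_start + progress * (eps_end - eps_start)


if __name__ == "__main__":
    # Illustrative values only; the tuned values are in the paper's appendix tables.
    for ep in (0, 5, 10, 20):
        print(ep, linear_epsilon(ep, eps_start=1.0, eps_end=0.5, eps_div=10))
```

Writing the schedule as a pure function of the episode index, rather than as a running decrement, avoids accumulated floating-point drift and makes the value at any episode easy to check.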