Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Social World Model-Augmented Mechanism Design Policy Learning
Authors: Xiaoyuan Zhang, Yizhe Huang, Chengdong Ma, Zhixun Chen, Long Ma, Yali Du, Song-Chun Zhu, Yaodong Yang, Xue Feng
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments in diverse settings (tax policy design, team coordination, and facility location) demonstrate that SWM-AP outperforms established model-based and model-free RL baselines in cumulative rewards and sample efficiency. 4 Experiments 4.1 Facility Location We designed a facility location game to examine the effectiveness of the methodology... 4.2 Team Structure Optimization Team structure optimization... 4.3 Tax Adjustment In this experimental setup... Performance Analysis: We evaluate our proposed SWM-AP method against several baselines... Figure 3: Facility location performance analysis. (a) Comparison of sample efficiency and final converged performance. (b) State prediction loss and (c) Reward prediction loss curves for our SWM compared to baselines. |
| Researcher Affiliation | Academia | Xiaoyuan Zhang (1, 2, 3); Yizhe Huang (1, 2); Chengdong Ma (1, 3); Zhixun Chen (4); Long Ma (5); Yali Du (6); Song-Chun Zhu (2, 1, 3); Yaodong Yang (1, 3); Xue Feng (2). Affiliations: (1) Institute for Artificial Intelligence, Peking University; (2) State Key Laboratory of General Artificial Intelligence, BIGAI, Beijing, China; (3) State Key Laboratory of General Artificial Intelligence, Peking University, Beijing, China; (4) The Hong Kong University of Science and Technology (Guangzhou); (5) Center for Data Science, Academy for Advanced Interdisciplinary Studies, Peking University; (6) King's College London |
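| Pseudocode | Yes | Algorithm 1 SWM-AP Learning framework 1: Initialize: Mechanism Design Policy $\Pi_\theta(\pi_t \mid s^{\mathrm{obs}}_t, \hat{m}_{\mathrm{prior},t})$, Dynamic Model $M_\phi(\hat{s}^{\mathrm{obs}}_{t+1} \mid s^{\mathrm{obs}}_t, \pi_t, \hat{m}_{\mathrm{post}})$, Posterior Trait Tracker $q_\varphi(\hat{m}_{\mathrm{post}} \mid \tau)$, Prior Trait Tracker $p_\xi(\hat{m}_{\mathrm{prior},t} \mid H_t)$ where $H_t = (s^{\mathrm{obs}}_t, \pi_{<t}, a_{<t})$, Environment, Model Datasets $D_{\mathrm{env}}$, $D_{\mathrm{model}}$ 2: for $N_{\mathrm{Epochs}}$ do 3: Collect real trajectories $\tau = (s^{\mathrm{obs}}_0, \pi_0, r^{\mathrm{soc}}_0, \ldots, s^{\mathrm{obs}}_T, \pi_T, r^{\mathrm{soc}}_T)$ in Environment using policy $\Pi_\theta$ and Prior Trait Tracker $p_\xi$. Store in $D_{\mathrm{env}}$. 4: Jointly train Posterior Trait Tracker $q_\varphi$ and Dynamic Model $M_\phi$ on dataset $D_{\mathrm{env}}$, using objective based on Equation 3, implicitly training $q_\varphi$ to produce $\hat{m}_{\mathrm{post}}$. 5: Train Prior Trait Tracker $p_\xi$ on dataset $D_{\mathrm{env}}$, using objective based on Equation 4 to align $p_\xi(\cdot \mid H_t)$ with $q_\varphi(\cdot \mid \tau)$. 6: Generate imagined trajectories $\hat{\tau} = (\hat{s}^{\mathrm{obs}}_0, \pi_0, \hat{r}^{\mathrm{soc}}_0, \ldots)$ using Dynamic Model $M_\phi$, policy $\Pi_\theta$, and Posterior Trait Tracker $p_\xi$. Store in $D_{\mathrm{model}}$. 7: Optimize policy $\Pi_\theta$ using data from $D_{\mathrm{env}}$ and $D_{\mathrm{model}}$, maximizing objective Equation 2 using PPO on combined data. 8: end for 9: Return: Policy $\Pi_\theta$, SWM $p_\psi$ |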
| Open Source Code | No | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: We are currently organizing and refining our source code and experimental environments. We anticipate making them available at a later stage once the organization process is complete. |
| Open Datasets | Yes | Environment Setting: The environment consists of an 8 × 8 grid with four types of basic resources... We conduct the experiments in Ada Society [13]... In AI-Economist [39]... |
| Dataset Splits | No | The paper describes environment settings for Facility Location, Ada Society, and AI-Economist, including aspects like episode length and agent count. Reinforcement learning experiments typically involve training policies by interacting with an environment and collecting trajectories, rather than using predefined train/test/validation splits from a static dataset. The paper mentions collecting "real trajectories" and generating "imagined trajectories" which are typical for RL, but does not provide specific train/test/validation dataset splits in the conventional sense. |
| Hardware Specification | Yes | Table 3: Key Environment and Algorithm Configurations. General Training & Compute, GPUs Used: NVIDIA RTX 3090 (1 per run) for Facility Location; NVIDIA RTX 3090 (1 per run) for Team Structure Optimization; NVIDIA A100 (1 per run) for Tax Adjustment. |
| Software Dependencies | No | The paper mentions optimizers like Adam and algorithms such as PPO, Dreamer, and MBPO. However, it does not specify any version numbers for these software components, libraries, or programming languages used in the implementation. |
| Experiment Setup | Yes | Table 3: Key Environment and Algorithm Configurations. Values are listed in the order Facility Location / Team Structure Optimization / Tax Adjustment. Environment Specifics. Env. Source: Matrix / Ada Society / AI Economist; Agent Count: 8 / 4 / 4; Latent Trait Count: 256 / 4 / 4; Mechanism Action: select a point from Map(8*8) / assign a team structure among 14 different types / set a tax rate for each of the 7 tax brackets; Episode Length: 5 / 50 / 1000 (for agents), 10 (for planner). SWM-AP: Social World Model (SWM). Latent Inference Arch.: MLP (L:2, H:512) / MLP (L:2, H:512) + LSTM (L:1, H:512) + MLP (L:2, H:512) / MLP (L:2, H:512) + LSTM (L:1, H:512) + MLP (L:2, H:512); Dynamics Predict. Arch.: MLP (L:3, H:256) / GCN (L:3, H:[64, 128]) + MLP (L:2, H:128) / GCN (L:3, H:[64, 128]) + MLP (L:2, H:128); SWM Optimizer & LR: Adam, $10^{-3}$ / Adam, $10^{-3}$ / Adam, $10^{-3}$. SWM-AP: Mechanism Design Policy (PPO based). Policy/Value Arch.: MLP (L:2, H:128) / GCN (L:3, H:[64, 64]) + MLP (L:2, H:256) / MLP (L:2, H:256) + LSTM (L:1, H:256) + MLP (L:1, H:256); Optimizer & LR: Adam, $2.5 \times 10^{-4}$ / Adam, $5 \times 10^{-4}$ / Adam, $1 \times 10^{-4}$; Discount (γ): 0.99 / 0.99 / 0.99; Imagined Rollout (SWM): 5 steps / 50 steps / 1000 steps. Baselines: PPO. Policy/Value Arch.: MLP (L:2, H:128) / GCN (L:3, H:[64, 64]) + MLP (L:2, H:256) / MLP (L:2, H:256) + LSTM (L:1, H:256) + MLP (L:1, H:256); Optimizer & LR: Adam, $2.5 \times 10^{-4}$ / Adam, $5 \times 10^{-4}$ / Adam, $1 \times 10^{-4}$; Discount (γ): 0.99 / 0.99 / 0.99. Baselines: MBPO. Policy/Value Arch.: MLP (L:2, H:128) / GCN (L:3, H:[64, 64]) + MLP (L:2, H:256) / GCN (L:3, H:[64, 64]) + MLP (L:2, H:256); Optimizer & LR: Adam, $2.5 \times 10^{-4}$ / Adam, $5 \times 10^{-4}$ / Adam, $5 \times 10^{-4}$; Discount (γ): 0.99 / 0.99 / 0.99. General Training & Compute. Total Timesteps: $10^{6}$ / $1 \times 10^{8}$ / $5 \times 10^{8}$; Num. Random Seeds: 3 / 3 / 3; Error Bars: SEM over 3 runs / SEM over 3 runs / SEM over 3 runs. |
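
The control flow of Algorithm 1, as quoted in the Pseudocode row, can be sketched in Python as follows. This is a minimal sketch and not the authors' implementation, which the paper states has not yet been released. The function `swm_ap_training_loop`, its callable parameters, and the `make_stub` helper are hypothetical names introduced only for illustration; the actual models, losses (Equations 2 to 4), and PPO update are left as caller-supplied functions.

```python
from typing import Any, Callable, List, Tuple


def swm_ap_training_loop(
    n_epochs: int,
    collect_real: Callable[[], List[Any]],
    fit_posterior_and_dynamics: Callable[[List[Any]], None],
    fit_prior: Callable[[List[Any]], None],
    imagine: Callable[[], List[Any]],
    ppo_update: Callable[[List[Any], List[Any]], None],
) -> Tuple[List[Any], List[Any]]:
    """Run the five per-epoch steps of Algorithm 1 in order.

    All callables are hypothetical placeholders supplied by the caller.
    """
    d_env: List[Any] = []    # real trajectories (D_env)
    d_model: List[Any] = []  # imagined trajectories (D_model)
    for _ in range(n_epochs):
        d_env.extend(collect_real())       # step 3: collect real trajectories
        fit_posterior_and_dynamics(d_env)  # step 4: joint training (Eq. 3)
        fit_prior(d_env)                   # step 5: align prior tracker (Eq. 4)
        d_model.extend(imagine())          # step 6: imagined rollouts
        ppo_update(d_env, d_model)         # step 7: PPO on combined data (Eq. 2)
    return d_env, d_model


def make_stub(name: str, log: List[str], result: Any = None) -> Callable[..., Any]:
    """Build a placeholder step that records its name when called."""
    def stub(*args: Any) -> Any:
        log.append(name)
        return result
    return stub


if __name__ == "__main__":
    call_log: List[str] = []
    swm_ap_training_loop(
        n_epochs=2,
        collect_real=make_stub("collect", call_log, ["real_traj"]),
        fit_posterior_and_dynamics=make_stub("fit_model", call_log),
        fit_prior=make_stub("fit_prior", call_log),
        imagine=make_stub("imagine", call_log, ["imagined_traj"]),
        ppo_update=make_stub("ppo", call_log),
    )
    print(call_log)
```

Passing the steps in as callables keeps the sketch independent of any particular deep learning framework; a real implementation would replace each stub with model training code.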
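
The "SEM over 3 runs" error bars reported in Table 3 refer to the standard error of the mean across random seeds. A minimal sketch of that calculation is shown here, assuming the sample standard deviation (n − 1 in the denominator); the paper excerpt does not say which estimator was used, and the function name `mean_and_sem` and the example numbers are hypothetical.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_and_sem(values: Sequence[float]) -> Tuple[float, float]:
    """Return the mean and standard error of the mean of per-seed results.

    Assumption: SEM = sample standard deviation / sqrt(n).
    """
    n = len(values)
    if n < 2:
        raise ValueError("SEM needs at least two runs")
    mean = statistics.fmean(values)
    sem = statistics.stdev(values) / math.sqrt(n)
    return mean, sem


if __name__ == "__main__":
    # Hypothetical final returns from three seeds, not taken from the paper.
    per_seed_returns = [1.0, 2.0, 3.0]
    print(mean_and_sem(per_seed_returns))
```

With only three seeds the SEM is itself a noisy estimate, so error bars computed this way are best read as a rough indication of run-to-run variation.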