Learning to Mitigate AI Collusion on Economic Platforms

Authors: Gianluca Brero, Eric Mibuari, Nicolas Lepore, David C. Parkes

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we evaluate our learning approach via three main experiments. We first consider performance in terms of consumer surplus, benchmarking our RL interventions against the ones introduced by Johnson et al. (2021). We demonstrate the ability to learn optimal leader strategies in the Stackelberg game with the followers across all the seeds we tested, significantly outperforming existing interventions.
Researcher Affiliation | Academia | Gianluca Brero (Data Science Initiative, Brown University, gianluca_brero@brown.edu); Eric Mibuari (School of Engineering and Applied Sciences, Harvard University, mibuari@g.harvard.edu); Nicolas Lepore (School of Engineering and Applied Sciences, Harvard University, nlepore33@gmail.com); David C. Parkes (School of Engineering and Applied Sciences, Harvard University, parkes@g.harvard.edu)
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks; the methodology is described in narrative text.
Open Source Code | No | No public code repository is linked; the paper states only: 'We will include it in the supplemental material.'
Open Datasets | No | The paper describes a simulated platform economy for its experiments, rather than using an external publicly available dataset: 'As in Calvano et al. (2020a) and Johnson et al. (2021), we consider settings with two pricing agents with cost c = 1, quality indexes a_1 = a_2 = 2, and a_0 = 0, and we set parameter µ = 0.25 to control horizontal differentiation.' (A demand-model sketch follows the table.)
Dataset Splits | No | The paper describes simulation steps ('50k equilibrium steps and 30 reward steps', 'train our policies for 50 million steps in total') but does not refer to traditional training, validation, or test dataset splits, as its experiments are based on a simulated environment.
Hardware Specification | Yes | This coarsened price grid allows us to train a platform policy through Stackelberg POMDP for 50 million steps in 18 hours using a single core on an Intel(R) Xeon(R) Platinum 8268 CPU @ 2.90GHz machine.
Software Dependencies | Yes | To train the platform policy, we start from the A2C algorithm provided by Stable Baselines3 (Raffin et al., 2021, MIT License). (A training-call sketch follows the table.)
Experiment Setup | Yes | The seller Q-learning algorithms are also trained using discount factor δ = 0.95, exploration rate ε_t = e^(-βt) with β = 1e-5, and learning rate α = 0.15. We set up the Stackelberg POMDP environment using 50k equilibrium steps and 30 reward steps. In these initial experiments, we train our policies for 50 million steps in total. (A Q-learning seller sketch follows the table.)
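
The demand model itself is not reproduced in the excerpts above. The sketch below assumes the standard multinomial logit demand of Calvano et al. (2020a) with the quoted parameters (c = 1, a_1 = a_2 = 2, a_0 = 0, µ = 0.25); the `weights` argument is a hypothetical hook for platform interventions that re-weight seller prominence and is not part of the quoted setup.

```python
import math

# Parameters quoted in the paper (following Calvano et al. 2020a / Johnson et al. 2021)
C = 1.0          # seller marginal cost
A = [2.0, 2.0]   # quality indexes a_1 = a_2 = 2
A0 = 0.0         # outside-good quality index a_0 = 0
MU = 0.25        # horizontal differentiation parameter

def logit_demand(prices, weights=None):
    """Demand shares under multinomial logit demand (assumed form).

    `weights` is a hypothetical hook for platform interventions that
    change seller prominence; weights of 1.0 recover the baseline demand.
    """
    if weights is None:
        weights = [1.0] * len(prices)
    utils = [w * math.exp((a - p) / MU) for p, a, w in zip(prices, A, weights)]
    denom = sum(utils) + math.exp(A0 / MU)
    return [u / denom for u in utils]

def profits(prices, weights=None):
    """Per-seller profit: (price - cost) * demand share."""
    return [(p - C) * q for p, q in zip(prices, logit_demand(prices, weights))]

# Example: both sellers price at 1.5
print(profits([1.5, 1.5]))
```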
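
The seller side of the experiment setup is a tabular Q-learning pricer with the quoted hyperparameters (δ = 0.95, ε_t = e^(-βt) with β = 1e-5, α = 0.15). A minimal sketch follows; the grid size and the use of last-period joint prices as state are assumptions in the spirit of Calvano et al. (2020a), since the paper's exact 'coarsened price grid' is not given in the excerpts.

```python
import numpy as np

# Hyperparameters quoted in the experiment setup
DELTA = 0.95   # discount factor
BETA = 1e-5    # exploration decay: eps_t = exp(-BETA * t)
ALPHA = 0.15   # learning rate

class QLearningSeller:
    """Tabular Q-learning seller on a discretized price grid (sketch)."""

    def __init__(self, n_prices=15, seed=0):
        self.n_prices = n_prices
        self.n_states = n_prices ** 2        # assumed state: last joint price pair
        self.rng = np.random.default_rng(seed)
        self.q = np.zeros((self.n_states, n_prices))
        self.t = 0

    def act(self, state):
        """Epsilon-greedy price choice with exponentially decaying exploration."""
        eps = np.exp(-BETA * self.t)
        self.t += 1
        if self.rng.random() < eps:
            return int(self.rng.integers(self.n_prices))
        return int(np.argmax(self.q[state]))

    def update(self, state, action, reward, next_state):
        """One-step Q-learning update toward the discounted bootstrap target."""
        target = reward + DELTA * np.max(self.q[next_state])
        self.q[state, action] += ALPHA * (target - self.q[state, action])
```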
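
For the platform policy, the paper names A2C from Stable Baselines3 (Raffin et al., 2021). The sketch below shows the corresponding training call; `PlaceholderStackelbergEnv` is a stand-in defined here only so the snippet runs, since the paper's Stackelberg POMDP environment (Q-learning sellers run for 50k equilibrium steps, then 30 reward steps) is not publicly available, and all A2C hyperparameters besides the step budget are SB3 defaults.

```python
import gymnasium as gym
import numpy as np
from stable_baselines3 import A2C

class PlaceholderStackelbergEnv(gym.Env):
    """Stand-in for the paper's (unreleased) Stackelberg POMDP environment.

    The real environment runs the seller Q-learners toward equilibrium for
    50k steps and then collects 30 reward steps per platform episode; the
    dynamics below are a trivial placeholder so the training call runs.
    """

    def __init__(self, equilibrium_steps=50_000, reward_steps=30):
        self.equilibrium_steps = equilibrium_steps
        self.reward_steps = reward_steps
        self.observation_space = gym.spaces.Box(-1.0, 1.0, shape=(4,), dtype=np.float32)
        self.action_space = gym.spaces.Discrete(3)
        self._step = 0

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self._step = 0
        return self.observation_space.sample(), {}

    def step(self, action):
        self._step += 1
        obs = self.observation_space.sample()
        reward = 0.0  # the real env returns consumer surplus from the reward steps
        terminated = self._step >= self.reward_steps
        return obs, reward, terminated, False, {}

env = PlaceholderStackelbergEnv()
model = A2C("MlpPolicy", env, verbose=0)
model.learn(total_timesteps=10_000)  # the paper trains for 50 million steps
model.save("platform_policy_a2c")
```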