Bi-Level Actor-Critic for Multi-Agent Coordination

Authors: Haifeng Zhang, Weizhe Chen, Zeren Huang, Minne Li, Yaodong Yang, Weinan Zhang, Jun Wang (pp. 7325-7332)

AAAI 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | A convergence proof is given, and the resulting learning algorithm is tested against the state of the art. The proposed bi-level actor-critic algorithm converged to the Stackelberg equilibria in matrix games and found an asymmetric solution in a highway-merge environment.
Researcher Affiliation | Collaboration | University College London; Shanghai Jiao Tong University; Huawei R&D UK
Pseudocode | No | The paper describes update rules using mathematical equations (e.g., Eqs. 8-11 and 12-16) and textual descriptions, but provides no explicitly labeled "Pseudocode" or "Algorithm" block.
Open Source Code | Yes | "Our experiments are repeatable and the source code is provided in https://github.com/laonahongchen/Bilevel-Optimization-in-Coordination-Game."
Open Datasets | Yes | A slightly modified version of the Highway environment (Leurent 2018), an environment for autonomous driving decision-making (https://github.com/eleurent/highway-env), is used, in which an agent can observe the kinematics of the nearby agent, including its position and velocity.
Dataset Splits | No | The paper describes the training process and exploration strategies but does not specify explicit train/validation/test splits, percentages, or sample counts for any of the environments used.
Hardware Specification | No | The paper does not provide hardware details (e.g., GPU/CPU models, memory, or processor types) used for the experiments.
Software Dependencies | No | The paper mentions software components and methods (e.g., the Gumbel-Softmax estimator, DQN) but does not specify their version numbers or other software dependencies with versions.
Experiment Setup | No | The paper describes general aspects of the experimental setup, such as a three-layer fully connected neural network with ReLU activations and a decaying ε-greedy exploration method, and mentions learning rates α_i and β, but does not provide concrete hyperparameter values (e.g., specific learning rates, batch sizes, number of epochs) or other detailed system-level training settings.
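The central claim assessed above is that the bi-level actor-critic converges to Stackelberg equilibria in matrix games. As a minimal illustration of what such an equilibrium is (this is not the authors' code, and the payoff matrices are invented for the example), a Stackelberg equilibrium of a small matrix game can be found by enumeration: the follower best-responds to each leader action, and the leader commits to the action that maximizes its payoff given that anticipated response.

```python
import numpy as np

def stackelberg_equilibrium(leader_payoff, follower_payoff):
    """Return (leader_action, follower_action) for a finite matrix game:
    the follower best-responds to each leader action (ties broken by
    lowest index), and the leader anticipates this best response."""
    best = None
    for a in range(leader_payoff.shape[0]):
        b = int(np.argmax(follower_payoff[a]))  # follower's best response to a
        if best is None or leader_payoff[a, b] > leader_payoff[best]:
            best = (a, b)
    return best

# Illustrative coordination game with two pure Nash equilibria, (0,0) and (1,1).
# The Stackelberg solution is asymmetric: the leader commits to the
# equilibrium it prefers, and the follower goes along.
L = np.array([[3.0, 0.0],
              [0.0, 2.0]])  # leader payoffs
F = np.array([[2.0, 0.0],
              [0.0, 3.0]])  # follower payoffs
print(stackelberg_equilibrium(L, F))  # (0, 0)
```

This equilibrium-selection effect mirrors the asymmetric solution the report notes for the highway-merge environment: bi-level optimization breaks the symmetry between otherwise equivalent equilibria.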