Learning to Communicate Implicitly by Actions
Authors: Zheng Tian, Shihao Zou, Ian Davies, Tim Warr, Lisheng Wu, Haitham Bou Ammar, Jun Wang
AAAI 2020, pp. 7261-7268 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our approach on a set of environments including a matrix game, particle environment and the non-competitive bidding problem from contract bridge. We show empirically that this auxiliary reward is effective and easy to generalize. These results demonstrate that our PBL algorithm can produce strong pairs of agents in collaborative games where explicit communication is disabled. Our experiments show that agents trained using PBL can learn collaborative behaviors more effectively than a number of meaningful baselines without requiring any explicit communication. We conduct a complete ablation study to analyze the effectiveness of different components within PBL in our bridge experiment. |
| Researcher Affiliation | Collaboration | Zheng Tian (1), Shihao Zou (1), Ian Davies (1), Tim Warr (1), Lisheng Wu (1), Haitham Bou Ammar (1, 2), Jun Wang (1); (1) University College London, (2) Huawei R&D UK |
| Pseudocode | Yes | Algorithm 1 Per-Agent Policy Belief Learning (PBL) |
| Open Source Code | No | The paper does not provide any explicit statements about open-sourcing the code or include links to a code repository. |
| Open Datasets | No | The paper mentions using a 'pregenerated test data set which contains 30,000 games' and refers to prior work for the matrix game ('This game is first proposed in (Foerster et al. 2018)') and the particle environment ('modify a multi-agent particle environment (Lowe et al. 2017)'). However, it does not provide concrete access information (e.g., specific links, DOIs, or clear statements of public availability with citations) for the training datasets or the environments used. |
| Dataset Splits | No | The paper mentions a 'pregenerated test data set which contains 30,000 games' and '6 training runs', but it does not specify any training, validation, or test dataset splits (e.g., percentages, sample counts, or predefined splits) for reproducibility. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments, such as GPU models, CPU types, or memory specifications. |
| Software Dependencies | No | The paper mentions the use of a 'policy gradient algorithm' and 'NNs', but it does not provide specific software dependencies or their version numbers (e.g., Python version, specific deep learning frameworks like PyTorch or TensorFlow, or library versions). |
| Experiment Setup | Yes | Initially, in the absence of a belief module, we pre-train a policy π^0 naively by ignoring the existence of other agents in the environment. We apply a policy gradient algorithm with a shaped reward of the form r = r_e + α r_c (Eq. 4), where r_e is the reward from the environment, r_c is the communication reward, and α ≥ 0 balances the communication and environment rewards. In the distributed setting, we train separate belief modules for Guide and Listener respectively. (A sketch of how this shaped reward enters a policy-gradient update follows the table.) |
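
The Experiment Setup row above quotes the paper's shaped reward r = r_e + α r_c from Eq. (4). The snippet below is a minimal, self-contained sketch of how such a shaping term could plug into a plain REINFORCE policy-gradient update on a toy two-action problem; the value of ALPHA, the stand-in env_reward and comm_reward functions, and the toy setup are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

ALPHA = 0.3  # illustrative weight; the paper only requires alpha >= 0

rng = np.random.default_rng(0)
theta = np.zeros(2)  # logits of a two-action softmax policy


def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()


def env_reward(action: int) -> float:
    # Stand-in for the environment reward r_e: action 0 pays slightly more.
    return 0.6 if action == 0 else 0.5


def comm_reward(action: int) -> float:
    # Stand-in for the belief-based communication reward r_c:
    # action 1 is assumed to be the more informative signal to the partner.
    return 1.0 if action == 1 else 0.0


for step in range(2000):
    probs = softmax(theta)
    a = int(rng.choice(2, p=probs))
    r = env_reward(a) + ALPHA * comm_reward(a)  # Eq. (4): r = r_e + alpha * r_c
    grad_log_pi = np.eye(2)[a] - probs          # REINFORCE: grad of log pi(a | theta)
    theta += 0.05 * r * grad_log_pi             # policy-gradient ascent step

print("final action probabilities:", softmax(theta))
```

With ALPHA = 0 the update simply maximizes the environment reward; the positive weight above shifts probability mass toward the action assumed to be informative, which is the role the communication reward plays in the paper's shaped objective.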