Select to Perfect: Imitating desired behavior from large multi-agent data

Authors: Tim Franzmeyer, Edith Elkind, Philip Torr, Jakob Nicolaus Foerster, Joao F. Henriques

ICLR 2024

Reproducibility
Variable | Result | LLM Response
Research Type | Experimental | We empirically demonstrate how EVs can be estimated from fully-anonymized data and employ EV2BC (Def. 4.5) to learn policies aligned with the DVF, outperforming relevant baselines. The project website can be found at https://tinyurl.com/select-to-perfect. We run all experiments for five random seeds and report mean and standard deviation where applicable. For more details on the implementation, please refer to the Appendix. In the following experiments, we first evaluate EVs as a measure of an agent's contribution to a given DVF. We then assess the average estimation error for EVs as the number of observations in the dataset D decreases, and how applying clustering decreases this error. We lastly evaluate the performance of Exchange Value based Behaviour Cloning (EV2BC, see Definition 4.5) on simulated and human datasets and compare to relevant baselines, such as standard Behavior Cloning (Pomerleau, 1991) and Offline Reinforcement Learning (Pan et al., 2022).
Researcher Affiliation | Academia | Tim Franzmeyer, Edith Elkind, Philip Torr, Jakob Foerster, João F. Henriques (University of Oxford; frtim@robots.ox.ac.uk)
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | Reproducibility. To help reproduce our work, we publish code on the project website at https://tinyurl.com/select-to-perfect.
Open Datasets | Yes | The D_human dataset was collected from humans playing the game (see Carroll et al. (2019)); it is fully anonymized with one-time-use agent identifiers, hence is a degenerate dataset (see Figure 2, bottom row). The StarCraft Multi-Agent Challenge (Samvelyan et al., 2019) is a cooperative multi-agent environment...
Dataset Splits | No | The paper refers to "dataset D" and "test set" in the context of evaluation, but does not explicitly provide train/validation/test split percentages or counts needed for reproduction.
Hardware Specification | Yes | We used an Intel(R) Xeon(R) Silver 4116 CPU and an NVIDIA GeForce GTX 1080 Ti (only for training BC, EV2BC, group-BC, and OMAR policies).
Software Dependencies | No | The paper mentions various software components and algorithms used (e.g., k-means, PCA, SLSQP, scikit-learn) but does not provide specific version numbers for these software dependencies.
Experiment Setup | Yes | In accordance with the quantity of available data, we set the threshold parameter such that only agents with EVs above the 90th, 67th, and 50th percentile are imitated in ToC, StarCraft, and Overcooked, respectively. We conducted a hyperparameter sweep for the following parameters: learning rate with options {0.01, 0.001, 0.0001}, Omar-coe with options {0.1, 1, 10}, Omar-iters with options {1, 3, 10}, and Omar-sigma with options {1, 2, 3}.
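The selection step behind EV2BC, as described in the quotes above, can be sketched as follows. This is an illustrative reconstruction, not the authors' released code: the function name `ev2bc_select` and the data layout (per-agent lists of state-action pairs, plus a dict of precomputed Exchange Values) are assumptions; the paper's actual EV estimation from anonymized data is more involved.

```python
import numpy as np

def ev2bc_select(trajectories, exchange_values, percentile):
    """Keep only demonstrations from agents whose estimated Exchange
    Value (EV) lies above the given percentile, then pool their
    state-action pairs for standard behavior cloning.

    trajectories: {agent_id: [(state, action), ...]}
    exchange_values: {agent_id: estimated EV under the chosen DVF}
    percentile: e.g. 90 for ToC, 67 for StarCraft, 50 for Overcooked
    """
    evs = np.array([exchange_values[a] for a in trajectories])
    threshold = np.percentile(evs, percentile)
    selected = {a: t for a, t in trajectories.items()
                if exchange_values[a] >= threshold}
    # Pool state-action pairs from high-EV agents only; a standard
    # behavior-cloning learner would then be fit on `pooled`.
    pooled = [pair for traj in selected.values() for pair in traj]
    return selected, pooled
```

With a 50th-percentile threshold (as in Overcooked), roughly the top half of agents by EV contribute training data; the downstream imitation step is unchanged from ordinary behavior cloning.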
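The reported hyperparameter sweep is a full grid over four parameters. A minimal sketch of enumerating that grid (the snake_case keys such as `omar_coe` are my renderings of the paper's "Omar-coe" etc., and the enumeration itself is generic, not taken from the authors' code):

```python
from itertools import product

# Grid as reported in the experiment-setup quote above.
GRID = {
    "learning_rate": [0.01, 0.001, 0.0001],
    "omar_coe": [0.1, 1, 10],
    "omar_iters": [1, 3, 10],
    "omar_sigma": [1, 2, 3],
}

def sweep_configs(grid):
    """Yield one config dict per point in the Cartesian product."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))
```

The grid has 3 x 3 x 3 x 3 = 81 configurations; with the five random seeds reported above, a full sweep would mean 405 training runs per environment.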