Learning to Identify Top Elo Ratings: A Dueling Bandits Approach

Authors: Xue Yan, Yali Du, Binxin Ru, Jun Wang, Haifeng Zhang, Xu Chen (pp. 8797-8805)

AAAI 2022

Reproducibility

Each entry below lists the variable, the assessed result, and the supporting LLM response.
Research Type: Experimental. "We empirically demonstrate that our method achieves superior convergence speed and time efficiency on a variety of gaming tasks. We consider the following two batteries of experiments to evaluate the performance of our algorithms in the scenarios of transitive and intransitive real-world meta-games."
Researcher Affiliation: Academia. "Xue Yan 1,2, Yali Du* 3, Binxin Ru 4, Jun Wang 5, Haifeng Zhang 1,2, Xu Chen 6. 1 Institute of Automation, Chinese Academy of Sciences, China; 2 School of Artificial Intelligence, University of Chinese Academy of Sciences, China; 3 Department of Informatics, King's College London, UK; 4 Machine Learning Research Group, University of Oxford, UK; 5 Department of Computer Science, University College London, UK; 6 Gaoling School of Artificial Intelligence, Renmin University of China, China"
Pseudocode: Yes. "Algorithm 1: MaxIn-Elo: dueling bandits with online SGD for top player identification."
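The paper's Algorithm 1 is not reproduced in this report. As an illustration only, the core idea it names, updating Elo ratings by online SGD on the logistic loss after each sampled duel, can be sketched as follows. All function names, the uniform pair sampling (the paper instead selects duels with a UCB rule), and the conventional base-10/400 Elo scaling are assumptions, not taken from the paper:

```python
import random

def elo_win_prob(r_i, r_j):
    # Classic Elo logistic model: probability that player i beats player j.
    # The base-10 / 400 scaling is the conventional choice, assumed here.
    return 1.0 / (1.0 + 10.0 ** ((r_j - r_i) / 400.0))

def sgd_elo_step(ratings, i, j, outcome, lr):
    # One online-SGD step on the logistic (cross-entropy) loss of a duel
    # outcome (1.0 if i won, 0.0 if j won). The gradient w.r.t. r_i is
    # proportional to (outcome - p); r_j moves the opposite way.
    p = elo_win_prob(ratings[i], ratings[j])
    ratings[i] += lr * (outcome - p)
    ratings[j] -= lr * (outcome - p)

def identify_top_player(true_skill, n_duels=5000, lr=8.0, seed=0):
    # Toy loop: sample random pairs, observe a stochastic outcome drawn
    # from the hidden skills, update Elo estimates by online SGD, and
    # return the index of the highest-rated player.
    rng = random.Random(seed)
    n = len(true_skill)
    ratings = [0.0] * n
    for _ in range(n_duels):
        i, j = rng.sample(range(n), 2)
        p_true = elo_win_prob(true_skill[i], true_skill[j])
        outcome = 1.0 if rng.random() < p_true else 0.0
        sgd_elo_step(ratings, i, j, outcome, lr)
    return max(range(n), key=lambda k: ratings[k])
```

With well-separated hidden skills this recovers the true top player; the paper's MaxIn-Elo additionally uses UCB-driven duel selection and batched updates rather than uniform sampling.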
Open Source Code: No. The paper does not provide a link to, or an explicit statement about releasing, source code for the described methodology.
Open Datasets: Yes. "We do our experiments on twelve real-world games released by Czarnecki et al. (2020), most of which are implemented on the OpenSpiel framework (Lanctot et al. 2019)."
Dataset Splits: No. The paper states that "the batch size τ of MaxIn-P, MaxIn-Elo and MaxIn-mElo is set to 0.7n" but does not specify training, validation, or test splits (e.g., percentages, sample counts, or references to standard splits for the datasets used).
Hardware Specification: Yes. "All experiments were run in a single x86_64 GNU/Linux machine with 256 AMD EPYC 7742 64-Core Processor and 2 A100 PCIe 40GB GPU."
Software Dependencies: Yes. "We use sklearn (0.24.2) to solve the MLE."
Experiment Setup: Yes. "Parameters setting: For the Random, DBGD, and RG-UCB baselines, we perform a grid search for the initial step size η in the range {0.01, 0.05, 0.1, 0.5, 1, 5, 10}. For RG-UCB, the stopping confidence δ = 0.2. For MaxIn-P, we tune the UCB balance parameter γ ∈ {0.2, 0.4, 0.6, ..., 2.0}. For MaxIn-Elo and MaxIn-mElo, we tune the initial learning rate η ∈ {0.01, 0.05, 0.1, 0.5, 1, 5, 10}, with the learning rate at batch j set as η_j, and the UCB balance parameter γ ∈ {0.2, 0.4, 0.6, ..., 2.0}. The batch size τ of MaxIn-P, MaxIn-Elo and MaxIn-mElo is set to 0.7n."
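As a small sanity check on the search spaces quoted above, the η and γ grids can be enumerated to count the configurations tuned per algorithm (a sketch only; the variable names are not from the paper):

```python
from itertools import product

# Initial learning rates / step sizes searched, as quoted above.
etas = [0.01, 0.05, 0.1, 0.5, 1, 5, 10]

# UCB balance parameter grid {0.2, 0.4, ..., 2.0} in steps of 0.2.
gammas = [round(0.2 * k, 1) for k in range(1, 11)]

# Joint grid when both η and γ are tuned (MaxIn-Elo / MaxIn-mElo).
configs = list(product(etas, gammas))

print(len(etas), len(gammas), len(configs))  # 7 10 70
```

So each of MaxIn-Elo and MaxIn-mElo is tuned over 70 (η, γ) combinations, while the baselines search only the 7-point η grid.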