Learning Intuitive Policies Using Action Features

Authors: Mingwei Ma, Jizhou Liu, Samuel Sokota, Max Kleiman-Weiner, Jakob Nicolaus Foerster

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We experimentally evaluate the architectures in the hint-guess game introduced in Section 4. In Sections 6.2-6.4, we fix the hand size to be N = 5 and the features to be F1 = {1, 2, 3} and F2 = {A, B, C}. ... First, we evaluate model cross-play (XP) performance for each architecture in the intra-AXP setting. ... We recruited 10 university students to play hint-guess. (See the hand-sampling sketch after this table.)
Researcher Affiliation | Collaboration | Mingwei Ma*¹, Jizhou Liu*², Samuel Sokota*³, Max Kleiman-Weiner⁴, Jakob Foerster⁵. *Equal contribution. ¹Ubiquant Investment (work done while at University of Chicago); ²Booth School of Business, University of Chicago; ³Carnegie Mellon University; ⁴Harvard University; ⁵FLAIR, University of Oxford. Correspondence to: Mingwei Ma <mwma@ubiquant.com>, Samuel Sokota <ssokota@andrew.cmu.edu>.
Pseudocode | No | The paper describes its model architectures and training setup in detail within the text and Appendix A.1, and includes diagrams such as Fig. 3 for the model architecture. However, it does not feature any formal pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain any explicit statement about making its source code publicly available, nor does it provide a link to a code repository.
Open Datasets | No | The paper introduces a novel game called 'hint-guess' and uses 'procedurally generated coordination tasks' for its experiments. It does not use a pre-existing publicly available dataset and does not provide access information for any custom-generated dataset.
Dataset Splits | No | The paper mentions training agents using 'randomly initialized games' and 'randomly generated games' and evaluates performance, but it does not specify any fixed train/validation/test splits, percentages, or the methodology for partitioning data for validation purposes.
Hardware Specification | Yes | All experiments were run on two computing nodes with 256GB of memory and a 28-core Intel 2.4GHz CPU.
Software Dependencies | No | The paper mentions using 'standard experience-replay', 'mean squared error loss', 'stochastic gradient descent', and refers to 'IQL (Tan, 1993)'. It also describes architectures using 'ReLU activations' and 'dot-product attention'. (See the attention sketch after this table.) However, it does not provide specific version numbers for any programming languages, libraries (e.g., PyTorch, TensorFlow), or other software components used in the implementation.
Experiment Setup | Yes | By default we use standard experience-replay with a replay memory of size 300K. For optimization, we use the mean squared error loss, stochastic gradient descent with learning rate set to 10⁻⁴ and minibatches of size 500 each. We train the agents using 4M episodes. To allow more data to be collected between training steps, we update the network only after we receive 500 new observations rather than after every observation. We use the standard exponential decay scheme with exploration rate ϵ = ϵ_m + (ϵ_0 − ϵ_m) exp(−n/K), where n is the number of episodes, ϵ_m = 0.01, ϵ_0 = 0.95, and K = 50,000. (See the exploration-schedule sketch below.)
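
For concreteness, here is a minimal Python sketch of the hint-guess setup quoted in the Research Type row (hand size N = 5, features F1 = {1, 2, 3}, F2 = {A, B, C}). The paper releases no code, so this is not the authors' implementation; drawing cards with replacement is our assumption, as the quoted text does not specify the sampling procedure.

    import itertools
    import random

    # Feature sets and hand size from the paper's Section 6 setup.
    F1 = [1, 2, 3]
    F2 = ["A", "B", "C"]
    N = 5

    # Each card is a pair of features; the deck is the Cartesian product.
    DECK = list(itertools.product(F1, F2))

    def sample_hand(rng):
        # Assumption: hands are drawn uniformly with replacement.
        return [rng.choice(DECK) for _ in range(N)]

    print(sample_hand(random.Random(0)))  # prints one random 5-card hand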
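The Software Dependencies row notes that the paper names techniques such as dot-product attention without pinning any library versions. Below is a generic NumPy sketch of unscaled dot-product attention, softmax(QKᵀ)V, in its textbook form; it is illustrative only, and the shapes in the usage example are hypothetical.

    import numpy as np

    def dot_product_attention(q, k, v):
        # softmax(Q K^T) V, computed row-wise with a max-shift for
        # numerical stability; q: (n_q, d), k: (n_k, d), v: (n_k, d_v).
        scores = q @ k.T
        scores -= scores.max(axis=-1, keepdims=True)
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ v

    q = np.random.randn(5, 8)   # e.g. one query per card in a hand
    k = np.random.randn(9, 8)   # e.g. one key per card in the deck
    v = np.random.randn(9, 8)
    print(dot_product_attention(q, k, v).shape)  # (5, 8)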
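Finally, the exploration schedule quoted in the Experiment Setup row can be stated exactly. This sketch uses the paper's values (ϵ_m = 0.01, ϵ_0 = 0.95, K = 50,000); the variable names are our own.

    import math

    EPS_MIN, EPS_0, K = 0.01, 0.95, 50_000

    def epsilon(n):
        # eps(n) = eps_m + (eps_0 - eps_m) * exp(-n / K), n = episode count.
        return EPS_MIN + (EPS_0 - EPS_MIN) * math.exp(-n / K)

    for n in (0, 50_000, 4_000_000):
        print(n, round(epsilon(n), 4))
    # 0 -> 0.95; 50,000 -> ~0.3558; 4,000,000 -> ~0.01

At n = 0 the rate starts at ϵ_0 = 0.95 and decays by a factor of e every K = 50,000 episodes, so by the end of the 4M-episode training run quoted above it has effectively reached the floor ϵ_m = 0.01.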