A Strongly Asymptotically Optimal Agent in General Environments

Authors: Michael K. Cohen, Elliot Catt, Marcus Hutter

IJCAI 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conducted experiments in grid-worlds to compare the Inquisitive Reinforcement Learner to other weakly asymptotically optimal agents. From Section 5 (Experimental Results): "We compared Inq with other known weakly asymptotically optimal agents, Thompson sampling and BayesExp [Lattimore and Hutter, 2014a], in the grid-world environment using AIXIjs [Aslanides, 2017], which has previously been used to compare asymptotically optimal agents [Aslanides et al., 2017]. We tested in 10×10 grid-worlds and 20×20 grid-worlds, both with a single dispenser with probability of dispensing reward 0.75; that is, if the agent enters that cell, the probability of a reward of 1 is 0.75. Following the conventions of [Aslanides et al., 2017], we averaged over 50 simulations, used discount factor γ = 0.99, 600 MCTS samples, and a planning horizon of 6."
Researcher Affiliation | Academia | Michael K. Cohen, Elliot Catt and Marcus Hutter, Australian National University, {michael.cohen, elliot.carpentercatt, marcus.hutter}@anu.edu.au
Pseudocode | Yes | Algorithm 1: Inquisitive Reinforcement Learner's Policy (a hedged sketch of this policy appears after the table).
Open Source Code | Yes | "The code used for this experiment is available online at https://github.com/ejcatt/aixijs, and this version of Inq can be run in the browser at https://ejcatt.github.io/aixijs/demo.html#inq."
Open Datasets | Yes | "We compared Inq with other known weakly asymptotically optimal agents, Thompson sampling and BayesExp [Lattimore and Hutter, 2014a], in the grid-world environment using AIXIjs [Aslanides, 2017], which has previously been used to compare asymptotically optimal agents [Aslanides et al., 2017]. We tested in 10×10 grid-worlds and 20×20 grid-worlds, both with a single dispenser with probability of dispensing reward 0.75; that is, if the agent enters that cell, the probability of a reward of 1 is 0.75."
Dataset Splits | No | The paper describes experiments in simulated grid-world environments but does not specify traditional training, validation, or test splits in terms of percentages or sample counts, as is common for fixed datasets. The experimental setup instead involves continuous interaction within the simulated environment.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments.
Software Dependencies | No | The paper mentions using AIXIjs [Aslanides, 2017] but does not provide specific version numbers for it or any other software dependency.
Experiment Setup | Yes | "Following the conventions of [Aslanides et al., 2017], we averaged over 50 simulations, used discount factor γ = 0.99, 600 MCTS samples, and a planning horizon of 6. We found that using small values for β, specifically β ≤ 1, worked well. For our experiments we chose β = 1." (A configuration sketch follows below.)
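To make the Pseudocode row concrete: the paper's central idea is that Inq explores in proportion to its expected information gain and otherwise acts Bayes-optimally. The TypeScript sketch below is a minimal rendering of that idea under those assumptions; it is not the authors' Algorithm 1 or the AIXIjs implementation, and the identifiers `expectedInfoGain`, `bayesOptimalAction`, `exploratoryAction`, and the `beta` scaling are illustrative.

```typescript
// Hedged sketch of an Inq-style policy step. All identifiers are
// illustrative assumptions, not the authors' actual code or API.

interface Agent {
  expectedInfoGain(): number;   // posterior-expected information gain of exploring
  bayesOptimalAction(): number; // exploit: act greedily w.r.t. the Bayes mixture
  exploratoryAction(): number;  // explore: an information-seeking action
}

function inqStep(agent: Agent, beta: number): number {
  // Explore with probability proportional to expected information
  // gain (capped at 1); otherwise act Bayes-optimally.
  const pExplore = Math.min(1, beta * agent.expectedInfoGain());
  return Math.random() < pExplore
    ? agent.exploratoryAction()
    : agent.bayesOptimalAction();
}
```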
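Similarly, the hyperparameters quoted in the Experiment Setup row would translate into an AIXIjs-style configuration roughly like the following. The field names are hypothetical (AIXIjs's actual config keys may differ); only the values come from the quoted text.

```typescript
// Hypothetical AIXIjs-style configuration mirroring the reported setup.
// Field names are assumptions; values are taken from the quoted paper text.
const inqExperiment = {
  environment: {
    type: "gridworld",
    size: 10,                                  // also run at 20 (10x10 and 20x20 grids)
    dispensers: [{ rewardProbability: 0.75 }], // single reward dispenser
  },
  agent: {
    type: "Inq",
    beta: 1,          // exploration-scaling parameter (chosen as 1)
    discount: 0.99,   // discount factor gamma
    mctsSamples: 600, // MCTS samples per planning step
    horizon: 6,       // planning horizon
  },
  runs: 50,           // results averaged over 50 simulations
};
```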