A Direct Approximation of AIXI Using Logical State Abstractions

Authors: Samuel Yang-Zhao, Tianyu Wang, Kee Siong Ng

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results on controlling epidemics on large-scale contact networks validate the agent's performance.
Researcher Affiliation | Academia | Samuel Yang-Zhao, Australian National University, Canberra ACT 2601, samuel.yang-zhao@anu.edu.au; Tianyu Wang, Australian National University, Canberra ACT 2601, tianyu.wang2@anu.edu.au; Kee Siong Ng, Australian National University, Canberra ACT 2601, keesiong.ng@anu.edu.au
Pseudocode | No | The paper describes algorithms and processes but does not include any pseudocode or algorithm blocks.
Open Source Code | No | Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [No]
Open Datasets | Yes | We use an email network dataset as the underlying contact network, licensed under a Creative Commons Attribution-Share Alike License, containing 1133 nodes and 5451 edges [44, 45].
Dataset Splits | No | The paper does not explicitly specify dataset splits (e.g., training, validation, test percentages or counts) or cross-validation methods.
Hardware Specification | Yes | All experiments were performed on a 12-core AMD Ryzen Threadripper 1920X processor and 32 gigabytes of memory.
Software Dependencies | No | The paper does not specify any software dependencies with version numbers (e.g., programming languages, libraries, frameworks).
Experiment Setup | Yes | The transition model, observation model, and Action_Cost(a_t) are parametrised the same way across all experiments (see Table 1 in Appendix B). A Quarantine(i) action imparts a cost of 1 per node that is quarantined at the given time step. A Vaccinate(i, j) action imparts a lower cost of 0.5 per node. The parameters λ, ε1, ε2 are varied across experiments. We generate a set of 1489 predicate functions... The Φ-AIXI-CTW agent is trained in an online fashion. The agent explores with probability εt at each step t until εt < 0.03, after which the agent performs in an ε-greedy way with exploration rate 0.03. RF-BDD was performed with a threshold value of 0.9 across all rewards. (The exploration schedule is sketched below.)
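The exploration schedule described in the Experiment Setup row can be illustrated with a minimal sketch. The excerpt only states that the agent explores with probability εt at step t until εt falls below 0.03, after which it acts ε-greedily with a fixed rate of 0.03; the particular decay schedule (a 1/t decay here), the action set, and the `value_of` helper are assumptions made for illustration, not details taken from the paper.

```python
import random

EPSILON_FLOOR = 0.03  # fixed exploration rate once the decayed value drops below it


def exploration_rate(t, decay=lambda t: 1.0 / max(t, 1)):
    """Decayed exploration probability epsilon_t, floored at 0.03.

    The 1/t decay is an assumption: the excerpt does not state how epsilon_t
    is annealed, only that it eventually falls below 0.03.
    """
    return max(decay(t), EPSILON_FLOOR)


def epsilon_greedy_action(actions, value_of, t, rng=random):
    """Pick a random action with probability epsilon_t, otherwise the greedy one.

    `actions` is the list of available actions (e.g. Quarantine(i), Vaccinate(i, j))
    and `value_of` is a hypothetical action-value estimate from the agent's model.
    """
    if rng.random() < exploration_rate(t):
        return rng.choice(actions)
    return max(actions, key=value_of)
```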