Self-Predictive Universal AI
Authors: Elliot Catt, Jordi Grau-Moya, Marcus Hutter, Matthew Aitchison, Tim Genewein, Grégoire Delétang, Kevin Li, Joel Veness
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | While our work is mainly theoretical, we also conducted experiments (see Appendix B) comparing self-prediction, using a Self-AIXI approximation, against the pure planning approach, using an AIXI approximation, using Context Tree Weighting as predictor and Monte-Carlo Tree Search for the Q-value estimates. ... Results. Figure 1 shows learning curves for the Cheeze Maze, Tiger and 4x4 Grid domains respectively. ... The final performance (as evaluated by the average reward per step over the final 2000 timesteps) of each agent configuration is shown in Table 2. |
| Researcher Affiliation | Industry | Elliot Catt, Jordi Grau-Moya, Marcus Hutter, Matthew Aitchison, Tim Genewein, Grégoire Delétang, Kevin Li Wenliang, Joel Veness. Google DeepMind. ecatt@google.com |
| Pseudocode | No | The paper provides definitions, theorems, and proofs but does not include any pseudocode or explicitly labeled algorithm blocks. |
| Open Source Code | No | The paper does not provide any statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | We evaluated across 5 stochastic, partially observable and history dependent domains: Cheeze Maze, Kuhn Poker, 4x4 Grid, Tiger and Biased Scissor/Paper/Rock. The description for each of these domains can once again be found in [20]. |
| Dataset Splits | No | The paper describes an online learning setup with a single run across timesteps and rolling average reward for evaluation, but does not provide explicit train/validation/test dataset splits or cross-validation details. |
| Hardware Specification | No | The paper does not provide any specific details regarding the hardware (e.g., CPU, GPU models, or cloud computing resources) used for running the experiments. |
| Software Dependencies | No | The paper mentions techniques like Monte-Carlo Tree Search and Context Tree Weighting but does not provide specific version numbers for any software dependencies or libraries. |
| Experiment Setup | Yes | For all experiments, we use a finite m-horizon undiscounted return setup, with each domain-specific horizon choice given by Table 3 in [20]. The CTW depth parameter for both the environment model (ξ̂ := CTW_d) and the self-prediction model (ζ̂ := CTW_d) was also chosen to match Table 3 in [20]. ... Each environment was evaluated by performing a single online run across 10^4 timesteps; ... At each timestep t, both agents pick either a random action with probability ϵ_t := 0.2 · 0.999^t, or otherwise return the estimated best action according to action-value estimates computed with 500 simulations. |
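The exploration schedule quoted in the setup row (random action with probability ϵ_t := 0.2 · 0.999^t, otherwise the action with the highest Q-value estimate) can be sketched in a few lines. This is a minimal illustration, not the authors' code; the function names (`epsilon`, `select_action`) and the dictionary representation of Q-value estimates are our own assumptions.

```python
import random

def epsilon(t: int) -> float:
    # Decaying exploration probability from the paper: 0.2 * 0.999^t.
    return 0.2 * 0.999 ** t

def select_action(t, actions, q_values, rng=random):
    # With probability epsilon(t) pick a uniformly random action;
    # otherwise act greedily w.r.t. the Q-value estimates
    # (which the paper computes via 500 MCTS simulations).
    if rng.random() < epsilon(t):
        return rng.choice(actions)
    return max(actions, key=lambda a: q_values[a])
```

At t = 0 the agent explores 20% of the time; by t ≈ 2300 the probability has decayed below 2%, so the single 10^4-step online run is dominated by greedy action selection in its later stages.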