Reinforcement Learning in Newcomblike Environments

Authors: James Bell, Linda Linsefors, Caspar Oesterheld, Joar Skalse

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, softmax agents converge (to strongly ratifiable policies) in many NDPs, provided that the temperature decreases sufficiently slowly. To illustrate this we will use Asymmetric Death in Damascus, a version of Death in Damascus wherein the rewards of a_Aleppo are changed to be 5 (instead of 0) with probability π(a_Aleppo) and (as before) 10 with the remaining probability. This NDP has only one (strongly) ratifiable policy, namely to go to Aleppo with probability 2/3 and Damascus with probability 1/3. This is also the optimal policy. We use this asymmetric version to make it easier to distinguish between convergence to the ratifiable policy and the default of uniform mixing at high temperatures. Figure 2 shows the probability of converging to this policy with a softmax agent and a plot of the policy on one run. We can see that this agent reliably converges provided that the cooling is sufficiently slow. (A worked derivation of the 2/3 policy is given below the table.)
Researcher Affiliation | Academia | James Bell, The Alan Turing Institute, London, UK (jbell@posteo.com); Linda Linsefors, Independent Researcher (linda.linsefors@gmail.com); Caspar Oesterheld, Department of Computer Science, Duke University, Durham, NC, USA (caspar.oesterheld@duke.edu); Joar Skalse, Department of Computer Science, University of Oxford, Oxford, UK (joar.skalse@cs.ox.ac.uk)
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. It describes algorithms like Q-learning and SARSA in text (a sketch of the standard update rules is given below the table).
Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. There is no mention of releasing code or links to repositories.
Open Datasets | No | The paper primarily discusses theoretical environments/problems (Newcomb's Problem, Death in Damascus, the Repellor Problem, Loss-Averse Rock-Paper-Scissors) rather than using standard publicly available datasets with specific access information (links, DOIs, repositories, or formal citations). While it cites the origin of Newcomb's Problem and Death in Damascus, these are conceptual problems, not empirical datasets.
Dataset Splits | No | The paper does not provide dataset split information for training, validation, or testing. The experiments are conducted on theoretically defined environments (NDPs), not on pre-split empirical datasets.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. No computing resources are mentioned.
Software Dependencies | No | The paper mentions Q-learning, SARSA, and Expected SARSA as update rules, but does not specify versions of any software libraries or programming languages used (e.g., Python, PyTorch).
Experiment Setup | Yes | Figure 2: The left figure plots the probability of softmax converging in Asymmetric Death in Damascus given β_n = n^c against the exponent c. More accurately, it is a plot of the fraction of runs which assigned a Q-value of at least 5.5 to the action of going to Aleppo after 5000 iterations. These are empirical probabilities from 20,000 runs for every c that is a multiple of 0.025, and 510,000 runs for each c that is a multiple of 0.005 between 0.5 and 0.55. (...) Figure 3: This figure shows five runs of a softmax agent in LARPS, and plots π(a_rock) against the total number of episodes played. The agent's Q-values are the historical mean rewards for each action, and β_t = 1/log t. (...) Figure 4b depicts five runs of ε-Greedy in LARPS. (...) The agent's Q-values are the historical mean rewards for each action, and its ε-value is 0.01. (A simulation sketch of the Figure 2 setup is given below the table.)
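Worked derivation (for the Research Type row). The 2/3-1/3 ratifiable policy follows from the indifference condition implied by the payoffs quoted above. This sketch assumes the standard Death in Damascus payoffs for a_Damascus (0 if Death is in Damascus, 10 otherwise) and a predictor that places Death according to the agent's current policy π; neither detail is spelled out in the excerpt itself.

\[
\mathbb{E}[R \mid a_{\mathrm{Aleppo}}] = 5\,\pi(a_{\mathrm{Aleppo}}) + 10\,\bigl(1 - \pi(a_{\mathrm{Aleppo}})\bigr),
\qquad
\mathbb{E}[R \mid a_{\mathrm{Damascus}}] = 0\,\pi(a_{\mathrm{Damascus}}) + 10\,\bigl(1 - \pi(a_{\mathrm{Damascus}})\bigr).
\]

A full-support policy is ratifiable only if both actions are equally good given that policy. Writing p = π(a_Aleppo), so π(a_Damascus) = 1 - p,

\[
10 - 5p = 10p \quad\Longrightarrow\quad p = \tfrac{2}{3},
\]

which recovers the policy stated in the excerpt: Aleppo with probability 2/3, Damascus with probability 1/3.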
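Update-rule sketch (for the Pseudocode and Software Dependencies rows). The paper describes Q-learning, SARSA, and Expected SARSA in text only; the following is a minimal sketch of the standard textbook forms of those updates, not the authors' code, with illustrative learning rate and discount values.

import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    # Off-policy target: bootstrap from the greedy value at the next state.
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    # On-policy target: bootstrap from the action actually taken next.
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])

def expected_sarsa_update(Q, s, a, r, s_next, pi_next, alpha=0.1, gamma=0.9):
    # Target: expectation of the next Q-values under the current policy pi_next.
    target = r + gamma * np.dot(pi_next, Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

Here Q is a 2-D array indexed by [state, action]; pi_next is the policy's action distribution at s_next.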
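Simulation sketch (for the Experiment Setup row). A minimal reproduction of the Figure 2 setup as quoted: softmax action selection with inverse temperature β_n = n^c and the "Q(Aleppo) >= 5.5 after 5000 iterations" convergence proxy. Assumptions not fixed by the excerpt: the exponent label c, historical-mean Q-values (stated only for Figure 3), and a predictor that samples Death's location from the agent's current policy.

import numpy as np

def run_softmax_add(c=0.6, episodes=5000, seed=0):
    # One run of a softmax agent in Asymmetric Death in Damascus.
    # Actions: 0 = Aleppo, 1 = Damascus.
    rng = np.random.default_rng(seed)
    q = np.zeros(2)         # Q-values kept as historical mean rewards (assumption)
    counts = np.zeros(2)
    for n in range(1, episodes + 1):
        beta = n ** c                           # inverse temperature schedule beta_n = n^c
        logits = beta * q
        logits -= logits.max()                  # numerical stability
        pi = np.exp(logits) / np.exp(logits).sum()
        a = rng.choice(2, p=pi)
        death = rng.choice(2, p=pi)             # Death placed via the current policy (assumption)
        if a == 0:
            r = 5.0 if death == 0 else 10.0     # Aleppo: 5 if Death is there, else 10
        else:
            r = 0.0 if death == 1 else 10.0     # Damascus: 0 if Death is there, else 10
        counts[a] += 1
        q[a] += (r - q[a]) / counts[a]          # incremental historical mean
    return q

# Convergence proxy from the quoted caption: Q(Aleppo) >= 5.5 after 5000 iterations.
runs = [run_softmax_add(c=0.6, seed=s) for s in range(200)]
print("fraction with Q(Aleppo) >= 5.5:", np.mean([q[0] >= 5.5 for q in runs]))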