Reinforcement Learning in Newcomblike Environments

Authors: James Bell, Linda Linsefors, Caspar Oesterheld, Joar Skalse

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, softmax agents converge (to strongly ratifiable policies) in many NDPs, provided that the temperature decreases sufficiently slowly. To illustrate this we will use Asymmetric Death in Damascus, a version of Death in Damascus wherein the rewards of a_Aleppo are changed to be 5 (instead of 0) with probability π(a_Aleppo) and (as before) 10 with the remaining probability. This NDP has only one (strongly) ratifiable policy, namely to go to Aleppo with probability 2/3 and Damascus with probability 1/3. This is also the optimal policy. We use this asymmetric version to make it easier to distinguish between convergence to the ratifiable policy and the default of uniform mixing at high temperatures. Figure 2 shows the probability of converging to this policy with a softmax agent and a plot of the policy on one run. We can see that this agent reliably converges provided that the cooling is sufficiently slow. (A worked derivation of the 2/3 policy is given below the table.)
Researcher Affiliation | Academia | James Bell, The Alan Turing Institute, London, UK (jbell@posteo.com); Linda Linsefors, Independent Researcher (linda.linsefors@gmail.com); Caspar Oesterheld, Department of Computer Science, Duke University, Durham, NC, USA (caspar.oesterheld@duke.edu); Joar Skalse, Department of Computer Science, University of Oxford, Oxford, UK (joar.skalse@cs.ox.ac.uk)
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. It describes algorithms like Q-learning and SARSA in text (a sketch of the standard update rules is given below the table).
Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. There is no mention of releasing code or links to repositories.
Open Datasets | No | The paper primarily discusses theoretical environments/problems (Newcomb's Problem, Death in Damascus, the Repellor Problem, Loss-Averse Rock-Paper-Scissors) rather than using standard publicly available datasets with specific access information (links, DOIs, repositories, or formal citations). While it cites the origin of Newcomb's Problem and Death in Damascus, these are conceptual problems, not empirical datasets.
Dataset Splits | No | The paper does not provide dataset split information for training, validation, or testing. The experiments are conducted on theoretically defined environments (NDPs), not on pre-split empirical datasets.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. No computing resources are mentioned.
Software Dependencies | No | The paper mentions Q-learning, SARSA, and Expected SARSA as update rules, but does not specify versions of any software libraries or programming languages used (e.g., Python, PyTorch).
Experiment Setup | Yes | Figure 2: The left figure plots the probability of softmax converging in Asymmetric Death in Damascus given β_n = n^c against the exponent c. More accurately, it is a plot of the fraction of runs which assigned a Q-value of at least 5.5 to the action of going to Aleppo after 5000 iterations. These are empirical probabilities from 20,000 runs for every c that is a multiple of 0.025, and 510,000 runs for each c that is a multiple of 0.005 between 0.5 and 0.55. (...) Figure 3: This figure shows five runs of a softmax agent in LARPS, and plots π(a_rock) against the total number of episodes played. The agent's Q-values are the historical mean rewards for each action, and β_t = 1/log t. (...) Figure 4b depicts five runs of ε-Greedy in LARPS. (...) The agent's Q-values are the historical mean rewards for each action, and its ε-value is 0.01. (A simulation sketch of the Figure 2 setup is given below the table.)
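Worked derivation (for the Research Type row). The 2/3-1/3 ratifiable policy follows from the indifference condition implied by the payoffs quoted above. This sketch assumes the standard Death in Damascus payoffs for a_Damascus (0 if Death is in Damascus, 10 otherwise) and a predictor that places Death according to the agent's current policy π; neither detail is spelled out in the excerpt itself.

\[
\mathbb{E}[R \mid a_{\mathrm{Aleppo}}] = 5\,\pi(a_{\mathrm{Aleppo}}) + 10\,\bigl(1 - \pi(a_{\mathrm{Aleppo}})\bigr),
\qquad
\mathbb{E}[R \mid a_{\mathrm{Damascus}}] = 0\,\pi(a_{\mathrm{Damascus}}) + 10\,\bigl(1 - \pi(a_{\mathrm{Damascus}})\bigr).
\]

A full-support policy is ratifiable only if both actions are equally good given that policy. Writing p = π(a_Aleppo), so π(a_Damascus) = 1 - p,

\[
10 - 5p = 10p \quad\Longrightarrow\quad p = \tfrac{2}{3},
\]

which recovers the policy stated in the excerpt: Aleppo with probability 2/3, Damascus with probability 1/3.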
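Update-rule sketch (for the Pseudocode and Software Dependencies rows). The paper describes Q-learning, SARSA, and Expected SARSA in text only; the following is a minimal sketch of the standard textbook forms of those updates, not the authors' code, with illustrative learning rate and discount values.

import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    # Off-policy target: bootstrap from the greedy value at the next state.
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    # On-policy target: bootstrap from the action actually taken next.
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])

def expected_sarsa_update(Q, s, a, r, s_next, pi_next, alpha=0.1, gamma=0.9):
    # Target: expectation of the next Q-values under the current policy pi_next.
    target = r + gamma * np.dot(pi_next, Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

Here Q is a 2-D array indexed by [state, action]; pi_next is the policy's action distribution at s_next.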
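Simulation sketch (for the Experiment Setup row). A minimal reproduction of the Figure 2 setup as quoted: softmax action selection with inverse temperature β_n = n^c and the "Q(Aleppo) >= 5.5 after 5000 iterations" convergence proxy. Assumptions not fixed by the excerpt: the exponent label c, historical-mean Q-values (stated only for Figure 3), and a predictor that samples Death's location from the agent's current policy.

import numpy as np

def run_softmax_add(c=0.6, episodes=5000, seed=0):
    # One run of a softmax agent in Asymmetric Death in Damascus.
    # Actions: 0 = Aleppo, 1 = Damascus.
    rng = np.random.default_rng(seed)
    q = np.zeros(2)         # Q-values kept as historical mean rewards (assumption)
    counts = np.zeros(2)
    for n in range(1, episodes + 1):
        beta = n ** c                           # inverse temperature schedule beta_n = n^c
        logits = beta * q
        logits -= logits.max()                  # numerical stability
        pi = np.exp(logits) / np.exp(logits).sum()
        a = rng.choice(2, p=pi)
        death = rng.choice(2, p=pi)             # Death placed via the current policy (assumption)
        if a == 0:
            r = 5.0 if death == 0 else 10.0     # Aleppo: 5 if Death is there, else 10
        else:
            r = 0.0 if death == 1 else 10.0     # Damascus: 0 if Death is there, else 10
        counts[a] += 1
        q[a] += (r - q[a]) / counts[a]          # incremental historical mean
    return q

# Convergence proxy from the quoted caption: Q(Aleppo) >= 5.5 after 5000 iterations.
runs = [run_softmax_add(c=0.6, seed=s) for s in range(200)]
print("fraction with Q(Aleppo) >= 5.5:", np.mean([q[0] >= 5.5 for q in runs]))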