Eventual Discounting Temporal Logic Counterfactual Experience Replay

Authors: Cameron Voloshin, Abhinav Verma, Yisong Yue

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments, conducted in both discrete and continuous state-action spaces, confirm the effectiveness of our counterfactual experience replay approach.
Researcher Affiliation | Collaboration | Caltech, Penn State, Latitude AI. Correspondence to: Cameron Voloshin <cvoloshin@caltech.edu>.
Pseudocode | Yes | Algorithm 1: Learning with LCER; Algorithm 2: LCER for Q-learning; Algorithm 3: LCER for Policy Gradient; Algorithm 4: LCER for Policy Gradient (Option 2)
Open Source Code | Yes | Code here: https://github.com/clvoloshin/RL-LTL
Open Datasets | No | The paper describes custom environments (Minecraft, Pacman, Flatworld, Carlo) used for experiments, but it does not provide concrete access information (links, DOIs, formal citations) for these environments as publicly available datasets.
Dataset Splits | No | The paper describes experimental setups and training procedures but does not explicitly mention train/validation/test dataset splits or their specific percentages/counts. The experiments are conducted in simulated environments through online interaction.
Hardware Specification | No | The paper does not provide specific hardware details (such as GPU or CPU models, memory, or cloud instance types) used for running its experiments.
Software Dependencies | No | The paper mentions algorithms and optimizers such as 'Q-learning', 'PPO', 'Adam optimizer', and 'DDQN' but does not specify their version numbers, nor the versions of the underlying programming language or libraries such as Python or PyTorch/TensorFlow.
Experiment Setup | Yes | A.2. Experiment Setup: Each experiment is run with 10 random seeds. Results in Figure 2 are averaged over the seeds. Q-learning experiments: Let k be the greatest number of jump transitions available in any LDBA state, k = max_{b ∈ S_B} |A_B(b)|, and let m = max_{s ∈ S_M} |A_M(s)|. The neural network Q_θ(s) takes as input s ∈ S_M and outputs a vector in R^{(m+k)·|S_B|}, i.e., an (m+k)-dimensional vector for each b ∈ S_B. For our purposes, we consider Q_θ(s, b) to be the single (m+k)-dimensional vector corresponding to the current LDBA state b. When S_M is discrete, we parametrize Q_θ(s, b) as a table. Otherwise, Q_θ(s, b) is parameterized by 3 linear layers with hidden dimension 128, intermediate ReLU activations, and no final activation. After masking for how many jump transitions exist in b, we select arg max_{i ∈ [0, ..., |A_B(b)|]} Q_θ(s, b)_i, the action with the highest Q-value, with probability 1 − η, and a uniformly random action with probability η. Here, η is initialized to η_0 and decays linearly (or exponentially) at a specified frequency (see Table 2). At each episode (after a rollout of length T), we perform K gradient steps with different batches of the size given in Table 3. We use the Adam optimizer (Kingma & Ba, 2015) with a learning rate also specified by that table. In continuous state spaces, we implement DDQN (Hasselt et al., 2016), rather than DQN, with a target network that is updated at a frequency specified by Table 3.
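For concreteness, the following is a minimal PyTorch sketch of the continuous-state Q-network and masked ε-greedy action selection described in the experiment setup above. It is an illustration under assumptions, not the authors' implementation: the class and argument names (ProductQNetwork, num_valid_actions, decay_eta) are hypothetical, and the per-b layout of the (m+k)·|S_B| output is one possible choice.

```python
# Hypothetical sketch (not the authors' code) of the Q-network described above:
# three linear layers with hidden dimension 128, ReLU activations in between,
# no final activation, and an (m + k)-dimensional Q-vector per LDBA state b.
import torch
import torch.nn as nn


class ProductQNetwork(nn.Module):
    def __init__(self, state_dim, num_ldba_states, m, k, hidden=128):
        super().__init__()
        self.num_ldba_states = num_ldba_states
        self.m_plus_k = m + k
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, self.m_plus_k * num_ldba_states),  # no final activation
        )

    def forward(self, s, b):
        # s: (batch, state_dim) MDP states; b: (batch,) integer LDBA-state indices.
        q_all = self.net(s).view(-1, self.num_ldba_states, self.m_plus_k)
        # Select the (m + k)-dim Q-vector for the current LDBA state of each sample.
        return q_all[torch.arange(s.shape[0]), b]


def masked_epsilon_greedy(q_values, num_valid_actions, eta):
    """Pick the argmax over valid actions with probability 1 - eta, else uniform.

    q_values: (m + k,) Q-vector for a single (s, b); entries beyond
    num_valid_actions correspond to jump transitions that do not exist in b.
    """
    if torch.rand(()).item() < eta:
        return torch.randint(num_valid_actions, (1,)).item()
    return torch.argmax(q_values[:num_valid_actions]).item()


def decay_eta(eta, eta_min=0.01, step=1e-4):
    # Linear decay of the exploration rate, clipped at eta_min (values illustrative).
    return max(eta_min, eta - step)
```

Training would then follow the DDQN recipe referenced above: K Adam gradient steps per episode on sampled batches, with a target network updated periodically; those hyperparameters are the ones deferred to Tables 2 and 3 of the paper and are omitted here.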