Eventual Discounting Temporal Logic Counterfactual Experience Replay

Authors: Cameron Voloshin, Abhinav Verma, Yisong Yue

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments, conducted in both discrete and continuous state-action spaces, confirm the effectiveness of our counterfactual experience replay approach.
Researcher Affiliation | Collaboration | Caltech, Penn State, Latitude AI. Correspondence to: Cameron Voloshin <cvoloshin@caltech.edu>.
Pseudocode | Yes | Algorithm 1: Learning with LCER; Algorithm 2: LCER for Q-learning; Algorithm 3: LCER for Policy Gradient; Algorithm 4: LCER for Policy Gradient (Option 2)
Open Source Code | Yes | Code here: https://github.com/clvoloshin/RL-LTL
Open Datasets | No | The paper describes custom environments (Minecraft, Pacman, Flatworld, Carlo) used for experiments, but it does not provide concrete access information (links, DOIs, formal citations) for these environments as publicly available datasets.
Dataset Splits | No | The paper describes experimental setups and training procedures but does not explicitly mention train/validation/test dataset splits or their specific percentages/counts. The experiments are conducted in simulated environments through online interaction.
Hardware Specification | No | The paper does not provide specific hardware details (such as GPU or CPU models, memory, or cloud instance types) used for running its experiments.
Software Dependencies | No | The paper mentions algorithms and optimizers such as 'Q-learning', 'PPO', 'Adam optimizer', and 'DDQN' but does not specify their version numbers, nor the versions of the underlying programming language or libraries such as Python or PyTorch/TensorFlow.
Experiment Setup | Yes | A.2. Experiment Setup: Each experiment is run with 10 random seeds. Results in Figure 2 are averaged over the seeds. Q-learning experiments: Let k be the greatest number of jump transitions available in any LDBA state, k = max_{b ∈ S_B} |A_B(b)|, and let m = max_{s ∈ S_M} |A_M(s)|. The neural network Q_θ(s) takes as input s ∈ S_M and outputs a vector in R^{(m+k)·|S_B|}, i.e., an (m+k)-dimensional vector for each b ∈ S_B. For our purposes, we consider Q_θ(s, b) to be the single (m+k)-dimensional vector corresponding to the current LDBA state b. When S_M is discrete, we parametrize Q_θ(s, b) as a table. Otherwise, Q_θ(s, b) is parameterized by 3 linear layers with hidden dimension 128, intermediate ReLU activations, and no final activation. After masking for how many jump transitions exist in b, we select arg max_{i ∈ [0, ..., |A_B(b)|]} Q_θ(s, b)_i, the action with the highest Q-value, with probability 1 − η, and a uniformly random action with probability η. Here, η is initialized to η_0 and decays linearly (or exponentially) at a specified frequency (see Table 2). At each episode (after a rollout of length T), we perform K gradient steps with different batches of the size given in Table 3. We use the Adam optimizer (Kingma & Ba, 2015) with a learning rate also specified by that table. In continuous state spaces, we implement DDQN (Hasselt et al., 2016), rather than DQN, with a target network that is updated at a frequency specified by Table 3.
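For concreteness, the following is a minimal PyTorch sketch of the continuous-state Q-network and masked ε-greedy action selection described in the experiment setup above. It is an illustration under assumptions, not the authors' implementation: the class and argument names (ProductQNetwork, num_valid_actions, decay_eta) are hypothetical, and the per-b layout of the (m+k)·|S_B| output is one possible choice.

```python
# Hypothetical sketch (not the authors' code) of the Q-network described above:
# three linear layers with hidden dimension 128, ReLU activations in between,
# no final activation, and an (m + k)-dimensional Q-vector per LDBA state b.
import torch
import torch.nn as nn


class ProductQNetwork(nn.Module):
    def __init__(self, state_dim, num_ldba_states, m, k, hidden=128):
        super().__init__()
        self.num_ldba_states = num_ldba_states
        self.m_plus_k = m + k
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, self.m_plus_k * num_ldba_states),  # no final activation
        )

    def forward(self, s, b):
        # s: (batch, state_dim) MDP states; b: (batch,) integer LDBA-state indices.
        q_all = self.net(s).view(-1, self.num_ldba_states, self.m_plus_k)
        # Select the (m + k)-dim Q-vector for the current LDBA state of each sample.
        return q_all[torch.arange(s.shape[0]), b]


def masked_epsilon_greedy(q_values, num_valid_actions, eta):
    """Pick the argmax over valid actions with probability 1 - eta, else uniform.

    q_values: (m + k,) Q-vector for a single (s, b); entries beyond
    num_valid_actions correspond to jump transitions that do not exist in b.
    """
    if torch.rand(()).item() < eta:
        return torch.randint(num_valid_actions, (1,)).item()
    return torch.argmax(q_values[:num_valid_actions]).item()


def decay_eta(eta, eta_min=0.01, step=1e-4):
    # Linear decay of the exploration rate, clipped at eta_min (values illustrative).
    return max(eta_min, eta - step)
```

Training would then follow the DDQN recipe referenced above: K Adam gradient steps per episode on sampled batches, with a target network updated periodically; those hyperparameters are the ones deferred to Tables 2 and 3 of the paper and are omitted here.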