On Trajectory Augmentations for Off-Policy Evaluation
Authors: Ge Gao, Qitong Gao, Xi Yang, Song Ju, Miroslav Pajic, Min Chi
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our work was empirically evaluated in a wide array of environments, encompassing both simulated scenarios and real-world domains like robotic control, healthcare, and e-learning, where the training trajectories include varying levels of coverage of the state-action space. |
| Researcher Affiliation | Collaboration | North Carolina State University, USA. Emails: {ggao5, mchi}@ncsu.edu. Duke University, USA. Emails: {qitong.gao, miroslav.pajic}@duke.edu. IBM Research, USA. Email: xi.yang@ibm.com. |
| Pseudocode | No | The paper includes mathematical formulations and descriptions of its components (e.g., VAE-MDP in Section 2.2) but does not provide any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | No | The paper mentions using 'open-sourced code provided by the authors for the generative and time-series augmentation methods' (Section 3, Baselines) for baseline comparisons, but it does not state that the code for its proposed method (OAT) is open-source or provide a link to its own code. |
| Open Datasets | Yes | We follow the experimental settings provided in Deep OPE benchmark, with 11 DAPG-based evaluation policies ranging from random to expert performance (Fu et al., 2021). |
| Dataset Splits | No | The paper describes training/testing splits for some environments (e.g., '80% (...) for training and (...) 20% for test' for Sepsis in Section 3.2, and 'six semesters as the training data (...) and test on the upcoming semester' for the Intelligent Tutor) and mentions hyperparameter tuning (Appendix B.2), which implies a validation set, but it does not explicitly state the proportions or methodology for a dedicated validation split. |
| Hardware Specification | Yes | Training of our method and baselines are supported by four NVIDIA TITAN Xp 12GB, three NVIDIA Quadro RTX 6000 24GB, and four NVIDIA RTX A5000 24GB GPUs. |
| Software Dependencies | No | The paper states, 'We implement the proposed method in Python.' (Appendix B.1) and mentions 'Adam optimizer is used to perform gradient descent.' (Appendix B.2), but it does not provide specific version numbers for Python, deep learning frameworks (e.g., TensorFlow, PyTorch), or other key libraries/dependencies used. |
| Experiment Setup | Yes | The maximum number of iterations is set to 100 and the minibatch size to 4 (given the small numbers of trajectories, i.e., 25 for each task) in Adroit, and to 1,000 and 64 for real-world healthcare and e-learning, respectively. The Adam optimizer is used to perform gradient descent. To determine the learning rate, the authors perform a grid search over {1e-4, 3e-3, 3e-4, 5e-4, 7e-4}. Exponential decay is applied, reducing the learning rate by a factor of 0.997 every iteration. |
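The reported schedule (grid search over initial learning rates, per-iteration exponential decay of 0.997) can be sketched in plain Python. This is a minimal illustration of the schedule described in Appendix B.2, not the authors' implementation (which is not released); the function names and the `lr_schedule` helper are our own.

```python
def decayed_lr(initial_lr: float, iteration: int, decay_rate: float = 0.997) -> float:
    """Learning rate after `iteration` steps of per-iteration exponential decay."""
    return initial_lr * decay_rate ** iteration

# Candidate initial learning rates from the paper's grid search (Appendix B.2).
LR_GRID = [1e-4, 3e-3, 3e-4, 5e-4, 7e-4]

def lr_schedule(initial_lr: float, max_iters: int):
    """Yield the learning rate used at each training iteration."""
    for it in range(max_iters):
        yield decayed_lr(initial_lr, it)

# Example: one grid point under the Adroit setting (100 iterations).
schedule = list(lr_schedule(3e-4, max_iters=100))
```

In a framework like PyTorch this corresponds to pairing `torch.optim.Adam` with `ExponentialLR(optimizer, gamma=0.997)` and stepping the scheduler once per iteration.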