Design from Policies: Conservative Test-Time Adaptation for Offline Policy Optimization
Authors: Jinxin Liu, Hongyin Zhang, Zifeng Zhuang, Yachen Kang, Donglin Wang, Bin Wang
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we present our empirical results. We first give examples to illustrate the test-time adaptation. Then we evaluate DROP against prior offline RL algorithms on the D4RL benchmark. Finally, we provide the computation cost regarding the test-time adaptation protocol. |
| Researcher Affiliation | Collaboration | Jinxin Liu1,2 Hongyin Zhang1,2 Zifeng Zhuang1,2 Yachen Kang1,2 Donglin Wang1 Bin Wang3 1Westlake University 2Zhejiang University 3Huawei Noah's Ark Lab |
| Pseudocode | Yes | We now summarize the DROP algorithm (see Algorithm 1 for the training phase and Algorithm 2 for the testing phase). |
| Open Source Code | Yes | We provide our source code in the supplementary material. |
| Open Datasets | Yes | We evaluate DROP on a number of tasks from the D4RL dataset and make comparisons with prior non-iterative offline RL counterparts. |
| Dataset Splits | Yes | We evaluate DROP on a number of tasks from the D4RL dataset and make comparisons with prior non-iterative offline RL counterparts. ... We evaluate our results over 5 seeds. For each seed, instead of taking the final checkpoint model produced by a training loop, we take the last T (T = 6 in our experiments) checkpoint models, and evaluate them over 10 episodes for each checkpoint. |
| Hardware Specification | Yes | The experiments were run on a computational cluster with 22x GeForce RTX 2080 Ti, and 4x NVIDIA Tesla V100 32GB for 20 days. |
| Software Dependencies | No | The paper states 'Our code is based on d3rlpy' but does not provide specific version numbers for d3rlpy or any other software dependencies used in the experiments. |
| Experiment Setup | Yes | In Table 7, we provide the hyper-parameters of the task embedding ϕ(z|s), the contextual behavior policy β(a|s, z), and the score function f(s, a, z). ... For the gradient ascent update steps (used for embedding inference), we set K = 100 for all the embedding inference rules in experiments. |
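The evaluation protocol quoted under Dataset Splits (5 seeds, last T = 6 checkpoints per seed, 10 episodes per checkpoint) can be summarized with a minimal Python sketch. This is not the authors' code: the `evaluate_checkpoint` helper and the use of normalized returns are assumptions made only to illustrate how the reported scores would be aggregated.

```python
import numpy as np

# Sketch of the quoted evaluation protocol: 5 seeds, last T = 6 checkpoints
# per seed, 10 evaluation episodes per checkpoint.
NUM_SEEDS = 5
LAST_T_CHECKPOINTS = 6
EPISODES_PER_CHECKPOINT = 10

def evaluate_checkpoint(checkpoint, env, num_episodes):
    """Placeholder: roll out a saved policy on a D4RL task and return
    one normalized return per episode."""
    raise NotImplementedError

def aggregate_scores(checkpoints_per_seed, env):
    """Average normalized returns over seeds, checkpoints, and episodes."""
    per_seed_means = []
    for checkpoints in checkpoints_per_seed:            # one list of models per seed
        last_checkpoints = checkpoints[-LAST_T_CHECKPOINTS:]
        returns = [
            r
            for ckpt in last_checkpoints
            for r in evaluate_checkpoint(ckpt, env, EPISODES_PER_CHECKPOINT)
        ]
        per_seed_means.append(np.mean(returns))
    # Report mean and spread across the 5 seeds.
    return np.mean(per_seed_means), np.std(per_seed_means)
```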
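The Experiment Setup row also quotes K = 100 gradient-ascent steps for test-time embedding inference. The sketch below shows one plausible form of that loop under stated assumptions: `score_fn` (the learned f(s, a, z)), `policy` (the contextual behavior policy β(a|s, z)), the step size, and the exact ascent objective are placeholders, not the paper's implementation.

```python
import torch

K_STEPS = 100       # from the paper's quoted setting
STEP_SIZE = 1e-2    # assumed; not specified in the quoted text

def infer_embedding(score_fn, policy, state, z_dim):
    """Test-time inference of the task embedding z by gradient ascent
    on the learned score function (sketch; objective form is assumed)."""
    z = torch.zeros(z_dim, requires_grad=True)
    optimizer = torch.optim.Adam([z], lr=STEP_SIZE)
    for _ in range(K_STEPS):
        action = policy(state, z)            # contextual behavior policy beta(a|s, z)
        score = score_fn(state, action, z)   # learned score f(s, a, z)
        loss = -score.mean()                 # negate so that a minimizer ascends the score
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return z.detach()
```

The negated score is minimized with Adam purely for convenience; any first-order update over z for the fixed number of K = 100 steps matches the quoted setting.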