Design from Policies: Conservative Test-Time Adaptation for Offline Policy Optimization

Authors: Jinxin Liu, Hongyin Zhang, Zifeng Zhuang, Yachen Kang, Donglin Wang, Bin Wang

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we present our empirical results. We first give examples to illustrate the test-time adaptation. Then we evaluate DROP against prior offline RL algorithms on the D4RL benchmark. Finally, we provide the computation cost regarding the test-time adaptation protocol.
Researcher Affiliation | Collaboration | Jinxin Liu (1,2), Hongyin Zhang (1,2), Zifeng Zhuang (1,2), Yachen Kang (1,2), Donglin Wang (1), Bin Wang (3); 1 Westlake University, 2 Zhejiang University, 3 Huawei Noah's Ark Lab
Pseudocode | Yes | We now summarize the DROP algorithm (see Algorithm 1 for the training phase and Algorithm 2 for the testing phase).
Open Source Code | Yes | We provide our source code in the supplementary material.
Open Datasets | Yes | We evaluate DROP on a number of tasks from the D4RL dataset and make comparisons with prior non-iterative offline RL counterparts. (A dataset-loading sketch follows the table.)
Dataset Splits | Yes | We evaluate DROP on a number of tasks from the D4RL dataset and make comparisons with prior non-iterative offline RL counterparts. ... We evaluate our results over 5 seeds. For each seed, instead of taking the final checkpoint model produced by a training loop, we take the last T (T = 6 in our experiments) checkpoint models and evaluate them over 10 episodes for each checkpoint. (See the evaluation-protocol sketch below.)
Hardware Specification | Yes | The experiments were run on a computational cluster with 22x GeForce RTX 2080 Ti and 4x NVIDIA Tesla V100 32GB for 20 days.
Software Dependencies | No | The paper states 'Our code is based on d3rlpy' but does not provide specific version numbers for d3rlpy or any other software dependencies used in the experiments.
Experiment Setup | Yes | In Table 7, we provide the hyper-parameters of the task embedding ϕ(z|s), the contextual behavior policy β(a|s, z), and the score function f(s, a, z). ... For the gradient ascent update steps (used for embedding inference), we set K = 100 for all the embedding inference rules in experiments. (See the embedding-inference sketch below.)
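
For context on the Open Datasets row, the following is a minimal sketch of loading a D4RL task with the standard d4rl Python package. The task name is illustrative only; this excerpt does not list the specific D4RL tasks the paper uses.

```python
import gym
import d4rl  # importing d4rl registers the D4RL environments with gym

# Task name is illustrative; the excerpt does not name the exact tasks evaluated.
env = gym.make("halfcheetah-medium-v2")

# Transition-level dataset with observations, actions, rewards, terminals, next_observations.
dataset = d4rl.qlearning_dataset(env)
print(dataset["observations"].shape, dataset["actions"].shape)

# D4RL also normalizes raw returns against random/expert reference scores.
print(env.get_normalized_score(4000.0))
```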
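
The Dataset Splits row describes the evaluation protocol: 5 seeds, the last T = 6 checkpoints per seed, and 10 evaluation episodes per checkpoint. Below is a minimal aggregation sketch; load_checkpoint and evaluate_policy are hypothetical placeholders, not functions from the released code.

```python
import numpy as np

N_SEEDS, T, N_EPISODES = 5, 6, 10  # values stated in the paper excerpt

def aggregate_scores(load_checkpoint, evaluate_policy):
    """Average returns over seeds x last-T checkpoints x episodes (hypothetical helpers)."""
    per_checkpoint_returns = []
    for seed in range(N_SEEDS):
        for ckpt in range(T):  # the last T checkpoints of the training loop
            policy = load_checkpoint(seed, ckpt)  # placeholder loader
            episode_returns = [evaluate_policy(policy) for _ in range(N_EPISODES)]  # placeholder rollout
            per_checkpoint_returns.append(np.mean(episode_returns))
    returns = np.asarray(per_checkpoint_returns)
    return returns.mean(), returns.std()
```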
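
The Experiment Setup row notes K = 100 gradient-ascent steps for test-time embedding inference over the learned components ϕ(z|s), β(a|s, z), and f(s, a, z). The sketch below illustrates that inner loop under assumptions: the objective (ascending the score of the action proposed by the contextual behavior policy), the optimizer, the learning rate, and the function signatures are not specified in this excerpt.

```python
import torch

def infer_embedding(score_fn, behavior_policy, state, z_init, K=100, lr=3e-4):
    """Test-time embedding inference by gradient ascent on z (illustrative sketch).

    Only K = 100 and the roles of beta(a|s, z) and f(s, a, z) come from the
    paper excerpt; the objective, optimizer, and signatures are assumptions.
    """
    z = z_init.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([z], lr=lr)
    for _ in range(K):
        action = behavior_policy(state, z)   # a from beta(a|s, z), assumed differentiable in z
        score = score_fn(state, action, z)   # f(s, a, z)
        loss = -score.mean()                 # negate so the step performs gradient ascent
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return z.detach()
```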