Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Design from Policies: Conservative Test-Time Adaptation for Offline Policy Optimization
Authors: Jinxin Liu, Hongyin Zhang, Zifeng Zhuang, Yachen Kang, Donglin Wang, Bin Wang
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we present our empirical results. We first give examples to illustrate the test-time adaptation. Then we evaluate DROP against prior offline RL algorithms on the D4RL benchmark. Finally, we provide the computation cost regarding the test-time adaptation protocol. |
| Researcher Affiliation | Collaboration | Jinxin Liu1,2 Hongyin Zhang1,2 Zifeng Zhuang1,2 Yachen Kang1,2 Donglin Wang1 Bin Wang3 1Westlake University 2Zhejiang University 3Huawei Noah's Ark Lab |
| Pseudocode | Yes | We now summarize the DROP algorithm (see Algorithm 1 for the training phase and Algorithm 2 for the testing phase). |
| Open Source Code | Yes | We provide our source code in the supplementary material. |
| Open Datasets | Yes | We evaluate DROP on a number of tasks from the D4RL dataset and make comparisons with prior non-iterative offline RL counterparts. |
| Dataset Splits | Yes | We evaluate DROP on a number of tasks from the D4RL dataset and make comparisons with prior non-iterative offline RL counterparts. ... We evaluate our results over 5 seeds. For each seed, instead of taking the final checkpoint model produced by a training loop, we take the last T (T = 6 in our experiments) checkpoint models, and evaluate them over 10 episodes for each checkpoint. |
| Hardware Specification | Yes | The experiments were run on a computational cluster with 22x GeForce RTX 2080 Ti, and 4x NVIDIA Tesla V100 32GB for 20 days. |
| Software Dependencies | No | The paper states 'Our code is based on d3rlpy' but does not provide specific version numbers for d3rlpy or any other software dependencies used in the experiments. |
| Experiment Setup | Yes | In Table 7, we provide the hyper-parameters of the task embedding ϕ(z|s), the contextual behavior policy β(a|s, z), and the score function f(s, a, z). ... For the gradient ascent update steps (used for embedding inference), we set K = 100 for all the embedding inference rules in experiments. |
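
The evaluation protocol quoted in the Dataset Splits row (5 seeds, the last T = 6 checkpoints per seed, 10 episodes per checkpoint) can be illustrated with a minimal sketch. This is not the authors' code: `evaluate_episode` is a hypothetical stand-in that returns a random number instead of rolling out a policy, and the aggregation (mean per seed, then mean across seeds) is an assumption, since the quoted text does not say how the returns are combined.

```python
import random
from statistics import mean

NUM_SEEDS = 5   # "We evaluate our results over 5 seeds"
LAST_T = 6      # last T checkpoints per seed (T = 6)
EPISODES = 10   # evaluation episodes per checkpoint


def evaluate_episode(seed, checkpoint, episode):
    """Hypothetical placeholder for one evaluation rollout.

    A real implementation would load the given checkpoint and run the
    policy in the environment; here we return a deterministic fake score.
    """
    rng = random.Random(seed * 10_000 + checkpoint * 100 + episode)
    return rng.uniform(0.0, 100.0)


def aggregate_scores(num_seeds=NUM_SEEDS, last_t=LAST_T, episodes=EPISODES):
    """Average returns over checkpoints and episodes within each seed,
    then average across seeds (an assumed aggregation scheme)."""
    per_seed = []
    for seed in range(num_seeds):
        returns = [
            evaluate_episode(seed, ckpt, ep)
            for ckpt in range(last_t)
            for ep in range(episodes)
        ]
        per_seed.append(mean(returns))
    return per_seed, mean(per_seed)


per_seed_scores, overall = aggregate_scores()
total_episodes = NUM_SEEDS * LAST_T * EPISODES  # 5 * 6 * 10 rollouts in total
```

Under these settings each reported number would rest on 300 evaluation episodes per task, 60 per seed.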