Efficient Meta Reinforcement Learning for Preference-based Fast Adaptation
Authors: Zhizhou Ren, Anji Liu, Yitao Liang, Jian Peng, Jianzhu Ma
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In experiments, we extensively evaluate our method, Adaptation with Noisy OracLE (ANOLE), on a variety of meta-RL benchmark tasks and demonstrate substantial improvement over baseline algorithms in terms of both feedback efficiency and error tolerance. ... In this section, we investigate the empirical performance of ANOLE on a suite of meta-RL benchmark tasks. We compare our method with simple preference-based adaptation strategies and conduct several ablation studies to demonstrate the effectiveness of our algorithmic designs. |
| Researcher Affiliation | Collaboration | (1) Helixon Ltd.; (2) University of Illinois at Urbana-Champaign; (3) University of California, Los Angeles; (4) Institute for Artificial Intelligence, Peking University; (5) Beijing Institute for General Artificial Intelligence; (6) Institute for AI Industry Research, Tsinghua University |
| Pseudocode | Yes | Algorithm 1: Adaptation with Noisy OracLE (ANOLE) |
| Open Source Code | Yes | The source code of our ANOLE implementation and experiment scripts are available at https://github.com/Stilwell-Git/Adaptation-with-Noisy-OracLE. |
| Open Datasets | Yes | We adopt six meta-RL benchmark tasks created by Rothfuss et al. (2019)... The preference feedback is simulated by summing step-wise rewards given by the MuJoCo-based environment simulator... (An oracle-simulation sketch follows the table.) |
| Dataset Splits | No | The paper discusses meta-training and meta-testing phases, but does not provide specific quantitative dataset splits (e.g., percentages or sample counts) for training, validation, or test data points for their experiments. While meta-RL involves training and testing on tasks, explicit data partitioning for reproducibility is not detailed. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions using soft actor-critic (SAC; Haarnoja et al., 2018) and a meta-training procedure similar to PEARL (Rakelly et al., 2019), and Adam optimizer (Kingma and Ba, 2015). However, it does not specify version numbers for any of these software components, libraries, or programming languages used. |
| Experiment Setup | Yes | The adaptation algorithms are restricted to use K = 10 preference queries, and the noisy oracle would give wrong feedback for each query with probability ϵ = 0.2. We configure ANOLE with K_E = 2 to compute Berlekamp's volume. The preference predictor fψ(·; z) is a 2-layer MLP with ReLU activation, and we train it with the Adam optimizer (Kingma and Ba, 2015) using a batch size of 256. The learning rate for the preference predictor is 0.001. (A configuration sketch follows the table.) |
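
The simulated noisy oracle quoted in the Open Datasets and Experiment Setup rows is compact enough to sketch. Below is a minimal Python illustration, assuming the oracle compares trajectories by summed step-wise reward and flips its answer with probability ϵ = 0.2; the function name, signature, and tie-breaking rule are illustrative assumptions, not taken from the paper.

```python
import random

def simulated_preference(rewards_a, rewards_b, eps=0.2):
    """Simulate the noisy preference oracle: the ground-truth comparison
    sums step-wise environment rewards, and the answer is flipped with
    probability eps (eps = 0.2 in the paper). Returns True when
    trajectory `a` is reported as preferred over trajectory `b`."""
    correct = sum(rewards_a) >= sum(rewards_b)  # true preference by return
    if random.random() < eps:                   # wrong feedback w.p. eps
        return not correct
    return correct

# Example: trajectory a has the higher return, but the oracle may
# still misreport the comparison with probability 0.2.
print(simulated_preference([1.0, 2.0, 3.0], [0.5, 0.5, 0.5]))
```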
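
Similarly, the Experiment Setup row pins down the preference predictor's architecture and optimizer. The following is a minimal PyTorch sketch under stated assumptions: the framework, input dimensions, hidden width, and concatenation scheme are guesses; only the 2-layer MLP with ReLU, the Adam optimizer, the 0.001 learning rate, and the batch size of 256 come from the paper.

```python
import torch
import torch.nn as nn

# Hypothetical sizes; the paper does not state input dimensions.
TRAJ_DIM, Z_DIM, HIDDEN = 128, 16, 256

# 2-layer MLP with ReLU, f_psi(.; z): scores a trajectory embedding
# concatenated with a task embedding z (architecture per the paper;
# the concatenation scheme is an assumption).
predictor = nn.Sequential(
    nn.Linear(TRAJ_DIM + Z_DIM, HIDDEN),
    nn.ReLU(),
    nn.Linear(HIDDEN, 1),  # scalar preference score
)

# Optimizer settings quoted in the table: Adam, lr = 0.001, batch size 256.
optimizer = torch.optim.Adam(predictor.parameters(), lr=1e-3)
BATCH_SIZE = 256
```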