Efficient Meta Reinforcement Learning for Preference-based Fast Adaptation

Authors: Zhizhou Ren, Anji Liu, Yitao Liang, Jian Peng, Jianzhu Ma

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In experiments, we extensively evaluate our method, Adaptation with Noisy OracLE (ANOLE), on a variety of meta-RL benchmark tasks and demonstrate substantial improvement over baseline algorithms in terms of both feedback efficiency and error tolerance. (Section 4, Experiments) In this section, we investigate the empirical performance of ANOLE on a suite of meta-RL benchmark tasks. We compare our method with simple preference-based adaptation strategies and conduct several ablation studies to demonstrate the effectiveness of our algorithmic designs.
Researcher Affiliation | Collaboration | 1 Helixon Ltd.; 2 University of Illinois at Urbana-Champaign; 3 University of California, Los Angeles; 4 Institute for Artificial Intelligence, Peking University; 5 Beijing Institute for General Artificial Intelligence; 6 Institute for AI Industry Research, Tsinghua University
Pseudocode | Yes | Algorithm 1: Adaptation with Noisy OracLE (ANOLE)
Open Source Code | Yes | The source code of our ANOLE implementation and experiment scripts are available at https://github.com/Stilwell-Git/Adaptation-with-Noisy-OracLE.
Open Datasets | Yes | We adopt six meta-RL benchmark tasks created by Rothfuss et al. (2019)... The preference feedback is simulated by summing step-wise rewards given by the MuJoCo-based environment simulator... (an illustrative sketch of such a simulated noisy oracle is given after the table)
Dataset Splits | No | The paper discusses meta-training and meta-testing phases but does not provide quantitative dataset splits (e.g., percentages or sample counts) for training, validation, or test data in its experiments. While meta-RL involves training and testing on tasks, explicit data partitioning for reproducibility is not detailed.
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used to run the experiments.
Software Dependencies | No | The paper mentions using soft actor-critic (SAC; Haarnoja et al., 2018), a meta-training procedure similar to PEARL (Rakelly et al., 2019), and the Adam optimizer (Kingma and Ba, 2015), but it does not specify version numbers for any of these software components, libraries, or programming languages.
Experiment Setup | Yes | The adaptation algorithms are restricted to use K = 10 preference queries, and the noisy oracle would give wrong feedback for each query with probability ϵ = 0.2. We configure ANOLE with KE = 2 to compute Berlekamp's volume. The preference predictor fψ(·; z) is a 2-layer MLP with ReLU activation, and we train it with the Adam optimizer (Kingma and Ba, 2015) using a batch size of 256. The learning rate for the preference predictor is 0.001.
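
To make the simulated-feedback setup from the Open Datasets row concrete, here is a minimal sketch, assuming the oracle prefers the trajectory with the larger sum of step-wise rewards and flips its answer with probability ϵ. The function and argument names (noisy_preference_oracle, traj_a_rewards, traj_b_rewards) are illustrative and not taken from the released codebase.

import numpy as np

def noisy_preference_oracle(traj_a_rewards, traj_b_rewards, eps=0.2, rng=None):
    """Return 0 if trajectory A is reported as preferred, 1 otherwise.

    The true preference is decided by the summed step-wise rewards; with
    probability eps the oracle flips its answer (wrong feedback).
    """
    rng = rng if rng is not None else np.random.default_rng()
    true_pref = 0 if float(np.sum(traj_a_rewards)) >= float(np.sum(traj_b_rewards)) else 1
    if rng.random() < eps:
        return 1 - true_pref  # erroneous feedback with probability eps
    return true_pref

# Example usage with the paper's reported query budget and noise level (K = 10, eps = 0.2).
rng = np.random.default_rng(0)
answers = [noisy_preference_oracle([1.0, 2.0], [0.5, 0.5], eps=0.2, rng=rng) for _ in range(10)]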
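
Similarly, the Experiment Setup row pins down only part of the preference predictor fψ(·; z): a 2-layer MLP with ReLU activation, trained with Adam at learning rate 0.001 and batch size 256. Below is a hedged PyTorch sketch; the input and hidden dimensions, the concatenation of the task embedding z with the input, and the class name PreferencePredictor are assumptions, not the paper's specification.

import torch
import torch.nn as nn

class PreferencePredictor(nn.Module):
    """Hypothetical 2-layer MLP preference predictor f_psi(.; z) with ReLU."""

    def __init__(self, input_dim, latent_dim, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim + latent_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),  # scalar preference score
        )

    def forward(self, x, z):
        # Concatenating the task embedding z with the input is an assumption here.
        return self.net(torch.cat([x, z], dim=-1))

# Optimizer settings reported in the paper: Adam, learning rate 0.001, batch size 256.
model = PreferencePredictor(input_dim=32, latent_dim=8)  # dimensions are placeholders
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
batch_size = 256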