Adaptive Exploration for Data-Efficient General Value Function Evaluations

Authors: Arushi Jain, Josiah Hanna, Doina Precup

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically show our method's performance in tabular and nonlinear function approximation settings, including MuJoCo environments, with stationary and non-stationary reward signals, optimizing data usage and reducing prediction errors across multiple GVFs.
Researcher Affiliation | Academia | Arushi Jain (arushi.jain@mail.mcgill.ca), McGill University / Mila; Josiah P. Hanna (jphanna@cs.wisc.edu), University of Wisconsin-Madison; Doina Precup (dprecup@cs.mcgill.ca), McGill University / Mila
Pseudocode | Yes | We present the GVFExplorer algorithm, detailed in Algorithm 1. Our approach uses two networks: Qθ for the value function and Mw for the variance, each with N heads (one head for each GVF). A minimal illustrative sketch of this two-network, N-head design is given after the table.
Open Source Code | Yes | The code is available on GitHub: https://github.com/arushijain94/GVFExplorer.
Open Datasets | No | The paper uses custom-built or extended simulation environments (gridworld, Four Rooms, a continuous grid environment, MuJoCo domains), where data is generated through agent-environment interactions. It does not reference or provide access to a pre-existing, publicly available dataset with a specific link or citation.
Dataset Splits | No | The paper focuses on online reinforcement learning, where performance is evaluated during training. It discusses learning rates, batch sizes, and replay buffers, but does not specify explicit train/validation/test splits as commonly found in supervised learning, nor does it refer to predefined splits from external datasets.
Hardware Specification | No | The paper states, "All the experiments require less than 1GB of memory and have used combined compute less than total 4 CPU months and 1 GPU month." This provides an estimate of total compute resources but does not specify the types or models of CPUs, GPUs, or other hardware components used for running the experiments.
Software Dependencies | No | The paper mentions using algorithms like "Expected Sarsa" and "Soft Actor-Critic (SAC)" and techniques like "Prioritized Experience Replay (PER)" but does not provide specific version numbers for these or any other software libraries, frameworks (e.g., TensorFlow, PyTorch), or programming languages used.
Experiment Setup | Yes | Input: target policies πi, i ∈ {1, ..., n}; initial behavior policy µ1; replay buffer D; primary networks Qθ, Mw (M initialized to a small non-zero value); target networks Q̄θ, M̄w; learning rates αQ, αM; mini-batch size b; trajectory length T; target update frequency l = 100; value/variance update frequencies p = 4, m = 8; training steps K; exploration rates ε0, εdecay, εmin.
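The Pseudocode row above describes a two-network design, Qθ for values and Mw for variances, each with one output head per GVF. The following is a minimal sketch of that idea in PyTorch, assuming a shared torso with N scalar heads; the layer widths, activations, and observation/action dimensions are illustrative assumptions, not the authors' implementation (which is in the linked repository).

```python
# Sketch (not the authors' code) of the two-network, N-head design:
# Q_theta estimates each GVF's value; M_w estimates its variance.
import torch
import torch.nn as nn

class MultiHeadNet(nn.Module):
    """Shared torso with one scalar output head per GVF."""
    def __init__(self, obs_dim: int, act_dim: int, n_gvfs: int, hidden: int = 256):
        super().__init__()
        self.torso = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # One head per GVF (N heads in total).
        self.heads = nn.ModuleList([nn.Linear(hidden, 1) for _ in range(n_gvfs)])

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        z = self.torso(torch.cat([obs, act], dim=-1))
        # Stack head outputs: shape (batch, N), one estimate per GVF.
        return torch.cat([h(z) for h in self.heads], dim=-1)

# Q_theta (values) and M_w (variances); dimensions here are placeholders.
# A variance network would typically also constrain outputs to be non-negative
# (e.g., via softplus); that detail is omitted in this sketch.
q_net = MultiHeadNet(obs_dim=17, act_dim=6, n_gvfs=4)
m_net = MultiHeadNet(obs_dim=17, act_dim=6, n_gvfs=4)
```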
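For the Experiment Setup row, a hedged configuration sketch of the Algorithm 1 inputs is shown below. Only l = 100, p = 4, and m = 8 are stated explicitly in the excerpt; every other value is a placeholder assumption, not a setting reported by the authors.

```python
# Hypothetical configuration mirroring the Algorithm 1 inputs; placeholder
# values are marked as such in the comments.
config = dict(
    n_gvfs=4,                  # number of target policies / GVFs (placeholder)
    lr_q=3e-4,                 # alpha_Q, value-network learning rate (placeholder)
    lr_m=3e-4,                 # alpha_M, variance-network learning rate (placeholder)
    batch_size=256,            # mini-batch size b (placeholder)
    trajectory_length=1000,    # T (placeholder)
    target_update_freq=100,    # l = 100 (from the paper excerpt)
    value_update_freq=4,       # p = 4 (from the paper excerpt)
    variance_update_freq=8,    # m = 8 (from the paper excerpt)
    training_steps=1_000_000,  # K (placeholder)
    eps_start=1.0,             # epsilon_0 (placeholder)
    eps_decay=0.995,           # epsilon_decay (placeholder)
    eps_min=0.05,              # epsilon_min (placeholder)
)
```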