Adaptive Exploration for Data-Efficient General Value Function Evaluations
Authors: Arushi Jain, Josiah Hanna, Doina Precup
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically show our method's performance in tabular and nonlinear function approximation settings, including Mujoco environments, with stationary and non-stationary reward signals, optimizing data usage and reducing prediction errors across multiple GVFs. |
| Researcher Affiliation | Academia | Arushi Jain (arushi.jain@mail.mcgill.ca), McGill University, Mila; Josiah P. Hanna (jphanna@cs.wisc.edu), University of Wisconsin–Madison; Doina Precup (dprecup@cs.mcgill.ca), McGill University, Mila |
| Pseudocode | Yes | We present the GVFExplorer algorithm, detailed in Algorithm 1. Our approach uses two networks: Qθ for the value function and Mw for the variance, each with N heads (one head for each GVF); see the sketch after this table. |
| Open Source Code | Yes | The code is available on Github: https://github.com/arushijain94/GVFExplorer. |
| Open Datasets | No | The paper utilizes custom-built or extended simulation environments (gridworld, Four Rooms, continuous grid environment, Mujoco domains) for experiments, where data is generated through agent-environment interactions. It does not refer to or provide access to a pre-existing, publicly available dataset in the traditional sense that would require a specific link or citation. |
| Dataset Splits | No | The paper focuses on online reinforcement learning where performance is evaluated during training. It discusses learning rates, batch sizes, and replay buffers but does not specify explicit train/validation/test dataset splits as commonly found in supervised learning, nor does it refer to predefined splits from external datasets. |
| Hardware Specification | No | The paper states, "All the experiments require less than 1GB of memory and have used combined compute less than total 4 CPU months and 1 GPU month." This provides an estimate of total compute resources but does not specify the types or models of CPUs, GPUs, or other hardware components used for running the experiments. |
| Software Dependencies | No | The paper mentions using algorithms like "Expected Sarsa" and "Soft Actor-Critic (SAC)" and techniques like "Prioritized Experience Replay (PER)" but does not provide specific version numbers for these or any other software libraries, frameworks (e.g., TensorFlow, PyTorch), or programming languages used. |
| Experiment Setup | Yes | Input: Target policies πi, i ∈ {1, ..., n}, initial behavior policy µ1, replay buffer D, primary networks Qθ, Mw (M initialized to a small non-zero value), target networks Q̄θ, M̄w, learning rates αQ, αM, mini-batch size b, trajectory length T, target update frequency l = 100, value/variance update frequencies p = 4, m = 8, training steps K, exploration rates ε0, εdecay, εmin |
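
The Pseudocode and Experiment Setup rows describe a two-network architecture (Qθ for GVF values, Mw for return variances), each with one head per GVF. Below is a minimal sketch of that multi-head layout, assuming a PyTorch-style implementation; the released GitHub code is authoritative, and the layer sizes, hidden widths, and config field names here are illustrative assumptions rather than the paper's values.

```python
# Minimal sketch of a two-network, N-head layout (not the authors' code).
import torch
import torch.nn as nn

class MultiHeadNet(nn.Module):
    """Shared trunk with one output head per GVF (N heads)."""
    def __init__(self, obs_dim, n_actions, n_gvfs, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # One head per GVF, each predicting a per-action output.
        self.heads = nn.ModuleList(
            [nn.Linear(hidden, n_actions) for _ in range(n_gvfs)]
        )

    def forward(self, obs):
        z = self.trunk(obs)
        # Shape: (batch, n_gvfs, n_actions)
        return torch.stack([head(z) for head in self.heads], dim=1)

# Q-network for GVF values and M-network for their variances,
# mirroring the Qθ / Mw pairing in the algorithm's input list.
obs_dim, n_actions, n_gvfs = 8, 4, 3          # hypothetical sizes
q_net = MultiHeadNet(obs_dim, n_actions, n_gvfs)
m_net = MultiHeadNet(obs_dim, n_actions, n_gvfs)

# Update frequencies quoted in the Experiment Setup row; field names are illustrative.
config = dict(
    target_update_freq=100,   # l
    value_update_freq=4,      # p
    variance_update_freq=8,   # m
)

# Example forward pass: per-GVF action values for a batch of observations.
obs = torch.randn(32, obs_dim)
q_values = q_net(obs)   # shape (32, 3, 4)
```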