Augmenting Decision with Hypothesis in Reinforcement Learning

Authors: Nguyen Minh Quang, Hady W. Lauw

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our theoretical and empirical studies show evidence that it suffers from low exploitation in the early training period and bias sensitivity. To address these issues, we propose to augment the decision-making process with a hypothesis, a weak form of environment description. Our approach relies on prompting the learning agent with accurate hypotheses, and designing a ready-to-adapt policy through incremental learning. We propose the ALH algorithm, showing detailed analyses on a typical learning scheme and a diverse set of MuJoCo benchmarks.
Researcher Affiliation | Academia | School of Computing and Information Systems, Singapore Management University, 80 Stamford Road, Singapore 178902.
Pseudocode | Yes | Algorithm 1: Adaptive rollout; Algorithm 2: Empirical ALH algorithm.
Open Source Code | Yes | Our code is available at https://github.com/nbtpj/ALH.
Open Datasets | Yes | We train TD3 agents, the current SOTA of value-based RL, on the two introduced schemes in 10 trials over two million steps. The detailed experiments and benchmarks are described in Section 4. To better monitor the value-based RL agents, we introduce a simple simulation environment named Multi Norm Env: S = [0, 600]; A = [−6, 6]; T(s, a) = s + a... We compare our algorithm against TD3... on a set of eight MuJoCo continuous control tasks (Todorov et al., 2012). (A minimal environment sketch is given after the table.)
Dataset Splits | Yes | We evaluate the policy every 5000 training steps.
Hardware Specification | Yes | We run our experiments on a Linux environment with 56 CPUs and 8 Nvidia RTX 2080 Ti GPUs.
Software Dependencies | No | The paper mentions software such as TD3, MBPO, DDPG, PPO, and PyTorch, but it does not specify version numbers for these components (e.g., the "PyTorch-based implementation of PPO in Barhate (2021)" mentions PyTorch without a version number).
Experiment Setup | Yes | In all our reported experiments, we use δ_mem = 10, d_H = 64, B_mini = B/2, σ = 1. We adopt B = 256 in MuJoCo tasks, and B = 512 for a quick coverage in Multi Norm Env. For a fair comparison to the learning agent TD3, the noise factors σ, e, c and the hyper-parameters δ_policy, τ are adopted from the TD3 author implementation. (A hedged configuration sketch is given below.)
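
The quoted Multi Norm Env description (S = [0, 600]; A = [−6, 6]; T(s, a) = s + a) maps naturally onto a small gym-style environment. The sketch below only illustrates that quote under the Gymnasium API: the class name, episode horizon, and reward (a mixture of Gaussian bumps, guessed from the environment's name) are assumptions, not the authors' implementation, which lives in the linked repository.

```python
# Hedged sketch of the "Multi Norm Env" simulation quoted above.
# Only S = [0, 600], A = [-6, 6], and T(s, a) = s + a come from the source;
# the reward and horizon below are illustrative placeholders.
import numpy as np
import gymnasium as gym
from gymnasium import spaces


class MultiNormEnv(gym.Env):
    """Toy 1-D environment: the state moves by the chosen action each step."""

    def __init__(self, horizon=200):
        self.observation_space = spaces.Box(low=0.0, high=600.0, shape=(1,), dtype=np.float32)
        self.action_space = spaces.Box(low=-6.0, high=6.0, shape=(1,), dtype=np.float32)
        self.horizon = horizon  # placeholder episode length
        self._state = None
        self._t = 0

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self._state = self.np_random.uniform(0.0, 600.0, size=(1,)).astype(np.float32)
        self._t = 0
        return self._state.copy(), {}

    def step(self, action):
        # Deterministic transition T(s, a) = s + a, clipped to the state bounds.
        action = np.clip(action, -6.0, 6.0)
        self._state = np.clip(self._state + action, 0.0, 600.0).astype(np.float32)
        self._t += 1
        # Placeholder reward: a mixture of Gaussian bumps over the state axis
        # (assumed from the environment's name; the paper defines the true reward).
        centers, scale = np.array([150.0, 450.0]), 50.0
        reward = float(np.exp(-((self._state[0] - centers) ** 2) / (2 * scale**2)).sum())
        terminated = False
        truncated = self._t >= self.horizon
        return self._state.copy(), reward, terminated, truncated, {}
```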
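
Similarly, the quoted experiment setup can be collected into a single configuration object. The field names below are illustrative; the TD3-derived values are the defaults shipped with the TD3 author implementation that the quote defers to, not values stated in this report, and the mapping of δ_policy to the delayed policy-update period is an assumption.

```python
# Hedged sketch of the reported configuration. Only the values quoted above
# are from the source; everything else is labeled as assumed.
from dataclasses import dataclass


@dataclass
class ALHConfig:
    # Values quoted in the paper's setup description.
    delta_mem: int = 10        # δ_mem
    d_H: int = 64              # dimension d_H
    sigma: float = 1.0         # σ
    batch_size: int = 256      # B = 256 for MuJoCo tasks, 512 for Multi Norm Env

    # B_mini = B / 2 per the quote.
    @property
    def mini_batch_size(self) -> int:
        return self.batch_size // 2

    # Remaining hyper-parameters adopted from the TD3 author implementation
    # (that codebase's defaults; listed for completeness, not from this report).
    tau: float = 0.005         # target-network update rate τ
    expl_noise: float = 0.1    # exploration noise
    policy_noise: float = 0.2  # target policy smoothing noise
    noise_clip: float = 0.5    # noise clip c
    policy_freq: int = 2       # delayed policy-update period, assumed to be δ_policy


mujoco_cfg = ALHConfig(batch_size=256)     # MuJoCo tasks
multinorm_cfg = ALHConfig(batch_size=512)  # Multi Norm Env
```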