Augmenting Decision with Hypothesis in Reinforcement Learning

Authors: Nguyen Minh Quang, Hady W. Lauw

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our theoretical and empirical studies show evidence that it suffers from low exploitation in the early training period and bias sensitivity. To address these issues, we propose to augment the decision-making process with a hypothesis, a weak form of environment description. Our approach relies on prompting the learning agent with accurate hypotheses, and designing a ready-to-adapt policy through incremental learning. We propose the ALH algorithm, showing detailed analyses on a typical learning scheme and a diverse set of MuJoCo benchmarks.
Researcher Affiliation | Academia | School of Computing and Information Systems, Singapore Management University, 80 Stamford Road, Singapore 178902.
Pseudocode | Yes | Algorithm 1: Adaptive rollout; Algorithm 2: Empirical ALH algorithm.
Open Source Code | Yes | Our code is available at https://github.com/nbtpj/ALH.
Open Datasets | Yes | We train TD3 agents, the current SOTA of value-based RL, on the two introduced schemes in 10 trials over two million steps. The detailed experiments and benchmarks are described in Section 4. To better monitor the value-based RL agents, we introduce a simple simulation environment named Multi Norm Env: S = [0, 600]; A = [−6, 6]; T(s, a) = s + a... We compare our algorithm against TD3... on a set of eight MuJoCo continuous control tasks (Todorov et al., 2012). (A minimal environment sketch is given after the table.)
Dataset Splits | Yes | We evaluate the policy every 5000 training steps.
Hardware Specification | Yes | We run our experiments on a Linux environment with 56 CPUs and 8 Nvidia RTX 2080 Ti GPUs.
Software Dependencies | No | The paper mentions software such as TD3, MBPO, DDPG, PPO, and PyTorch, but it does not specify version numbers for these components (e.g., the "PyTorch-based implementation of PPO in Barhate (2021)" mentions PyTorch without a version number).
Experiment Setup | Yes | In all our reported experiments, we use δ_mem = 10, d_H = 64, B_mini = B/2, σ = 1. We adopt B = 256 in MuJoCo tasks, and B = 512 for a quick coverage in Multi Norm Env. For a fair comparison to the learning agent TD3, the noise factors σ, e, c and the hyper-parameters δ_policy, τ are adopted from the TD3 author implementation. (A hedged configuration sketch is given below.)
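
The quoted Multi Norm Env description (S = [0, 600]; A = [−6, 6]; T(s, a) = s + a) maps naturally onto a small gym-style environment. The sketch below only illustrates that quote under the Gymnasium API: the class name, episode horizon, and reward (a mixture of Gaussian bumps, guessed from the environment's name) are assumptions, not the authors' implementation, which lives in the linked repository.

```python
# Hedged sketch of the "Multi Norm Env" simulation quoted above.
# Only S = [0, 600], A = [-6, 6], and T(s, a) = s + a come from the source;
# the reward and horizon below are illustrative placeholders.
import numpy as np
import gymnasium as gym
from gymnasium import spaces


class MultiNormEnv(gym.Env):
    """Toy 1-D environment: the state moves by the chosen action each step."""

    def __init__(self, horizon=200):
        self.observation_space = spaces.Box(low=0.0, high=600.0, shape=(1,), dtype=np.float32)
        self.action_space = spaces.Box(low=-6.0, high=6.0, shape=(1,), dtype=np.float32)
        self.horizon = horizon  # placeholder episode length
        self._state = None
        self._t = 0

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self._state = self.np_random.uniform(0.0, 600.0, size=(1,)).astype(np.float32)
        self._t = 0
        return self._state.copy(), {}

    def step(self, action):
        # Deterministic transition T(s, a) = s + a, clipped to the state bounds.
        action = np.clip(action, -6.0, 6.0)
        self._state = np.clip(self._state + action, 0.0, 600.0).astype(np.float32)
        self._t += 1
        # Placeholder reward: a mixture of Gaussian bumps over the state axis
        # (assumed from the environment's name; the paper defines the true reward).
        centers, scale = np.array([150.0, 450.0]), 50.0
        reward = float(np.exp(-((self._state[0] - centers) ** 2) / (2 * scale**2)).sum())
        terminated = False
        truncated = self._t >= self.horizon
        return self._state.copy(), reward, terminated, truncated, {}
```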
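
Similarly, the quoted experiment setup can be collected into a single configuration object. The field names below are illustrative; the TD3-derived values are the defaults shipped with the TD3 author implementation that the quote defers to, not values stated in this report, and the mapping of δ_policy to the delayed policy-update period is an assumption.

```python
# Hedged sketch of the reported configuration. Only the values quoted above
# are from the source; everything else is labeled as assumed.
from dataclasses import dataclass


@dataclass
class ALHConfig:
    # Values quoted in the paper's setup description.
    delta_mem: int = 10        # δ_mem
    d_H: int = 64              # dimension d_H
    sigma: float = 1.0         # σ
    batch_size: int = 256      # B = 256 for MuJoCo tasks, 512 for Multi Norm Env

    # B_mini = B / 2 per the quote.
    @property
    def mini_batch_size(self) -> int:
        return self.batch_size // 2

    # Remaining hyper-parameters adopted from the TD3 author implementation
    # (that codebase's defaults; listed for completeness, not from this report).
    tau: float = 0.005         # target-network update rate τ
    expl_noise: float = 0.1    # exploration noise
    policy_noise: float = 0.2  # target policy smoothing noise
    noise_clip: float = 0.5    # noise clip c
    policy_freq: int = 2       # delayed policy-update period, assumed to be δ_policy


mujoco_cfg = ALHConfig(batch_size=256)     # MuJoCo tasks
multinorm_cfg = ALHConfig(batch_size=512)  # Multi Norm Env
```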