Augmenting Decision with Hypothesis in Reinforcement Learning
Authors: Nguyen Minh Quang, Hady W. Lauw
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our theoretical and empirical studies show evidence that it suffers from low exploitation in early training period and bias sensitiveness. To address these issues, we propose to augment the decision-making process with hypothesis, a weak form of environment description. Our approach relies on prompting the learning agent with accurate hypotheses, and designing a ready-to-adapt policy through incremental learning. We propose the ALH algorithm, showing detailed analyses on a typical learning scheme and a diverse set of Mujoco benchmarks. |
| Researcher Affiliation | Academia | 1School of Computing and Information Systems, Singapore Management University, 80 Stamford Road, Singapore 178902. |
| Pseudocode | Yes | Algorithm 1 Adaptive rollout; Algorithm 2 Empirical ALH algorithm |
| Open Source Code | Yes | Our code is available at https://github.com/nbtpj/ALH. |
| Open Datasets | Yes | We train TD3 agents, current SOTA of value-based RL, on two introduced schemes on 10 trials over two million steps. The detailed experiments and benchmarks are described in Section 4. To better monitor the value-based RL agents, we introduce a simple simulation environment named Multi Norm Env: S = [0, 600]; A = [−6, 6]; T(s, a) = s + a... We compare our algorithm against TD3... on a set of eight MuJoCo continuous control tasks (Todorov et al., 2012). (See the environment sketch below the table.) |
| Dataset Splits | Yes | We evaluate the policy every 5000 training steps. |
| Hardware Specification | Yes | We run our experiments on Linux environment with 56 CPUs, 8 Nvidia RTX2080Ti. |
| Software Dependencies | No | The paper mentions software like TD3, MBPO, DDPG, PPO, and PyTorch, but it does not specify version numbers for these software components (e.g., "pytorch-based implementation of PPO in Barhate (2021)" mentions PyTorch but no version number). |
| Experiment Setup | Yes | In all our reported experiments, we use δmem = 10, dH = 64, Bmini = B/2, σ = 1. We adopt B = 256 in MuJoCo tasks, and B = 512 for a quick coverage in Multi Norm Env. For a fair comparison to the learning agent TD3, noise factors σ, e, c, hyper-parameters δpolicy, τ are adopted from TD3 author implementation. (See the configuration sketch below the table.) |
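
The Open Datasets row quotes only the state space, action space, and transition of the toy environment (S = [0, 600], A = [−6, 6], T(s, a) = s + a). Below is a minimal sketch of such an environment, assuming a Gymnasium-style interface; the class name, reward function, and episode length are placeholders, since the excerpt truncates the paper's definition.

```python
# Minimal sketch of the "Multi Norm Env" described in the excerpt
# (S = [0, 600], A = [-6, 6], T(s, a) = s + a). The reward and the
# episode-termination rule are NOT given in the excerpt, so the ones
# below are placeholders, not the paper's definitions.
import numpy as np
import gymnasium as gym
from gymnasium import spaces


class MultiNormEnv(gym.Env):
    """1-D environment: state in [0, 600], action in [-6, 6], s' = s + a."""

    def __init__(self, horizon=200):
        self.observation_space = spaces.Box(low=0.0, high=600.0, shape=(1,), dtype=np.float32)
        self.action_space = spaces.Box(low=-6.0, high=6.0, shape=(1,), dtype=np.float32)
        self.horizon = horizon  # assumed episode length, not stated in the excerpt
        self.state = None
        self.t = 0

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.state = self.observation_space.sample()  # assumed uniform start state
        self.t = 0
        return self.state, {}

    def step(self, action):
        # Deterministic transition from the excerpt: T(s, a) = s + a, clipped to S.
        action = np.clip(action, -6.0, 6.0)
        self.state = np.clip(self.state + action, 0.0, 600.0).astype(np.float32)
        self.t += 1
        reward = 0.0  # placeholder; the paper's reward is not shown in the excerpt
        terminated = False
        truncated = self.t >= self.horizon
        return self.state, reward, terminated, truncated, {}
```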
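
The Experiment Setup row lists the hyperparameters as free-standing symbols. The sketch below simply gathers the reported values in one place; every key name is an illustrative choice rather than the authors' identifier, and values adopted from the TD3 author implementation are not restated.

```python
# Hypothetical consolidation of the hyperparameters quoted in the table.
# Key names are illustrative, not taken from the ALH codebase.
B_MUJOCO = 256      # batch size B reported for MuJoCo tasks
B_MULTINORM = 512   # larger B "for a quick coverage" in Multi Norm Env

alh_config = {
    "delta_mem": 10,          # δmem
    "d_H": 64,                # dH
    "B": B_MUJOCO,            # swap in B_MULTINORM for the toy environment
    "B_mini": B_MUJOCO // 2,  # Bmini = B/2
    "sigma": 1,               # σ
    "eval_every": 5_000,      # policy evaluated every 5000 training steps
    # Noise factors and δpolicy, τ follow the TD3 author implementation;
    # their values are not restated in the excerpt, so they are omitted here.
}
```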