Distributional Meta-Gradient Reinforcement Learning

Authors: Haiyan Yin, Shuicheng Yan, Zhongwen Xu

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | For empirical evaluation, we first present an illustrative example on a toy two-color grid-world domain, which validates the benefit of learning distributional return over expectation; then we conduct extensive comparisons on the large-scale RL benchmark Atari 2600, where we confirm that our proposed method with distributional return works seamlessly with the actor-critic framework and leads to a state-of-the-art median human normalized score in the meta-gradient RL literature. (A generic sketch contrasting a distributional value head with a scalar expectation appears after this table.)
Researcher Affiliation | Industry | Haiyan Yin, Shuicheng Yan & Zhongwen Xu, Sea AI Lab
Pseudocode | No | The paper does not contain a clearly labeled 'Pseudocode' or 'Algorithm' block.
Open Source Code | Yes | We also include the code for the two important algorithmic components, networks, and the two-level meta-update step function in our work.
Open Datasets | Yes | We testify our method on Atari 2600 Benchmark (Bellemare et al., 2013) under the 200M setting, where the setting aligns with prior works (Xu et al., 2018b; Zahavy et al., 2020).
Dataset Splits | No | The paper mentions 'training data' but does not specify explicit training, validation, or test splits (e.g., percentages, sample counts, or predefined split names).
Hardware Specification | No | The paper states 'Each task runs with identical hyperparameters and device configurations for hardware and software' but does not provide specific details such as GPU/CPU models or memory.
Software Dependencies | No | The paper mentions frameworks like IMPALA and ResNet but does not provide specific software dependency versions (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | For training, we strictly follow the 200M regime without extra data or experience reuse. ... All the games employ the same wrapper, for which the action repeat is 4, sticky action probability is 0, episodic life is false, and the maximum episode length is 108,000. All the tasks adopt the no-op starts protocol, where at the start of each episode, the agent randomly samples a period to take a dummy action 0 for up to 30 steps. (An environment-configuration sketch reflecting these settings appears after this table.)
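The Research Type row refers to "learning distributional return over expectation." The following is a minimal, generic sketch of that idea using a C51-style categorical value head; it is not the paper's architecture, PyTorch is an assumed framework (the paper does not state its dependencies), and all names (CategoricalValueHead, feature_dim, num_atoms, v_min, v_max) are hypothetical.

import torch
import torch.nn as nn

class CategoricalValueHead(nn.Module):
    """Predicts a categorical distribution over returns instead of a single scalar value."""
    def __init__(self, feature_dim: int, num_atoms: int = 51,
                 v_min: float = -10.0, v_max: float = 10.0):
        super().__init__()
        self.logits = nn.Linear(feature_dim, num_atoms)
        # Fixed support z_1, ..., z_K over which the return distribution is defined.
        self.register_buffer("atoms", torch.linspace(v_min, v_max, num_atoms))

    def forward(self, features: torch.Tensor):
        probs = torch.softmax(self.logits(features), dim=-1)  # P(Z = z_k | s)
        expected_value = (probs * self.atoms).sum(dim=-1)     # E[Z | s], the usual scalar critic value
        return probs, expected_value

# Usage: a batch of state features yields both the full return distribution and
# its expectation, so an actor-critic learner can consume either quantity.
head = CategoricalValueHead(feature_dim=256)
probs, value = head(torch.randn(8, 256))
print(probs.shape, value.shape)  # torch.Size([8, 51]) torch.Size([8])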
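The Experiment Setup row quotes a concrete Atari 2600 evaluation protocol (action repeat 4, sticky-action probability 0, no episodic-life termination, 108,000-frame episode cap, up to 30 no-op starts). The sketch below shows one way to configure an environment to match those settings, assuming the Gymnasium + ale-py interface; the paper does not name its wrapper library, and the helper make_atari_env is hypothetical.

import gymnasium as gym
import ale_py
from gymnasium.wrappers import AtariPreprocessing

gym.register_envs(ale_py)  # needed for Gymnasium >= 1.0; requires ale-py with ROMs installed

def make_atari_env(game: str = "ALE/Breakout-v5") -> gym.Env:
    env = gym.make(
        game,
        frameskip=1,                          # emit raw frames; frame skipping is handled by the wrapper below
        repeat_action_probability=0.0,        # sticky-action probability 0
        max_num_frames_per_episode=108_000,   # maximum episode length of 108,000 emulator frames
    )
    # Action repeat 4, up to 30 no-op starts, and no episodic-life termination.
    env = AtariPreprocessing(
        env,
        frame_skip=4,
        noop_max=30,
        terminal_on_life_loss=False,          # "episodic life is false"
    )
    return env

if __name__ == "__main__":
    env = make_atari_env()
    obs, info = env.reset(seed=0)
    print(obs.shape)  # 84x84 grayscale observation by default

Note that the 108,000-frame cap corresponds to 27,000 agent steps under an action repeat of 4, which is the standard 30-minute-of-gameplay limit for Atari evaluation.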