Efficient Scheduling of Data Augmentation for Deep Reinforcement Learning

Authors: Byungchan Ko, Jungseul Ok

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 5 Experiment: Train and test tasks. We use the OpenAI Procgen benchmark of 16 video games [5], in which a main character tries to achieve a specific goal, e.g., finding the exit (Maze) or collecting coins (Coinrun), while avoiding enemies on a 2D map. At each time t, a visual observation o_t is given as a 64 × 64 image. A train or test task is to achieve a high score on a set of environments configured by game and mode, where a mode describes predefined sets of levels (e.g., map complexity) and backgrounds. Cobbe et al. [5] provide an easy mode for each game, consisting of 200 levels and a certain set of backgrounds. ... All results in the main paper are averaged over five runs.
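
For context on the benchmark described above, a minimal sketch of instantiating one Procgen environment with the procgen pip package and OpenAI Gym; the authors' custom easybg mode is not a built-in option, so the standard easy mode is shown here (an assumption, not the paper's exact configuration):

    # Sketch (assumption): one Procgen game in the standard easy mode; the
    # paper's custom easybg mode (single background) is not a built-in option.
    import gym

    train_env = gym.make(
        "procgen:procgen-coinrun-v0",   # one of the 16 Procgen games
        num_levels=200,                 # easy mode uses 200 training levels
        start_level=0,
        distribution_mode="easy",
    )

    obs = train_env.reset()
    print(obs.shape)  # (64, 64, 3): the 64 x 64 visual observation o_t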
Researcher Affiliation | Collaboration | Byungchan Ko (NALBI, kbc@nalbi.ai) and Jungseul Ok (GSAI, POSTECH, jungseul@postech.ac.kr). This work was done while Byungchan Ko studied in GSAI, POSTECH.
Pseudocode | Yes | Algorithm 1 (InDA):

    Require: N, I, ϕ, S, T
     1: Initialize θ close to the origin.
     2: for n = 1, 2, . . . , N do
     3:     // RL training
     4:     Store sampled transitions to D;
     5:     Optimize RL objective L_PPO(θ) with D;
     6:     // Distillation
     7:     if n ∈ [S, T] and mod(n − 1, I) = 0 then
     8:         Store θ_old ← θ;
     9:         Minimize L_DA(θ) for D, θ_old and ϕ;
    10:     end if
    11: end for
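
A rough Python rendering of this schedule follows; collect_rollout, ppo_update, and distill_update are hypothetical stand-ins for the paper's L_PPO and L_DA optimization steps, so this is a sketch of the loop structure only, not the authors' implementation:

    # Hypothetical sketch of the InDA schedule in Algorithm 1 (loop structure only).
    # The helpers passed in stand for the paper's L_PPO and L_DA optimizations;
    # their signatures here are assumptions.
    import copy

    def inda_training(agent, env, augment, collect_rollout, ppo_update,
                      distill_update, N, I, S, T):
        """Run N PPO phases; between phases S and T, distill every I-th phase."""
        for n in range(1, N + 1):
            # RL training: store sampled transitions to D, optimize L_PPO(theta)
            D = collect_rollout(agent, env)
            ppo_update(agent, D)
            # Distillation toward augmentation-invariant features
            if S <= n <= T and (n - 1) % I == 0:
                agent_old = copy.deepcopy(agent)              # theta_old <- theta
                distill_update(agent, agent_old, D, augment)  # minimize L_DA(theta)
        return agent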
Open Source Code | Yes | https://github.com/kbc6723/es-da
Open Datasets | Yes | We use the OpenAI Procgen benchmark of 16 video games [5]
Dataset Splits | No | We simplify easy mode and train agents in easybg mode, of which the only difference from easy mode [5] is showing only a single background. ... Then, we evaluate generalization capabilities using two modes: test-bg and test-lv, which contain unseen backgrounds and levels, respectively, in addition to easybg mode that we use for training. The paper describes training on 'easybg' mode and evaluating on 'test-bg' and 'test-lv' modes, which serve as test sets for generalization. It does not explicitly mention a separate validation split for hyperparameter tuning or model selection in the traditional supervised-learning sense.
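
The unseen-level evaluation (the spirit of test-lv) corresponds to the standard Procgen protocol of training on a fixed set of level seeds and testing on the full level distribution; the easybg and test-bg modes are the authors' customizations with no public Procgen flag. A sketch under those assumptions:

    # Sketch (assumption): standard Procgen level split; easybg / test-bg are
    # the authors' custom modes and have no corresponding built-in option.
    import gym

    game = "procgen:procgen-maze-v0"
    train_env = gym.make(game, num_levels=200, start_level=0,
                         distribution_mode="easy")    # fixed 200 training levels
    test_lv_env = gym.make(game, num_levels=0, start_level=0,
                           distribution_mode="easy")  # full distribution, mostly unseen levels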
Hardware Specification | No | The main paper text does not specify hardware details such as GPU/CPU models or the specific compute resources used for the experiments. The ethics checklist states, 'We explain about training time in the supplementary material,' implying these details are not in the main body.
Software Dependencies | No | The paper mentions using 'Proximal Policy Optimization (PPO) [27] as a baseline' but does not specify software versions for PPO, other libraries, or the programming languages used.
Experiment Setup | No | The paper describes the experimental setup in terms of methods (InDA, ExDA, UCB-ExDA), tasks, and augmentations. It refers to hyperparameters such as N, I, S, T, and M in its algorithms and notes that 'c is the UCB exploration coefficient', but explicitly states, 'We refer to the supplementary material for the hyperparameter choice,' indicating that specific numerical values are not provided in the main text.
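
For the UCB exploration coefficient c mentioned above, a generic sketch of UCB-style augmentation selection (the bandit rule underlying UCB-ExDA); the return-based reward signal and variable names are assumptions rather than the paper's exact formulation:

    # Sketch (assumption): generic UCB rule for choosing an augmentation per phase.
    # Using episodic return as the bandit reward is illustrative only; the paper's
    # exact hyperparameters are in its supplementary material.
    import math

    def select_augmentation(counts, mean_returns, t, c):
        """Pick the arm maximizing mean return + c * sqrt(log t / count)."""
        best_aug, best_score = None, float("-inf")
        for aug, count in counts.items():
            if count == 0:
                return aug  # try every augmentation at least once
            score = mean_returns[aug] + c * math.sqrt(math.log(t) / count)
            if score > best_score:
                best_aug, best_score = aug, score
        return best_aug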