Hybrid Reinforcement Learning with Expert State Sequences
Authors: Xiaoxiao Guo, Shiyu Chang, Mo Yu, Gerald Tesauro, Murray Campbell
AAAI 2019, pp. 3739-3746 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluated our hybrid approach on an illustrative domain and Atari games. The empirical results show that (1) the agents are able to leverage state expert sequences to learn faster than pure reinforcement learning baselines, (2) our tensor-based action inference model is advantageous compared to standard deep neural networks in inferring expert actions, and (3) the hybrid policy optimization objective is robust against noise in expert state sequences. |
| Researcher Affiliation | Industry | Xiaoxiao Guo, Shiyu Chang, Mo Yu, Gerald Tesauro, Murray Campbell IBM Research AI {xiaoxiao.guo, shiyu.chang}@ibm.com, {yum, gtesauro, mcam}@us.ibm.com |
| Pseudocode | No | The paper describes methods in text and flowcharts (Figure 1), but does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology it describes. It only references OpenAI Baselines (Dhariwal et al. 2017), which is a third-party library, not the authors' own implementation. |
| Open Datasets | Yes | We evaluate our proposed method on the Taxi domain (Dietterich 1998) and eight Atari games from OpenAI Gym (Brockman et al. 2016). |
| Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) needed to reproduce the data partitioning. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions software components like OpenAI Gym and Advantage Actor-Critic (A2C) but does not provide specific version numbers for these or any other ancillary software dependencies. |
| Experiment Setup | Yes | A2C uses a two-step forward estimation for the advantage function. The state is represented as a one-hot vector of length 500 for both the actor and critic. The action inference model first projects current states and next states to vectors of length 128. The matrices Mr and Nr of the action inference model are of size 128 x 128. The action inference model has rank 2. [...] The last four images are stacked along the channel dimension and rescaled to 84 x 84 as state input. The state encoding function is a four-layer convolutional neural network. The first hidden layer convolves 32 filters of 8 x 8 with stride 4. The second layer convolves 64 filters of 4 x 4 with stride 2. The third layer convolves 32 filters of 3 x 3 with stride 1. The last layer of the state encoding function is fully connected and consists of 512 output units. Each layer is followed by a ReLU nonlinearity. The matrices {Mr, Nr}r are all 128 x 128. The rank is set to 8. We use pre-trained A2C agents with 5 million frames to generate 100 trajectories as demonstration state sequences. [...] we only sample from the first K = 10 time steps of each trajectory at the beginning of learning. We gradually increase K by 1 approximately every 8,000 frames. [...] we only use the inferred actions after 100,000 frames to optimize the agent's policies, when the action prediction model becomes reliable. (A hedged code sketch based on these quoted dimensions follows the table.) |
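The Experiment Setup row quotes the key architectural dimensions for the Atari experiments. The sketch below reconstructs a plausible state encoder and low-rank action-inference head from those numbers alone. It is written in PyTorch as an assumption (the paper builds on OpenAI Baselines); the 512-to-128 projection and the way the rank-R bilinear terms are combined into action logits are illustrative guesses, not the authors' exact formulation.

```python
# Hedged sketch: Atari state encoder and a low-rank bilinear action-inference
# head matching the dimensions quoted in the Experiment Setup row.
import torch
import torch.nn as nn


class StateEncoder(nn.Module):
    """Four-layer encoder quoted in the paper: three conv layers + FC-512."""
    def __init__(self, in_channels=4, feat_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 32, kernel_size=3, stride=1), nn.ReLU(),
        )
        # 84x84 input -> 20x20 -> 9x9 -> 7x7 spatial maps with 32 channels.
        self.fc = nn.Sequential(nn.Linear(32 * 7 * 7, feat_dim), nn.ReLU())

    def forward(self, x):                        # x: (B, 4, 84, 84), floats in [0, 1]
        h = self.conv(x)
        return self.fc(h.flatten(start_dim=1))   # (B, 512)


class LowRankActionInference(nn.Module):
    """Scores actions from consecutive state encodings with rank-R bilinear terms."""
    def __init__(self, feat_dim=512, proj_dim=128, rank=8, num_actions=18):
        super().__init__()
        # Projecting encoder features down to 128 is our assumption; the quoted
        # setup only states that M_r and N_r are 128 x 128.
        self.proj_s = nn.Linear(feat_dim, proj_dim)
        self.proj_next = nn.Linear(feat_dim, proj_dim)
        self.M = nn.Parameter(torch.randn(rank, proj_dim, proj_dim) * 0.01)
        self.N = nn.Parameter(torch.randn(rank, proj_dim, proj_dim) * 0.01)
        # Mapping the pooled rank components to action logits is an assumption.
        self.out = nn.Linear(rank * proj_dim, num_actions)

    def forward(self, phi_s, phi_next):
        u = self.proj_s(phi_s)        # (B, 128) encoding of s_t
        v = self.proj_next(phi_next)  # (B, 128) encoding of s_{t+1}
        # For each rank r: (M_r u) elementwise-multiplied with (N_r v).
        mu = torch.einsum('rij,bj->bri', self.M, u)   # (B, R, 128)
        nv = torch.einsum('rij,bj->bri', self.N, v)   # (B, R, 128)
        z = (mu * nv).flatten(start_dim=1)            # (B, R * 128)
        return self.out(z)                            # (B, num_actions) logits


if __name__ == "__main__":
    enc, head = StateEncoder(), LowRankActionInference()
    s, s_next = torch.rand(2, 4, 84, 84), torch.rand(2, 4, 84, 84)
    print(head(enc(s), enc(s_next)).shape)  # torch.Size([2, 18])
```

The curriculum described in the last quoted sentences (sample only the first K = 10 steps of each expert trajectory, grow K by 1 roughly every 8,000 frames, and use inferred actions only after 100,000 frames) can be captured by two small helpers; the function and parameter names here are hypothetical.

```python
def curriculum_window(frames_seen, k_init=10, frames_per_increment=8_000):
    """Number of leading time steps of each expert trajectory to sample from."""
    return k_init + frames_seen // frames_per_increment


def use_inferred_actions(frames_seen, warmup_frames=100_000):
    """Only add the inferred-action term to the policy objective after warm-up."""
    return frames_seen >= warmup_frames
```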