Hybrid Reinforcement Learning with Expert State Sequences
Authors: Xiaoxiao Guo, Shiyu Chang, Mo Yu, Gerald Tesauro, Murray Campbell
AAAI 2019, pp. 3739-3746 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluated our hybrid approach on an illustrative domain and Atari games. The empirical results show that (1) the agents are able to leverage state expert sequences to learn faster than pure reinforcement learning baselines, (2) our tensor-based action inference model is advantageous compared to standard deep neural networks in inferring expert actions, and (3) the hybrid policy optimization objective is robust against noise in expert state sequences. |
| Researcher Affiliation | Industry | Xiaoxiao Guo, Shiyu Chang, Mo Yu, Gerald Tesauro, Murray Campbell IBM Research AI {xiaoxiao.guo, shiyu.chang}@ibm.com, {yum, gtesauro, mcam}@us.ibm.com |
| Pseudocode | No | The paper describes methods in text and flowcharts (Figure 1), but does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology it describes. It only references OpenAI Baselines (Dhariwal et al. 2017), which is a third-party library, not the authors' own implementation. |
| Open Datasets | Yes | We evaluate our proposed method on the Taxi domain (Dietterich 1998) and eight Atari games from OpenAI Gym (Brockman et al. 2016). |
| Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) needed to reproduce the data partitioning. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions software components like OpenAI Gym and Advantage Actor-Critic (A2C) but does not provide specific version numbers for these or any other ancillary software dependencies. |
| Experiment Setup | Yes | A2C uses a two-step forward estimation for the advantage function. The state is represented as a one-hot vector of length 500 for both the actor and critic. The action inference model first projects current states and next states to vectors of length 128. The matrices Mr and Nr of the action inference model are of size 128 x 128. The action inference model has rank 2. [...] The last four images are stacked along the channel dimension and rescaled to 84 x 84 as state input. The state encoding function is a four-layer convolutional neural network. The first hidden layer convolves 32 filters of 8 x 8 with stride 4. The second layer convolves 64 filters of 4 x 4 with stride 2. The third layer convolves 32 filters of 3 x 3 with stride 1. The last layer of the state encoding function is fully connected and consists of 512 output units. Each layer is followed by a ReLU nonlinearity. The matrices {Mr, Nr}r are all 128 x 128. The rank is set to 8. We use pre-trained A2C agents with 5 million frames to generate 100 trajectories as demonstration state sequences. [...] we only sample from the first K = 10 time steps of each trajectory at the beginning of learning. We gradually increase K by 1 approximately every 8,000 frames. [...] we only use the inferred actions after 100,000 frames to optimize the agent's policies, when the action prediction model becomes reliable. (A hedged code sketch based on these quoted dimensions follows the table.) |
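The Experiment Setup row quotes the key architectural dimensions for the Atari experiments. The sketch below reconstructs a plausible state encoder and low-rank action-inference head from those numbers alone. It is written in PyTorch as an assumption (the paper builds on OpenAI Baselines); the 512-to-128 projection and the way the rank-R bilinear terms are combined into action logits are illustrative guesses, not the authors' exact formulation.

```python
# Hedged sketch: Atari state encoder and a low-rank bilinear action-inference
# head matching the dimensions quoted in the Experiment Setup row.
import torch
import torch.nn as nn


class StateEncoder(nn.Module):
    """Four-layer encoder quoted in the paper: three conv layers + FC-512."""
    def __init__(self, in_channels=4, feat_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 32, kernel_size=3, stride=1), nn.ReLU(),
        )
        # 84x84 input -> 20x20 -> 9x9 -> 7x7 spatial maps with 32 channels.
        self.fc = nn.Sequential(nn.Linear(32 * 7 * 7, feat_dim), nn.ReLU())

    def forward(self, x):                        # x: (B, 4, 84, 84), floats in [0, 1]
        h = self.conv(x)
        return self.fc(h.flatten(start_dim=1))   # (B, 512)


class LowRankActionInference(nn.Module):
    """Scores actions from consecutive state encodings with rank-R bilinear terms."""
    def __init__(self, feat_dim=512, proj_dim=128, rank=8, num_actions=18):
        super().__init__()
        # Projecting encoder features down to 128 is our assumption; the quoted
        # setup only states that M_r and N_r are 128 x 128.
        self.proj_s = nn.Linear(feat_dim, proj_dim)
        self.proj_next = nn.Linear(feat_dim, proj_dim)
        self.M = nn.Parameter(torch.randn(rank, proj_dim, proj_dim) * 0.01)
        self.N = nn.Parameter(torch.randn(rank, proj_dim, proj_dim) * 0.01)
        # Mapping the pooled rank components to action logits is an assumption.
        self.out = nn.Linear(rank * proj_dim, num_actions)

    def forward(self, phi_s, phi_next):
        u = self.proj_s(phi_s)        # (B, 128) encoding of s_t
        v = self.proj_next(phi_next)  # (B, 128) encoding of s_{t+1}
        # For each rank r: (M_r u) elementwise-multiplied with (N_r v).
        mu = torch.einsum('rij,bj->bri', self.M, u)   # (B, R, 128)
        nv = torch.einsum('rij,bj->bri', self.N, v)   # (B, R, 128)
        z = (mu * nv).flatten(start_dim=1)            # (B, R * 128)
        return self.out(z)                            # (B, num_actions) logits


if __name__ == "__main__":
    enc, head = StateEncoder(), LowRankActionInference()
    s, s_next = torch.rand(2, 4, 84, 84), torch.rand(2, 4, 84, 84)
    print(head(enc(s), enc(s_next)).shape)  # torch.Size([2, 18])
```

The curriculum described in the last quoted sentences (sample only the first K = 10 steps of each expert trajectory, grow K by 1 roughly every 8,000 frames, and use inferred actions only after 100,000 frames) can be captured by two small helpers; the function and parameter names here are hypothetical.

```python
def curriculum_window(frames_seen, k_init=10, frames_per_increment=8_000):
    """Number of leading time steps of each expert trajectory to sample from."""
    return k_init + frames_seen // frames_per_increment


def use_inferred_actions(frames_seen, warmup_frames=100_000):
    """Only add the inferred-action term to the policy objective after warm-up."""
    return frames_seen >= warmup_frames
```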