Scheduled Policy Optimization for Natural Language Communication with Intelligent Agents
Authors: Wenhan Xiong, Xiaoxiao Guo, Mo Yu, Shiyu Chang, Bowen Zhou, William Yang Wang
IJCAI 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our scheduled policy optimization method on the Blocks environment originally created by Bisk et al. [2016]. There are 20 unique blocks in the environment, and the goal of the agent is to accomplish tasks described in natural language by moving blocks on the 2D map. |
| Researcher Affiliation | Collaboration | Wenhan Xiong (1), Xiaoxiao Guo (2), Mo Yu (2), Shiyu Chang (2), Bowen Zhou (3), William Yang Wang (1); (1) University of California, Santa Barbara; (2) IBM Research; (3) JD AI Research |
| Pseudocode | Yes | Algorithm 1: Scheduled Policy Optimization Algorithm (a hedged sketch of the scheduling loop appears after this table). |
| Open Source Code | Yes | Code and trained models can be found at https://github.com/xwhan/walk_the_blocks. |
| Open Datasets | Yes | We evaluate our scheduled policy optimization method on the Blocks environment originally created by Bisk et al. [2016]. |
| Dataset Splits | Yes | The dataset consists of 11,871 training samples and 1,179/3,177 samples for validation/testing. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used, such as GPU/CPU models or memory specifications. |
| Software Dependencies | No | The paper mentions 'Adam optimizer' and 'PPO' but does not specify version numbers for any software libraries or dependencies, such as Python, PyTorch, or TensorFlow versions. |
| Experiment Setup | Yes | The initial learning rate is 0.0001 and is divided by 2 every 4 epochs. The windowed history consists of the execution errors of the last 100 trials. The clipping interval of PPO is set to [0.95, 1.05] and the number of PPO epochs for each update step is set to 4. The number of training epochs is restricted to fewer than 20, with early stopping applied using the Dev set. (A configuration sketch reconstructing these settings appears after this table.) |
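
The Pseudocode row refers to Algorithm 1 in the paper. As a rough, non-authoritative illustration of the scheduling idea, the sketch below alternates between oracle (supervised) updates and the agent's own RL updates, invoking the oracle with probability equal to the failure rate over a windowed history of recent trials. The `adaptive_schedule` function and the toy agent that follows are hypothetical stand-ins, not the authors' implementation.

```python
import random
from collections import deque

def adaptive_schedule(errors: deque) -> bool:
    """Decide whether to learn from the oracle this episode.

    Hypothetical reading of the adaptive scheduler: the probability of
    invoking the oracle equals the failure rate over the windowed history
    (the last 100 trials in the paper's setup).
    """
    failure_rate = sum(errors) / len(errors) if errors else 1.0
    return random.random() < failure_rate

# Toy demonstration: an "agent" whose success probability improves a little
# each time it receives an oracle demonstration.
if __name__ == "__main__":
    errors = deque(maxlen=100)  # execution errors of the last 100 trials
    success_prob = 0.05
    for episode in range(2000):
        if adaptive_schedule(errors):
            success_prob = min(1.0, success_prob + 0.001)  # supervised update
        # else: an on-policy RL (PPO) update would go here
        failed = random.random() > success_prob
        errors.append(1 if failed else 0)
    print(f"final success probability: {success_prob:.2f}")
```

As the agent improves, the windowed failure rate drops and the scheduler hands control back to pure policy optimization; early in training, frequent failures keep the oracle in the loop.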
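
Similarly, the quoted Experiment Setup values map onto a compact PyTorch training skeleton. This is a minimal sketch under assumptions: the `nn.Linear` policy, the dummy rollout batches, the placeholder Dev-set score, and the early-stopping patience of 3 are invented for illustration; only the learning-rate schedule, the [0.95, 1.05] PPO clipping interval (epsilon = 0.05), the 4 PPO epochs per update, and the 20-epoch cap come from the paper.

```python
import torch
from torch import nn
from torch.optim import Adam
from torch.optim.lr_scheduler import StepLR

CLIP_EPS = 0.05   # PPO ratio clipped to [0.95, 1.05]
PPO_EPOCHS = 4    # PPO epochs per update step
MAX_EPOCHS = 20   # training restricted to fewer than 20 epochs

policy = nn.Linear(16, 4)                              # stand-in policy network
optimizer = Adam(policy.parameters(), lr=1e-4)         # initial LR 0.0001
scheduler = StepLR(optimizer, step_size=4, gamma=0.5)  # halve LR every 4 epochs

def ppo_clip_loss(log_probs, old_log_probs, advantages):
    """Standard PPO clipped surrogate objective."""
    ratio = torch.exp(log_probs - old_log_probs)
    clipped = torch.clamp(ratio, 1.0 - CLIP_EPS, 1.0 + CLIP_EPS)
    return -torch.min(ratio * advantages, clipped * advantages).mean()

best_dev, patience, bad_epochs = float("-inf"), 3, 0   # patience is a guess
for epoch in range(MAX_EPOCHS):
    for _ in range(PPO_EPOCHS):                # several PPO passes per update
        logits = policy(torch.randn(8, 16))    # dummy rollout batch
        dist = torch.distributions.Categorical(logits=logits)
        actions = dist.sample()
        logp = dist.log_prob(actions)
        loss = ppo_clip_loss(logp, logp.detach(), torch.randn(8))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
    dev_score = 0.0                            # placeholder Dev-set evaluation
    if dev_score > best_dev:
        best_dev, bad_epochs = dev_score, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:             # early stopping on the Dev set
            break
```

Note that [0.95, 1.05] is a much tighter clipping interval than the common PPO default of [0.8, 1.2] (epsilon = 0.2).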