Scheduled Policy Optimization for Natural Language Communication with Intelligent Agents
Authors: Wenhan Xiong, Xiaoxiao Guo, Mo Yu, Shiyu Chang, Bowen Zhou, William Yang Wang
IJCAI 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our scheduled policy optimization method on the Blocks environment originally created by Bisk et al. [2016]. There are 20 unique blocks in the environment, and the goal of the agent is to accomplish tasks described in natural language by moving blocks on the 2D map. |
| Researcher Affiliation | Collaboration | Wenhan Xiong (1), Xiaoxiao Guo (2), Mo Yu (2), Shiyu Chang (2), Bowen Zhou (3), William Yang Wang (1); (1) University of California, Santa Barbara; (2) IBM Research; (3) JD AI Research |
| Pseudocode | Yes | Algorithm 1: Scheduled Policy Optimization Algorithm (a hedged sketch of the scheduling loop appears after this table). |
| Open Source Code | Yes | Code and trained models can be found at https://github.com/xwhan/walk_the_blocks. |
| Open Datasets | Yes | We evaluate our scheduled policy optimization method on the Blocks environment originally created by Bisk et al. [2016]. |
| Dataset Splits | Yes | The dataset consists of 11,871 training samples and 1,179/3,177 samples for validation/testing. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used, such as GPU/CPU models or memory specifications. |
| Software Dependencies | No | The paper mentions 'Adam optimizer' and 'PPO' but does not specify version numbers for any software libraries or dependencies, such as Python, PyTorch, or TensorFlow versions. |
| Experiment Setup | Yes | The initial learning rate is 0.0001 and is divided by 2 every 4 epochs. The windowed history consists of the execution errors of the last 100 trials. The clipping interval of PPO is set to [0.95, 1.05] and the number of PPO epochs for each update step is set to 4. The number of training epochs is restricted to fewer than 20, with early stopping applied using the Dev set. (A configuration sketch reconstructing these settings appears after this table.) |
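
The Pseudocode row refers to Algorithm 1 in the paper. As a rough, non-authoritative illustration of the scheduling idea, the sketch below alternates between oracle (supervised) updates and the agent's own RL updates, invoking the oracle with probability equal to the failure rate over a windowed history of recent trials. The `adaptive_schedule` function and the toy agent that follows are hypothetical stand-ins, not the authors' implementation.

```python
import random
from collections import deque

def adaptive_schedule(errors: deque) -> bool:
    """Decide whether to learn from the oracle this episode.

    Hypothetical reading of the adaptive scheduler: the probability of
    invoking the oracle equals the failure rate over the windowed history
    (the last 100 trials in the paper's setup).
    """
    failure_rate = sum(errors) / len(errors) if errors else 1.0
    return random.random() < failure_rate

# Toy demonstration: an "agent" whose success probability improves a little
# each time it receives an oracle demonstration.
if __name__ == "__main__":
    errors = deque(maxlen=100)  # execution errors of the last 100 trials
    success_prob = 0.05
    for episode in range(2000):
        if adaptive_schedule(errors):
            success_prob = min(1.0, success_prob + 0.001)  # supervised update
        # else: an on-policy RL (PPO) update would go here
        failed = random.random() > success_prob
        errors.append(1 if failed else 0)
    print(f"final success probability: {success_prob:.2f}")
```

As the agent improves, the windowed failure rate drops and the scheduler hands control back to pure policy optimization; early in training, frequent failures keep the oracle in the loop.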
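
Similarly, the quoted Experiment Setup values map onto a compact PyTorch training skeleton. This is a minimal sketch under assumptions: the `nn.Linear` policy, the dummy rollout batches, the placeholder Dev-set score, and the early-stopping patience of 3 are invented for illustration; only the learning-rate schedule, the [0.95, 1.05] PPO clipping interval (epsilon = 0.05), the 4 PPO epochs per update, and the 20-epoch cap come from the paper.

```python
import torch
from torch import nn
from torch.optim import Adam
from torch.optim.lr_scheduler import StepLR

CLIP_EPS = 0.05   # PPO ratio clipped to [0.95, 1.05]
PPO_EPOCHS = 4    # PPO epochs per update step
MAX_EPOCHS = 20   # training restricted to fewer than 20 epochs

policy = nn.Linear(16, 4)                              # stand-in policy network
optimizer = Adam(policy.parameters(), lr=1e-4)         # initial LR 0.0001
scheduler = StepLR(optimizer, step_size=4, gamma=0.5)  # halve LR every 4 epochs

def ppo_clip_loss(log_probs, old_log_probs, advantages):
    """Standard PPO clipped surrogate objective."""
    ratio = torch.exp(log_probs - old_log_probs)
    clipped = torch.clamp(ratio, 1.0 - CLIP_EPS, 1.0 + CLIP_EPS)
    return -torch.min(ratio * advantages, clipped * advantages).mean()

best_dev, patience, bad_epochs = float("-inf"), 3, 0   # patience is a guess
for epoch in range(MAX_EPOCHS):
    for _ in range(PPO_EPOCHS):                # several PPO passes per update
        logits = policy(torch.randn(8, 16))    # dummy rollout batch
        dist = torch.distributions.Categorical(logits=logits)
        actions = dist.sample()
        logp = dist.log_prob(actions)
        loss = ppo_clip_loss(logp, logp.detach(), torch.randn(8))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
    dev_score = 0.0                            # placeholder Dev-set evaluation
    if dev_score > best_dev:
        best_dev, bad_epochs = dev_score, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:             # early stopping on the Dev set
            break
```

Note that [0.95, 1.05] is a much tighter clipping interval than the common PPO default of [0.8, 1.2] (epsilon = 0.2).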