ScriptWorld: Text Based Environment for Learning Procedural Knowledge
Authors: Abhinav Joshi, Areeb Ahmad, Umang Pandey, Ashutosh Modi
IJCAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide gaming environments for 10 daily activities and perform a detailed analysis of the proposed environment. We develop RL-based baseline models/agents to play the games in ScriptWorld. To understand the role of language models in such environments, we leverage features obtained from pre-trained language models in the RL agents. Our experiments show that prior knowledge obtained from a pre-trained language model helps to solve real-world text-based gaming environments. |
| Researcher Affiliation | Academia | Abhinav Joshi , Areeb Ahmad , Umang Pandey , Ashutosh Modi Indian Institute of Technology Kanpur (IIT-K) {ajoshi, ashutoshm}@cse.iitk.ac.in, {areeb, umangp}@iitk.ac.in |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We perform a detailed analysis of the proposed environment and release the environment and agents: https://github.com/Exploration-Lab/ScriptWorld. |
| Open Datasets | Yes | Given the nature of Script knowledge, we use a scripts corpus referred to as DeScript [Wanzare et al., 2016] for creating the ScriptWorld environment. |
| Dataset Splits | No | The paper does not provide specific training/validation/test dataset splits with percentages, sample counts, or citations to predefined splits. It describes agents learning through interaction with an environment rather than training on static dataset splits. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions software like Python, GPT2, SBERT, DQN, A2C, PPO, and RPPO, but it does not provide specific version numbers for these software components, which are required for reproducibility. |
| Experiment Setup | Yes | Rewards (Performance Scores): For all the scenarios, every incorrect action choice results in a negative reward of -1, and every correct choice returns a 0 reward. For task completion, the agent gets a reward of 10, i.e., a player gets a maximum reward of 10 at the end of each game if they choose a correct sequence of actions. The choice of zero rewards for correct actions helps RL algorithms explore multiple correct ways of performing a task, capturing the generalized procedural knowledge required for a specific task. The game terminates when an agent chooses 5 successive wrong actions. In our experiments, agents played with the environment with a back-hop distance of 1. We consider two settings in a game. 1) Number of choices: At each step, the number of choices presented to an agent can be changed (1 correct choice and the rest all incorrect). 2) Number of backward hops for wrong actions. |
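
The reward and termination scheme quoted above translates directly into simple episode bookkeeping. The sketch below is a minimal, hypothetical illustration of that scheme only (per-step rewards of -1/0, a +10 completion bonus, termination after 5 successive wrong actions, and a configurable back-hop); the class and method names are illustrative and are not the actual ScriptWorld API released by the authors.

```python
# Minimal sketch of the reward/termination logic described in the
# Experiment Setup row. Names are illustrative, not the ScriptWorld API.


class RewardSketch:
    """Per-episode bookkeeping under the paper's reward scheme:
    -1 per wrong choice, 0 per correct choice, +10 on task completion,
    termination after 5 successive wrong actions."""

    MAX_WRONG_STREAK = 5   # game terminates after 5 successive wrong actions
    COMPLETION_BONUS = 10  # reward granted when the full task is completed

    def __init__(self, back_hop: int = 1):
        self.back_hop = back_hop   # steps to move back on a wrong action
        self.wrong_streak = 0      # consecutive wrong actions so far
        self.position = 0          # index into the correct event sequence

    def step(self, chose_correct: bool, task_length: int):
        """Return (reward, done) for one action choice."""
        if chose_correct:
            self.wrong_streak = 0
            self.position += 1
            if self.position >= task_length:
                return self.COMPLETION_BONUS, True  # task completed
            return 0, False                         # correct step, zero reward
        # Wrong choice: -1 reward, hop backwards, and count the streak.
        self.wrong_streak += 1
        self.position = max(0, self.position - self.back_hop)
        done = self.wrong_streak >= self.MAX_WRONG_STREAK
        return -1, done
```

Under this scheme a perfect episode accumulates exactly the +10 completion bonus (all intermediate correct steps contribute 0), which matches the paper's statement that a player gets a maximum reward of 10 at the end of each game.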