Conceptual Reinforcement Learning for Language-Conditioned Tasks
Authors: Shaohui Peng, Xing Hu, Rui Zhang, Jiaming Guo, Qi Yi, Ruizhi Chen, Zidong Du, Ling Li, Qi Guo, Yunji Chen
AAAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | As verified in two challenging environments, RTFM and Messenger, CRL significantly improves training efficiency (by up to 70%) and generalization ability (by up to 30%) under new environment dynamics. To verify the performance of CRL, we evaluate the framework on two challenging benchmarks, RTFM and Messenger. |
| Researcher Affiliation | Collaboration | Shaohui Peng (1,2,3), Xing Hu (1), Rui Zhang (1,3), Jiaming Guo (1,2,3), Qi Yi (1,3,4), Ruizhi Chen (2,5), Zidong Du (1,3), Ling Li (2,5), Qi Guo (1), Yunji Chen (1,2). Affiliations: 1 SKL of Processors, Institute of Computing Technology, CAS; 2 University of Chinese Academy of Sciences; 3 Cambricon Technologies; 4 University of Science and Technology of China; 5 SKL of Computer Science, Institute of Software, CAS |
| Pseudocode | No | The paper describes the model architecture and mathematical formulations, but it does not include a distinct pseudocode block or algorithm section. |
| Open Source Code | No | The paper does not contain any explicit statements or links indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | We verify CRL on two challenging environments with textual descriptions, RTFM (Zhong et al. 2020) and Messenger (Hanjie et al. 2021), both of which are benchmarks to evaluate the generalization ability of language-conditioned policies to new environment dynamics. |
| Dataset Splits | Yes | RTFM has a training set of environment dynamics (including entities and role assignments) and an independent and identically distributed (i.i.d.) held-out test set. Messenger offers three difficulty stages: message acquiring or delivering only (S1), both acquiring and delivering (S2), and added decoy entities and irrelevant descriptions (S3). |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU, GPU models, memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions several components like MLP, GRU, CLUB, and Deep VIB but does not specify the version numbers of any software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages used. |
| Experiment Setup | Yes | Table 1: RTFM results in different settings. All results are obtained from 5 random seeds. $\mathcal{L}_{\mathrm{CRL}}(\theta) = \mathcal{L}_{\mathrm{RL}}(\theta) + \alpha_1 \mathcal{L}_{\mathrm{CLUB}}(\theta) + \alpha_2 \mathcal{L}_{\mathrm{VIB}}(\theta)$, where $\mathcal{L}_{\mathrm{RL}}(\theta)$ is the original RL objective and the coefficients $\alpha_1, \alpha_2$ are hyperparameters (details are in Appendix B). A hedged sketch of how these terms combine is given below the table. The details of the environment and the CRL implementation are shown in Appendix A and B. |
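
The training objective quoted above is a weighted sum of the RL loss and two regularizers (CLUB and VIB). Below is a minimal sketch of how such a combined loss could be assembled, assuming PyTorch; the function name `crl_loss`, the way the individual terms are computed, and the coefficient values are illustrative placeholders, since the paper defers its implementation details to Appendix B.

```python
# Sketch of the combined CRL objective:
#   L_CRL(theta) = L_RL(theta) + alpha_1 * L_CLUB(theta) + alpha_2 * L_VIB(theta)
# PyTorch is assumed; the individual loss terms are computed elsewhere in the
# training loop and passed in as tensors. Coefficient defaults are placeholders.
import torch


def crl_loss(rl_loss: torch.Tensor,
             club_loss: torch.Tensor,
             vib_loss: torch.Tensor,
             alpha_1: float = 0.1,
             alpha_2: float = 0.01) -> torch.Tensor:
    """Combine the RL objective with the CLUB and VIB regularizer terms."""
    return rl_loss + alpha_1 * club_loss + alpha_2 * vib_loss


# Hypothetical usage inside a training step:
#   loss = crl_loss(rl_loss, club_loss, vib_loss)
#   optimizer.zero_grad()
#   loss.backward()
#   optimizer.step()
```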