End-to-End Neuro-Symbolic Reinforcement Learning with Textual Explanations
Authors: Lirui Luo, Guoxi Zhang, Hongming Xu, Yaodong Yang, Cong Fang, Qing Li
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We verify the efficacy of our approach on nine Atari tasks and present GPT-generated explanations for policies and decisions. |
| Researcher Affiliation | Academia | (1) School of Intelligence Science and Technology, Peking University; (2) State Key Laboratory of General Artificial Intelligence, BIGAI. |
| Pseudocode | Yes | Algorithm 1 Procedures for learning symbolic policy |
| Open Source Code | No | The paper mentions a project page 'ins-rl.github.io' but does not explicitly state that the source code for the methodology is released or available at this link. The criteria require an unambiguous statement of code release or a direct link to a code repository. |
| Open Datasets | No | For each task, we first roll out about 10,000 frames using pre-trained neural agents and then extract the object bounding boxes...to form a frame-symbol dataset Dsymbol. ... Conforming to the standard supervised training approach, the dataset is divided into training and test sets in an 80:20 ratio. The paper describes generating its own dataset (Dsymbol) from environment interactions; although it uses well-known environments (Atari), the generated dataset itself is not stated to be publicly available or linked for access (see the dataset sketch after this table). |
| Dataset Splits | Yes | Conforming to the standard supervised training approach, the dataset is divided into training and test sets in an 80:20 ratio. |
| Hardware Specification | Yes | The hardware setup comprised an AMD Ryzen 9 5950X 16-Core Processor for CPU, an NVIDIA GeForce RTX 3090 Ti as the graphics card, and 24564 MiB of video memory. |
| Software Dependencies | No | The paper mentions software components such as PPO, the Adam optimizer, GPT-4, FastSAM, and DeAOT, but it does not provide version numbers for these components, which would be required for a reproducible description of the software dependencies. |
| Experiment Setup | Yes | As for hyper-parameters, we use 0.001 for λreg, 2 for λcnn. ... The pretraining stage employed the loss function detailed in Eq. (3), spanning 600 epochs with a batch size of 32. ... The learning rate was established at 3 × 10^-4, with the Adam optimizer. ... PPO with a learning rate of 2.5e-4 uses the collected rewards to optimize the model. The batch size of each update is 1024. (See the configuration sketch after this table.) |
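
The dataset construction quoted in the Open Datasets and Dataset Splits rows amounts to a two-step procedure: roll out roughly 10,000 frames with pre-trained agents, pair each frame with its extracted object symbols, then split 80:20. The sketch below is a minimal illustration of that reading only; `build_symbol_dataset`, `extract_boxes`, and `train_test_split` are hypothetical names and not the paper's code.

```python
import random

def build_symbol_dataset(rollout_frames, extract_boxes):
    """Pair each rolled-out frame with its extracted object bounding boxes."""
    return [(frame, extract_boxes(frame)) for frame in rollout_frames]

def train_test_split(dataset, train_fraction=0.8, seed=0):
    """Shuffle and split the frame-symbol dataset into train/test (80:20)."""
    data = list(dataset)
    random.Random(seed).shuffle(data)
    cut = int(len(data) * train_fraction)
    return data[:cut], data[cut:]
```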
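
Similarly, the hyper-parameters quoted in the Experiment Setup row can be gathered in one place. The configuration sketch below merely restates those values; the dictionary names (`pretrain_cfg`, `ppo_cfg`) are illustrative and do not come from the paper or any released code.

```python
# Pretraining stage (loss in Eq. (3) of the paper).
pretrain_cfg = {
    "lambda_reg": 1e-3,      # λreg, regularization weight
    "lambda_cnn": 2.0,       # λcnn, weight on the CNN-related loss term
    "epochs": 600,
    "batch_size": 32,
    "optimizer": "Adam",
    "learning_rate": 3e-4,
}

# Policy optimization stage.
ppo_cfg = {
    "learning_rate": 2.5e-4,  # PPO learning rate
    "batch_size": 1024,       # samples per PPO update
}
```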