End-to-End Neuro-Symbolic Reinforcement Learning with Textual Explanations

Authors: Lirui Luo, Guoxi Zhang, Hongming Xu, Yaodong Yang, Cong Fang, Qing Li

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We verify the efficacy of our approach on nine Atari tasks and present GPT-generated explanations for policies and decisions."
Researcher Affiliation | Academia | "1 School of Intelligence Science and Technology, Peking University; 2 State Key Laboratory of General Artificial Intelligence, BIGAI."
Pseudocode | Yes | "Algorithm 1: Procedures for learning symbolic policy"
Open Source Code | No | The paper mentions a project page, 'ins-rl.github.io', but does not explicitly state that the source code for the methodology is released or available at this link. The criteria require an unambiguous statement of code release or a direct link to a code repository.
Open Datasets | No | "For each task, we first roll out about 10,000 frames using pre-trained neural agents and then extract the object bounding boxes...to form a frame-symbol dataset Dsymbol. ... Conforming to the standard supervised training approach, the dataset is divided into training and test sets in an 80:20 ratio." The paper describes generating its own dataset (Dsymbol) from environment interactions; although it uses well-known environments (Atari), the generated dataset itself is not stated to be publicly available or linked for access. (A minimal sketch of this pipeline follows the table.)
Dataset Splits | Yes | "Conforming to the standard supervised training approach, the dataset is divided into training and test sets in an 80:20 ratio."
Hardware Specification | Yes | "The hardware setup comprised an AMD Ryzen 9 5950X 16-Core Processor for CPU, an NVIDIA GeForce RTX 3090 Ti as the graphics card, and 24564 MiB of video memory."
Software Dependencies | No | The paper mentions software components such as PPO, the Adam optimizer, GPT-4, FastSAM, and DeAOT, but it does not provide specific version numbers for these components, which are required for a reproducible software dependency description.
Experiment Setup | Yes | "As for hyper-parameters, we use 0.001 for λreg and 2 for λcnn. ... The pretraining stage employed the loss function detailed in Eq. (3), spanning 600 epochs with a batch size of 32. ... The learning rate was established at 3e-4, with the Adam optimizer. ... PPO with a learning rate of 2.5e-4 uses the collected rewards to optimize the model. The batch size of each update is 1024." (A hyper-parameter sketch follows the table.)
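
The dataset referenced in the Open Datasets and Dataset Splits rows is generated rather than downloaded. The following is a minimal sketch, not the authors' released code, of how such a frame-symbol dataset could be assembled and split 80:20. The `act_fn` policy callable and the `extract_boxes_fn` object extractor (standing in for the FastSAM/DeAOT pipeline), as well as any concrete environment id, are assumptions for illustration.

```python
# Hedged sketch: building a frame-symbol dataset (D_symbol) by rolling out a
# pre-trained agent and recording per-frame object bounding boxes, then
# splitting it 80:20. `act_fn` and `extract_boxes_fn` are hypothetical
# callables standing in for the pre-trained neural agent and the
# FastSAM/DeAOT-based object extraction described in the paper.
import random
import gymnasium as gym


def build_frame_symbol_dataset(env_id, act_fn, extract_boxes_fn,
                               num_frames=10_000, seed=0):
    env = gym.make(env_id)
    obs, _ = env.reset(seed=seed)
    dataset = []
    for _ in range(num_frames):
        boxes = extract_boxes_fn(obs)          # object bounding boxes for this frame
        dataset.append({"frame": obs, "boxes": boxes})
        action = act_fn(obs)                   # action from the pre-trained neural agent
        obs, _, terminated, truncated, _ = env.step(action)
        if terminated or truncated:
            obs, _ = env.reset()
    env.close()
    return dataset


def split_80_20(dataset, seed=0):
    # Standard supervised train/test split in an 80:20 ratio.
    rng = random.Random(seed)
    indices = list(range(len(dataset)))
    rng.shuffle(indices)
    cut = int(0.8 * len(indices))
    train = [dataset[i] for i in indices[:cut]]
    test = [dataset[i] for i in indices[cut:]]
    return train, test
```

For example, `build_frame_symbol_dataset("ALE/Pong-v5", agent.act, extractor)` would produce roughly 10,000 labelled frames for one Atari task; the environment id and callables here are placeholders, not the paper's exact interface.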
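
The Experiment Setup row quotes the numeric hyper-parameters directly. The sketch below simply collects those values into one configuration object; the field names and the `make_optimizers` helper are illustrative assumptions, and only the numbers themselves come from the quoted text.

```python
# Hedged sketch: the quoted hyper-parameters gathered into a config object.
# Field names and the helper function are illustrative assumptions; the
# numeric values are the ones reported in the paper's experiment setup.
from dataclasses import dataclass

import torch


@dataclass
class ExperimentConfig:
    lambda_reg: float = 1e-3      # weight of the regularization term (λreg)
    lambda_cnn: float = 2.0       # weight of the CNN term (λcnn)
    pretrain_epochs: int = 600    # pretraining with the loss in Eq. (3)
    pretrain_batch_size: int = 32
    pretrain_lr: float = 3e-4     # Adam, pretraining stage
    ppo_lr: float = 2.5e-4        # PPO fine-tuning
    ppo_batch_size: int = 1024    # samples per PPO update


def make_optimizers(policy: torch.nn.Module, cfg: ExperimentConfig):
    """Illustrative helper: Adam optimizers for the two training stages."""
    pretrain_opt = torch.optim.Adam(policy.parameters(), lr=cfg.pretrain_lr)
    ppo_opt = torch.optim.Adam(policy.parameters(), lr=cfg.ppo_lr)
    return pretrain_opt, ppo_opt
```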