Contrastive Reinforcement Learning of Symbolic Reasoning Domains
Authors: Gabriel Poesia, WenXin Dong, Noah Goodman
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We introduce five environments inspired by the Mathematics Common Core Curriculum, and observe that existing Reinforcement Learning baselines perform poorly. We then present a novel learning algorithm, Contrastive Policy Learning (ConPoLe) that explicitly optimizes the InfoNCE loss, which lower bounds the mutual information between the current state and next states that continue on a path to the solution. ConPoLe successfully solves all four domains. Moreover, problem representations learned by ConPoLe enable accurate prediction of the categories of problems in a real mathematics curriculum. Table 2: Success rate of all agents in the Common Core environments. Agents were run with 3 random seeds for 10^7 environment steps, and tested every 100k steps on a held-out set of 200 problems. (A sketch of the InfoNCE objective appears below the table.) |
| Researcher Affiliation | Academia | Gabriel Poesia Stanford University poesia@cs.stanford.edu WenXin Dong Stanford University wxd@stanford.edu Noah Goodman Stanford University ngoodman@stanford.edu |
| Pseudocode | Yes | Algorithm 1: Contrastive Policy Learning (ConPoLe) |
| Open Source Code | Yes | Code for agents and Common Core environments is available at https://github.com/gpoesia/socratic-tutor. |
| Open Datasets | No | Problems in the equations domain come from a set of 290 syntactic equation templates (with placeholders for constants, which we sample between -10 and 10) extracted from the Cognitive Tutor Algebra [28] dataset. Other environments use generators we describe in the Appendix. To investigate this question, we collected a dataset of equations from the Khan Academy educational portal. While these sources are mentioned, the paper does not provide concrete access information (link, DOI, repository) for the specific datasets the authors generated or collected for their experiments. |
| Dataset Splits | No | The paper mentions training and testing on a held-out set, but does not explicitly provide details for a separate validation set split. |
| Hardware Specification | Yes | Each agent was trained for 10^7 steps in each environment; runs took from 24 to 36 hours on a single NVIDIA Titan Xp GPU. |
| Software Dependencies | No | Our Common Core environments are implemented in Rust, and a simple high-throughput API is available for Python. The paper mentions programming languages but does not provide specific version numbers for software dependencies or libraries used for the experiments. |
| Experiment Setup | Yes | Each agent was trained for 10^7 steps in each environment; runs took from 24 to 36 hours on a single NVIDIA Titan Xp GPU. In our implementation, we simply increase the maximum search depth by 1 every K problems solved, up to a fixed maximum depth. We train ConPoLe for 10^7 steps on a single GPU, with a training beam size of 1000, on cubes scrambled with up to 20 random moves. (A sketch of the depth schedule appears below the table.) |
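
The Research Type row quotes the paper's core idea: ConPoLe optimizes an InfoNCE loss in which the successor state that stays on a solution path is contrasted against off-path successors. A minimal PyTorch-style sketch of such a loss is below; the function name, the bilinear scorer, and all tensor shapes are illustrative assumptions, not the authors' implementation (see Algorithm 1 and the released code for the actual scoring network and negative-sampling scheme).

```python
import torch
import torch.nn.functional as F

def info_nce_loss(score_fn, current_state, positive_next, negative_nexts):
    """InfoNCE-style contrastive loss: the on-path successor competes against
    sampled off-path successors under a learned scoring function. Minimizing
    this cross-entropy lower-bounds the mutual information between the current
    state and its on-path successor."""
    pos_logit = score_fn(current_state, positive_next).reshape(1)                     # (1,)
    neg_logits = torch.stack([score_fn(current_state, n) for n in negative_nexts])    # (k,)
    logits = torch.cat([pos_logit, neg_logits]).unsqueeze(0)                          # (1, k+1)
    target = torch.zeros(1, dtype=torch.long)  # the positive successor sits at index 0
    return F.cross_entropy(logits, target)

# Toy usage with a bilinear scoring function over fixed-size state encodings.
d = 16
W = torch.randn(d, d, requires_grad=True)
score = lambda s, t: s @ W @ t
state, positive, negatives = torch.randn(d), torch.randn(d), torch.randn(8, d)
loss = info_nce_loss(score, state, positive, negatives)
loss.backward()
```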
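The Experiment Setup row also describes a depth curriculum: the maximum search depth grows by one for every K problems solved, up to a fixed cap. A minimal sketch of that schedule is below; the parameter names and default values are hypothetical, not values reported in the paper.

```python
def max_search_depth(problems_solved: int, k: int,
                     initial_depth: int = 1, depth_cap: int = 30) -> int:
    """Depth curriculum: the search-depth limit increases by 1 every k
    problems solved, capped at depth_cap. initial_depth and depth_cap are
    illustrative defaults, not values from the paper."""
    return min(initial_depth + problems_solved // k, depth_cap)

# Example: with k=100, the limit reaches depth 6 after 500 solved problems.
assert max_search_depth(500, k=100) == 6
```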