Contrastive Reinforcement Learning of Symbolic Reasoning Domains
Authors: Gabriel Poesia, WenXin Dong, Noah Goodman
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We introduce five environments inspired by the Mathematics Common Core Curriculum, and observe that existing Reinforcement Learning baselines perform poorly. We then present a novel learning algorithm, Contrastive Policy Learning (ConPoLe) that explicitly optimizes the InfoNCE loss, which lower bounds the mutual information between the current state and next states that continue on a path to the solution. ConPoLe successfully solves all four domains. Moreover, problem representations learned by ConPoLe enable accurate prediction of the categories of problems in a real mathematics curriculum. Table 2: Success rate of all agents in the Common Core environments. Agents were run with 3 random seeds for 10^7 environment steps, and tested every 100k steps on a held-out set of 200 problems. (A sketch of the InfoNCE objective appears below the table.) |
| Researcher Affiliation | Academia | Gabriel Poesia Stanford University poesia@cs.stanford.edu WenXin Dong Stanford University wxd@stanford.edu Noah Goodman Stanford University ngoodman@stanford.edu |
| Pseudocode | Yes | Algorithm 1: Contrastive Policy Learning (ConPoLe) |
| Open Source Code | Yes | Code for agents and Common Core environments is available at https://github.com/gpoesia/socratic-tutor. |
| Open Datasets | No | Problems in the equations domain come from a set of 290 syntactic equation templates (with placeholders for constants, which we sample between -10 and 10) extracted from the Cognitive Tutor Algebra [28] dataset. Other environments use generators we describe in the Appendix. To investigate this question, we collected a dataset of equations from the Khan Academy educational portal. While these sources are mentioned, the paper does not provide concrete access information (link, DOI, repository) for the specific datasets the authors generated or collected for their experiments. |
| Dataset Splits | No | The paper mentions training and testing on a held-out set, but does not explicitly provide details for a separate validation set split. |
| Hardware Specification | Yes | Each agent was trained for 10^7 steps in each environment; runs took from 24 to 36 hours on a single NVIDIA Titan Xp GPU. |
| Software Dependencies | No | Our Common Core environments are implemented in Rust, and a simple high-throughput API is available for Python. The paper mentions programming languages but does not provide specific version numbers for software dependencies or libraries used for the experiments. |
| Experiment Setup | Yes | Each agent was trained for 10^7 steps in each environment; runs took from 24 to 36 hours on a single NVIDIA Titan Xp GPU. In our implementation, we simply increase the maximum search depth by 1 every K problems solved, up to a fixed maximum depth. We train ConPoLe for 10^7 steps on a single GPU, with a training beam size of 1000, on cubes scrambled with up to 20 random moves. (A sketch of the depth schedule appears below the table.) |
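
The Research Type row quotes the paper's core idea: ConPoLe optimizes an InfoNCE loss in which the successor state that stays on a solution path is contrasted against off-path successors. A minimal PyTorch-style sketch of such a loss is below; the function name, the bilinear scorer, and all tensor shapes are illustrative assumptions, not the authors' implementation (see Algorithm 1 and the released code for the actual scoring network and negative-sampling scheme).

```python
import torch
import torch.nn.functional as F

def info_nce_loss(score_fn, current_state, positive_next, negative_nexts):
    """InfoNCE-style contrastive loss: the on-path successor competes against
    sampled off-path successors under a learned scoring function. Minimizing
    this cross-entropy lower-bounds the mutual information between the current
    state and its on-path successor."""
    pos_logit = score_fn(current_state, positive_next).reshape(1)                     # (1,)
    neg_logits = torch.stack([score_fn(current_state, n) for n in negative_nexts])    # (k,)
    logits = torch.cat([pos_logit, neg_logits]).unsqueeze(0)                          # (1, k+1)
    target = torch.zeros(1, dtype=torch.long)  # the positive successor sits at index 0
    return F.cross_entropy(logits, target)

# Toy usage with a bilinear scoring function over fixed-size state encodings.
d = 16
W = torch.randn(d, d, requires_grad=True)
score = lambda s, t: s @ W @ t
state, positive, negatives = torch.randn(d), torch.randn(d), torch.randn(8, d)
loss = info_nce_loss(score, state, positive, negatives)
loss.backward()
```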
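The Experiment Setup row also describes a depth curriculum: the maximum search depth grows by one for every K problems solved, up to a fixed cap. A minimal sketch of that schedule is below; the parameter names and default values are hypothetical, not values reported in the paper.

```python
def max_search_depth(problems_solved: int, k: int,
                     initial_depth: int = 1, depth_cap: int = 30) -> int:
    """Depth curriculum: the search-depth limit increases by 1 every k
    problems solved, capped at depth_cap. initial_depth and depth_cap are
    illustrative defaults, not values from the paper."""
    return min(initial_depth + problems_solved // k, depth_cap)

# Example: with k=100, the limit reaches depth 6 after 500 solved problems.
assert max_search_depth(500, k=100) == 6
```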