Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

System 1.x: Learning to Balance Fast and Slow Planning with Language Models

Authors: Swarnadeep Saha, Archiki Prasad, Justin Chen, Peter Hase, Elias Stengel-Eskin, Mohit Bansal

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments with two diverse planning tasks Maze Navigation and Blocksworld show that our System-1.x Planner outperforms a System-1 Planner, a System-2 Planner trained to approximate A* search, and also a symbolic planner (A* search), given a state exploration budget.
Researcher Affiliation	Academia	Swarnadeep Saha Archiki Prasad Justin Chih-Yao Chen Peter Hase Elias Stengel-Eskin Mohit Bansal UNC Chapel Hill
Pseudocode	Yes	Algorithm 1 Training Data Generation for System-1.x Controller
Open Source Code	Yes	Code available at https://github.com/swarnaHub/System-1.x
Open Datasets	Yes	REPRODUCIBILITY STATEMENT We are making our code and data available in the supplementary material to enable replication of our findings. We randomly generate a balanced dataset of 4K planning problems (split into 3200/400/400 samples) with 5x5 mazes, 40% of the cells containing obstacles, and having optimal plan lengths between 1 to 8. Following the data creation algorithm in Bohnet et al. (2024), we generate problems consisting of 4-7 blocks (without repetition).
Dataset Splits	Yes	We randomly generate a balanced dataset of 4K planning problems (split into 3200/400/400 samples) with 5x5 mazes... From there, we create a train/validation/test split of 3000/250/200 samples where the train and the validation split consist of samples with plan lengths 1-6 and the test split consists of samples with plan lengths 7-10.
Hardware Specification	No	The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running its experiments.
Software Dependencies	Yes	We choose Mistral-7B-Instruct-v0.2 (Jiang et al., 2023) as the base LLM and fine-tune all our components with LoRA (Hu et al., 2021) with a rank of 8 for a maximum of 3 epochs and a batch size of 4, resulting in three adapters for System-1, System-2, and the controller.
Experiment Setup	Yes	We choose Mistral-7B-Instruct-v0.2 (Jiang et al., 2023) as the base LLM and fine-tune all our components with LoRA (Hu et al., 2021) with a rank of 8 for a maximum of 3 epochs and a batch size of 4, resulting in three adapters for System-1, System-2, and the controller.
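
Illustrative sketch (not the authors' code): the Open Datasets row describes randomly generated 5x5 mazes with 40% of the cells containing obstacles and optimal plan lengths between 1 and 8. The Python below is one possible way to sample such a problem, under several assumptions that are not in the source: obstacles are placed uniformly at random, moves are the four grid directions, and the optimal plan length is measured with breadth-first search. The function names and the seed parameter are hypothetical, and the length balancing mentioned in the paper is not implemented here.

```python
import random
from collections import deque


def generate_maze(size=5, obstacle_ratio=0.4, rng=None):
    """Place obstacles uniformly at random, then pick distinct free start/goal cells."""
    rng = rng or random.Random()
    cells = [(r, c) for r in range(size) for c in range(size)]
    n_obstacles = int(round(obstacle_ratio * len(cells)))  # 10 of 25 cells for the defaults
    obstacles = set(rng.sample(cells, n_obstacles))
    free = [cell for cell in cells if cell not in obstacles]
    start, goal = rng.sample(free, 2)
    return obstacles, start, goal


def shortest_plan_length(size, obstacles, start, goal):
    """Breadth-first search over 4-connected moves; returns None if the goal is unreachable."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        (r, c), dist = queue.popleft()
        if (r, c) == goal:
            return dist
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (r + dr, c + dc)
            inside = 0 <= nxt[0] < size and 0 <= nxt[1] < size
            if inside and nxt not in obstacles and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def sample_problem(min_len=1, max_len=8, seed=0):
    """Rejection-sample mazes until the optimal plan length falls in [min_len, max_len]."""
    rng = random.Random(seed)
    while True:
        obstacles, start, goal = generate_maze(rng=rng)
        length = shortest_plan_length(5, obstacles, start, goal)
        if length is not None and min_len <= length <= max_len:
            return obstacles, start, goal, length
```

Calling sample_problem(seed=0) returns the obstacle set, the start and goal cells, and the optimal plan length of one accepted maze. Rejection sampling keeps the sketch short; a balanced dataset like the one the paper describes would additionally need to control how many problems are kept per plan length.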