Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Flow-of-Options: Diversified and Improved LLM Reasoning by Thinking Through Options

Authors: Lakshmi Nair, Ian Trase, J. Mark Kim

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We present a novel reasoning approach called Flow-of-Options (FoO), designed to address intrinsic biases in Large Language Models (LLMs). Flow-of-Options enables LLMs to systematically explore a diverse range of possibilities in their reasoning, as demonstrated by an FoO-based agentic framework developed for autonomously solving Machine Learning (ML) tasks. FoO enforces diversity in LLM solutions through compressed and interpretable task representations, resulting in improvements of 38.2%–69.2% on standard data science tasks, and 37.4%–47.9% on therapeutic chemistry tasks, as compared to state-of-the-art baselines. With an overall operation cost under $1 per task, our framework is well-suited for cost-sensitive applications. Going beyond tabular classification and regression, we show the broader applicability of our FoO-based agentic system to tasks such as reinforcement learning and image generation.
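As a rough illustration only (not the paper's implementation), the exploration idea behind FoO — offer k candidate options at each of n steps, then evaluate batches of j walks through the resulting option tree — can be sketched with stand-in option labels and a stand-in scoring function:

```python
import itertools
import random

def build_option_tree(k, n):
    """Enumerate all root-to-leaf walks through a tree where each of n
    steps offers k candidate options (options here are just labels)."""
    options_per_step = [[f"step{d}_opt{i}" for i in range(k)] for d in range(n)]
    return list(itertools.product(*options_per_step))

def explore(walks, j, score_fn, rng):
    """Sample a batch of j walks and keep the best-scoring one."""
    batch = rng.sample(walks, min(j, len(walks)))
    return max(batch, key=score_fn)

rng = random.Random(0)
walks = build_option_tree(k=4, n=2)  # k=4 options per step, depth n=2
# Random scores stand in for the LLM/task evaluation the paper uses.
best = explore(walks, j=3, score_fn=lambda w: rng.random(), rng=rng)
print(len(walks), best)
```

In the actual framework the options are LLM-generated and the walks are executed and scored on the task; this sketch only shows the diversified tree-traversal shape.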
Researcher Affiliation Industry Lakshmi Nair¹, Ian Trase¹, J. Mark Kim¹. ¹Flagship Pioneering, Cambridge, MA, USA. Correspondence to: Lakshmi Nair <EMAIL>.
Pseudocode No The paper describes the Flow-of-Options construction and traversal using text and mathematical notation, and provides architectural diagrams (Figures 1, 3, 5), but does not include a distinct, structured pseudocode or algorithm block.
Open Source Code Yes Our code is open-sourced at: https://github.com/flagshippioneering/Flow-of-Options.
Open Datasets Yes We evaluate our framework on 16 tasks obtained from (Guo et al., 2024). Our baselines include DS-Agent (Guo et al., 2024), AutoGluon (Erickson et al., 2020), SELA (Chi et al., 2024), Data Interpreter (DI) (Hong et al., 2024), Autogen (Wu et al., 2024), and zero-shot with Chain-of-Thought (CoT) (Wei et al., 2022). We also evaluate on 17 ADME-Tox tasks using Therapeutics Data Commons (TDC) (Huang et al., 2021). Reinforcement Learning (Cartpole Balancing): We show results of our approach on the classic cartpole balancing problem from OpenAI Gym (Brockman, 2016). Synthetic Image Generation using MNIST data: We show results of our framework on synthesis of MNIST images in Figure 10. We found that our framework ran into issues with the download and use of the Opus-100 dataset using the Hugging Face API.
Dataset Splits Yes Similar to DS-Agent, we retain a separate Dtrain, with testing on Dtest. Here is an example code snippet showing how to load and evaluate a dataset with the name "Caco2_Wang":

from tdc.benchmark_group import admet_group

group = admet_group(path='data/')
predictions_list = []
for seed in [1, 2, 3, 4, 5]:  # For reproducibility
    benchmark = group.get('Caco2_Wang')
    # all benchmark names in a benchmark group are stored in group.dataset_names
    predictions = {}
    name = benchmark['name']
    train_val, test = benchmark['train_val'], benchmark['test']
    train, valid = group.get_train_valid_split(benchmark=name, split_type='default', seed=seed)
    # NOTE: For the dataset, column names are 'Drug' (for the input SMILES strings)
    # and 'Y' (for the output labels)
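TDC benchmark groups report a metric aggregated as mean ± standard deviation across the five seeds in the loop above. The aggregation itself is simple; a minimal sketch independent of the tdc package, with made-up per-seed scores:

```python
import statistics

# Hypothetical per-seed MAE scores for seeds [1, 2, 3, 4, 5].
seed_scores = [0.32, 0.29, 0.35, 0.31, 0.30]

mean = statistics.mean(seed_scores)
std = statistics.pstdev(seed_scores)  # population std; TDC's exact convention may differ
print(f"{mean:.3f} +/- {std:.3f}")
```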
Hardware Specification No The paper mentions using "GPT-4o" and running models on "GPU" in general, but does not provide specific hardware details such as GPU models, CPU types, or memory amounts used for the experiments.
Software Dependencies No The paper mentions several software packages like "Scikit-learn (Sklearn)", "LangChain", "RDKit", "PyTorch", "Hugging Face", "cmath", "pandas", and "numpy", but does not provide specific version numbers for these dependencies.
Experiment Setup Yes For development, we use T = 5 iterations, j = 3 (walks per batch), k = 4 (DS) or k = 3 (TDC), and filter tasks to use n = 2 (DS) or n = 3 (TDC). For Tables 1 and 2, we start with an empty case bank, and disable CBR for all development tasks. The FoO for each development task is then added to the case bank together at the end. For deployment, we enable CBR: FoO are retrieved and reused from the case bank (no new options explored).

Table 7: Hyperparameter settings of our framework.

Task                 | T (iters) | n (FoO depth) | k (num options) | j (walks/batch)
RL                   | 1         | 1             | 4               | 4
Image Generation     | 1         | 1             | 4               | 4
Clustering           | 3         | 3             | 3               | 3
Machine Translation  | 3         | 3             | 3               | 3
Traveling Salesman   | 1         | 1             | 3               | 3
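The Table 7 settings can be captured as a plain configuration mapping (the dictionary keys and function name below are our own naming, not from the paper):

```python
# Hyperparameters transcribed from Table 7 of the paper.
FOO_HYPERPARAMS = {
    # task: T iterations, n (FoO depth), k (num options), j (walks per batch)
    "RL":                  {"T": 1, "n": 1, "k": 4, "j": 4},
    "Image Generation":    {"T": 1, "n": 1, "k": 4, "j": 4},
    "Clustering":          {"T": 3, "n": 3, "k": 3, "j": 3},
    "Machine Translation": {"T": 3, "n": 3, "k": 3, "j": 3},
    "Traveling Salesman":  {"T": 1, "n": 1, "k": 3, "j": 3},
}

def settings_for(task):
    """Look up the per-task hyperparameter setting."""
    return FOO_HYPERPARAMS[task]

print(settings_for("Clustering"))
```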