Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Feedback-Aware MCTS for Goal-Oriented Information Seeking

Authors: Harshita Chopra, Chirag Shah

NeurIPS 2025

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | We evaluate our approach across diverse conversational tasks. Results demonstrate that our system outperforms existing baselines in both task success and efficiency in scenarios requiring complex reasoning and hierarchical decision-making. We also highlight the individual contributions of depth-aware MCTS and cluster-based feedback in enhancing the system's performance. |
| Researcher Affiliation | Academia | Harshita Chopra, University of Washington, Seattle; Chirag Shah, University of Washington, Seattle |
| Pseudocode | Yes | Algorithm 1 MISQ-HF. Require: dataset S, question ratio, embedding model, cluster-embeddings hashmap C, similarity threshold, maximum turns T, MCTS iterations K, LLM |
| Open Source Code | Yes | Our code is available at github.com/harshita-chopra/misq-hf. |
| Open Datasets | Yes | We use the following datasets preprocessed by [8]. In Medical Diagnosis, a patient initially reports a brief description of their symptoms... The DX dataset [21] contains 104 doctor-patient dialogues and five diseases in its test set. The MedDG dataset, which originally included over 17,000 conversations across 15 disease types, was refined by removing inconsistent samples. We used 454 high-quality samples for evaluation. ... We use the FloDial dataset [15], containing 153 dialogues across 153 unique fault types. ... The Common dataset includes 111 items spanning categories such as animals, places, food, and objects, and the Things dataset [7], was filtered to include 200 items. |
| Dataset Splits | No | The DX dataset [21] contains 104 doctor-patient dialogues and five diseases in its test set. The MedDG dataset, which originally included over 17,000 conversations across 15 disease types, was refined by removing inconsistent samples. We used 454 high-quality samples for evaluation. ... We use the FloDial dataset [15], containing 153 dialogues across 153 unique fault types. ... The Common dataset includes 111 items... and the Things dataset [7], was filtered to include 200 items. |
| Hardware Specification | Yes | Experiments were run on an 8-core CPU with 16 GB RAM. |
| Software Dependencies | No | Llama 3.3 70B Instruct [5] and Mixtral 8x7B Instruct [10] were accessed via the AWS Bedrock [1]. GPT-4o was accessed via API from OpenAI [13]. The user (Answerer) was simulated by Llama 3.3 70B Instruct in all tasks. ... Problem descriptions were embedded using DistilBERT [16] for the troubleshooting domain, and Clinical-BERT [19] for medical diagnosis. |
| Experiment Setup | Yes | We set the number of iterations K = 10 and exploration constant C = 0.2. Maximum simulation depth d_s was set to 3 to balance computational efficiency with search effectiveness. For each v, the LLM was prompted to generate m = 3 potential questions to maintain diversity. For the reward calculation in RIG(v), the scaling parameter was set to 0.4. ... We used a decay factor of 0.9 for the bonus rewards. The cluster similarity threshold was set to 0.9 in terms of cosine similarity, and the bonus scaling factor was set to 0.2 for all tasks. |
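
The settings quoted in the Experiment Setup row can be gathered into a small configuration object, which may help when trying to reproduce a run. The sketch below is illustrative only and is not taken from the authors' repository. The class name, field names, and helper functions are all hypothetical. The inclusive threshold comparison is an assumption, and `uct_score` is the textbook UCT rule, which the quoted text does not confirm as the selection rule actually used.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedSetup:
    """Values quoted in the Experiment Setup row; field names are hypothetical."""
    mcts_iterations: int = 10                   # K
    exploration_constant: float = 0.2           # C
    max_simulation_depth: int = 3               # d_s
    questions_per_node: int = 3                 # m
    reward_scaling: float = 0.4                 # scaling parameter in RIG(v)
    bonus_decay: float = 0.9                    # decay factor for bonus rewards
    cluster_similarity_threshold: float = 0.9   # cosine similarity
    bonus_scaling: float = 0.2


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_cluster(query, clusters, threshold):
    """Return the key of the most similar cluster embedding, or None.

    Assumption: a match requires similarity >= threshold (inclusive).
    """
    best_key, best_sim = None, threshold
    for key, embedding in clusters.items():
        sim = cosine_similarity(query, embedding)
        if sim >= best_sim:
            best_key, best_sim = key, sim
    return best_key


def uct_score(total_reward, visits, parent_visits, c):
    """Textbook UCT score; assumed here, not confirmed by the quoted text."""
    if visits == 0:
        return math.inf  # unvisited children are explored first
    exploit = total_reward / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore
```

For example, `match_cluster(embedding, clusters, ReportedSetup().cluster_similarity_threshold)` would look up a problem-description embedding in a cluster-embeddings hashmap, returning `None` when no stored cluster reaches the 0.9 threshold.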