Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Simulation-Free Hierarchical Latent Policy Planning for Proactive Dialogues

Authors: Tao He, Lizi Liao, Yixin Cao, Yuanxing Liu, Yiheng Sun, Zerui Chen, Ming Liu, Bing Qin

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments demonstrate that LDPP outperforms existing methods in two proactive scenarios, even surpassing ChatGPT with only a 1.8-billion-parameter LLM. To verify our approach, we conducted extensive experiments on ExTES (Zheng et al. 2023a), ESConv (Liu et al. 2021b), and P4G (Wang et al. 2019b). We compare our method with various baselines, demonstrating its effectiveness, and detailed analysis experiments further support the framework's validity. Extensive experiments across three proactive dialogue benchmarks show our approach outperforms baselines, with analysis confirming its effectiveness.
Researcher Affiliation | Academia | 1. Research Center for Social Computing and Information Retrieval, Harbin Institute of Technology, Harbin, China; 2. Singapore Management University, Singapore; 3. School of Computer Science, Fudan University
Pseudocode | No | The paper describes the algorithms and framework through text and diagrams (Figure 1), but does not present any formal pseudocode or algorithm blocks.
Open Source Code | No | The paper contains no explicit statement about releasing source code for the described methodology, nor does it provide a link to a code repository.
Open Datasets | Yes | We evaluate the proposed framework on two typical applications of proactive dialogue, ExTES (Zheng et al. 2023b) (emotional support) and P4G (Wang et al. 2019a) (persuasion), representing collaborative and non-collaborative dialogue, respectively. ExTES is an extension of ESConv (Liu et al. 2021b), comprising sufficient dialogues for training (11,117 complete dialogues).
Dataset Splits | Yes | We randomly divide ExTES into 10,717/200/200 dialogues for the train/valid/test sets. P4G includes 1,017 donation-persuasion dialogues in which a persuader attempts to persuade a persuadee to donate to a charity called Save the Children. We randomly choose 100/100 dialogues for validation/testing and take the remaining 817 dialogues as the training set.
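The reported splits amount to a simple random partition of each corpus. The sketch below reproduces the ExTES 10,717/200/200 split; the function name and the fixed seed are illustrative assumptions, since the paper does not state a seed.

```python
import random

def split_dialogues(dialogues, n_valid=200, n_test=200, seed=0):
    """Randomly partition dialogues into train/valid/test sets.

    Mirrors the reported ExTES split (10,717/200/200). The seed is an
    assumption for reproducibility of this sketch, not from the paper.
    """
    rng = random.Random(seed)
    shuffled = list(dialogues)
    rng.shuffle(shuffled)
    valid = shuffled[:n_valid]
    test = shuffled[n_valid:n_valid + n_test]
    train = shuffled[n_valid + n_test:]
    return train, valid, test

# ExTES comprises 11,117 complete dialogues.
train, valid, test = split_dialogues(range(11117))
# len(train), len(valid), len(test) -> 10717, 200, 200
```

The same function with `n_valid=100, n_test=100` over 1,017 dialogues would yield the reported P4G split (817/100/100).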
Hardware Specification | No | The paper mentions "Due to the hardware limitations, we select models under 7B parameters," but does not provide specific details about the GPU/CPU models, memory, or other hardware used for running the experiments.
Software Dependencies | No | The paper mentions specific models such as RoBERTa-Large (Liu et al. 2019) and Qwen1.5-1.8b (Bai et al. 2023), as well as the critic ChatGPT (gpt-3.5-turbo-0613 and -0125), but it does not list any ancillary software dependencies with specific version numbers (e.g., Python, PyTorch, CUDA versions) required to reproduce the environment.
Experiment Setup | Yes | LDPP is implemented with (T, L, K) = (8, 6, 24). Experiments are conducted on the ExTES dataset with K = 6, 12, 18, and 24, while keeping other hyperparameters constant (T = 8, L = 4). We set T to 2, 8, 16, and 24 while keeping (L = 4, K = 24). LoRA Finetuning (x, y) means setting lora rank = x and lora alpha = y; the configurations (32, 64) and (64, 128) are used. τ0 is a hyperparameter, and δ is a predefined threshold.
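The "(x, y)" shorthand for LoRA settings expands to the two standard LoRA hyperparameters. A minimal sketch, assuming dict keys named after common LoRA library conventions (e.g. peft's `r` and `lora_alpha`); the key names are not quoted from the paper.

```python
def lora_config(rank, alpha):
    """Expand the paper's 'LoRA Finetuning (x, y)' notation into the
    two hyperparameters it denotes: lora rank = x, lora alpha = y.
    Key names follow common library conventions (an assumption)."""
    return {"lora_rank": rank, "lora_alpha": alpha}

# The two configurations reported in the paper:
configs = [lora_config(32, 64), lora_config(64, 128)]
```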