Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Workflow Discovery from Dialogues in the Low Data Regime

Authors: Amine El Hattami, Issam H. Laradji, Stefania Raimondo, David Vazquez, Pau Rodriguez, Christopher Pal

TMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present experiments where we extract workflows from dialogues in the Action-Based Conversations Dataset (ABCD). Since the ABCD dialogues follow known workflows to guide agents, we can evaluate our ability to extract such workflows using ground truth sequences of actions. We propose and evaluate an approach that conditions models on the set of possible actions, and we show that using this strategy, we can improve WD performance.
Researcher Affiliation | Collaboration | Amine El Hattami EMAIL ServiceNow Research; Polytechnique Montréal, Montréal, Canada
Pseudocode | No | The paper describes methodologies such as casting tasks as text-to-text problems and outlines input/output formats (e.g., in Figure 9). However, it does not contain any clearly labeled pseudocode blocks or algorithms.
Open Source Code | Yes | Code available at https://github.com/ServiceNow/workflow-discovery
Open Datasets | Yes | ABCD (Chen et al., 2021) contains over 10k human-to-human dialogues split over 8k training samples and 1k each for the eval and test sets. MultiWOZ (Budzianowski et al., 2018) contains over 10k dialogues with dataset splits similar to ABCD across eight domains: Restaurant, Attraction, Hotel, Taxi, Train, Bus, Hospital, Police.
Dataset Splits | Yes | ABCD (Chen et al., 2021) contains over 10k human-to-human dialogues split over 8k training samples and 1k each for the eval and test sets. ... Table 17: WD dataset statistics for the ABCD and MultiWOZ datasets. ABCD: 8034 train / 1004 dev / 1004 test samples. MultiWOZ: 5048 train / 527 dev / 544 test samples.
Hardware Specification | Yes | Finally, we ran all experiments on 4 NVIDIA A100 GPUs with 80 GB of memory, and the training time of the longest experiment was under six hours.
Software Dependencies | No | The paper mentions using the Hugging Face Transformers PyTorch implementation but does not specify version numbers for PyTorch or the Transformers library. It also refers to models such as T5, BART, and PEGASUS without giving specific software versions.
Experiment Setup | Yes | We fine-tuned all models on the WD tasks for 100 epochs for all experiments with a learning rate of 5e-5 with linear decay and a batch size of 16. We set the maximum source length to 1024 and the maximum target length to 256. For the BART model, we set the label smoothing factor to 0.1. We fine-tuned AST-T5 and CDS-T5 for 14 and 21 epochs, respectively, matching the original work of Chen et al. (2021), and used similar hyper-parameters to those used for the WD task. In all experiments, we use r_min = 10 as described in Section 4.1.
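The hyper-parameters quoted under Experiment Setup can be collected into a single configuration object for anyone attempting a reproduction. The sketch below is illustrative, not the authors' code: the numeric values come from the excerpt above, but the class, dictionary keys, and model names are assumptions made for this example.

```python
from dataclasses import dataclass

@dataclass
class WDFinetuneConfig:
    """Fine-tuning hyper-parameters as quoted in the Experiment Setup excerpt."""
    num_epochs: int = 100          # all WD experiments
    learning_rate: float = 5e-5    # with linear decay
    lr_scheduler: str = "linear"
    batch_size: int = 16
    max_source_length: int = 1024  # tokens of dialogue input
    max_target_length: int = 256   # tokens of workflow output
    label_smoothing: float = 0.0   # 0.1 for the BART model only
    r_min: int = 10                # see Section 4.1 of the paper

# Per-model overrides from the excerpt; the keys here are hypothetical labels.
CONFIGS = {
    "wd-t5": WDFinetuneConfig(),
    "wd-bart": WDFinetuneConfig(label_smoothing=0.1),
    "ast-t5": WDFinetuneConfig(num_epochs=14),  # matching Chen et al. (2021)
    "cds-t5": WDFinetuneConfig(num_epochs=21),  # matching Chen et al. (2021)
}

print(CONFIGS["wd-bart"].label_smoothing)  # 0.1
```

Note that the paper does not pin library versions (the Software Dependencies row above is "No"), so a reproduction would still need to choose PyTorch and Transformers versions independently.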