Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Workflow Discovery from Dialogues in the Low Data Regime
Authors: Amine El Hattami, Issam H. Laradji, Stefania Raimondo, David Vazquez, Pau Rodriguez, Christopher Pal
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present experiments where we extract workflows from dialogues in the Action-Based Conversations Dataset (ABCD). Since the ABCD dialogues follow known workflows to guide agents, we can evaluate our ability to extract such workflows using ground truth sequences of actions. We propose and evaluate an approach that conditions models on the set of possible actions, and we show that using this strategy, we can improve WD performance. |
| Researcher Affiliation | Collaboration | Amine El Hattami, ServiceNow Research; Polytechnique Montréal, Montréal, Canada |
| Pseudocode | No | The paper describes methodologies like casting tasks as text-to-text problems and outlines input/output formats (e.g., in Figure 9). However, it does not contain any clearly labeled pseudocode blocks or algorithms. |
| Open Source Code | Yes | Code available at https://github.com/ServiceNow/workflow-discovery |
| Open Datasets | Yes | ABCD (Chen et al., 2021) contains over 10k human-to-human dialogues split over 8k training samples and 1k for each of the eval and test sets. MultiWOZ (Budzianowski et al., 2018) contains over 10k dialogues with dataset splits similar to ABCD across eight domains: Restaurant, Attraction, Hotel, Taxi, Train, Bus, Hospital, Police. |
| Dataset Splits | Yes | ABCD (Chen et al., 2021) contains over 10k human-to-human dialogues split over 8k training samples and 1k for each of the eval and test sets. ... Table 17: WD dataset statistics for ABCD and MultiWOZ datasets. ABCD: 8034 train / 1004 dev / 1004 test samples. MultiWOZ: 5048 train / 527 dev / 544 test samples. |
| Hardware Specification | Yes | Finally, we ran all experiments on 4 NVIDIA A100 GPUs with 80G memory, and the training time of the longest experiment was under six hours. |
| Software Dependencies | No | The paper mentions using 'Huggingface Transformers Pytorch implementation' but does not specify version numbers for PyTorch or the Transformers library. It also refers to models like T5, BART, and PEGASUS but without specific software versions. |
| Experiment Setup | Yes | We fine-tuned all models on the WD tasks for 100 epochs for all experiments with a learning rate of 5e-5 with linear decay and a batch size of 16. We set the maximum source length to 1024 and the maximum target length to 256. For the BART model, we set the label smoothing factor to 0.1. We fine-tuned AST-T5, and CDS-T5 for 14 and 21 epochs, respectively, matching the original work of Chen et al. (2021), and used similar hyper-parameters used for the WD task. In all experiments, we use rmin = 10 as described in Section 4.1. |
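For readers who want to mirror the reported setup, the hyper-parameters quoted in the Experiment Setup row can be collected into a small configuration object. This is a minimal sketch: the class and field names below are illustrative and do not come from the authors' repository; only the values are taken from the paper's reported settings.

```python
from dataclasses import dataclass


@dataclass
class WDFineTuneConfig:
    """Fine-tuning settings reported for the WD tasks (field names are illustrative)."""
    epochs: int = 100                # fine-tuning epochs for all WD experiments
    learning_rate: float = 5e-5      # with linear decay
    batch_size: int = 16
    max_source_length: int = 1024    # maximum source (dialogue) length in tokens
    max_target_length: int = 256     # maximum target (workflow) length in tokens
    label_smoothing: float = 0.1     # applied to the BART model only
    r_min: int = 10                  # rmin, as described in Section 4.1 of the paper


config = WDFineTuneConfig()
print(config.learning_rate, config.batch_size)
```

Note that AST-T5 and CDS-T5 were fine-tuned for 14 and 21 epochs respectively (matching Chen et al., 2021), so the `epochs` field above applies only to the WD experiments.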