What Do You Mean ‘Why?’: Resolving Sluices in Conversations
Authors: Victor Petrén Bach Hansen, Anders Søgaard (pp. 7887–7894)
AAAI 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present a crowd-sourced dataset containing annotations of sluices from over 4,000 dialogues collected from conversational QA datasets, as well as a series of strong baseline architectures. We conduct a series of baseline experiments on this task, using both encoder-decoder frameworks, as well as language modelling objectives, and show through human evaluation of the predicted resolutions that these baselines are quite strong and at times even rival the quality of human annotators. |
| Researcher Affiliation | Collaboration | Victor Petrén Bach Hansen,1,2 Anders Søgaard1,3 1Department of Computer Science, University of Copenhagen, Denmark 2Topdanmark A/S, Denmark 3Google Research, Berlin {victor.petren, soegaard}@di.ku.dk |
| Pseudocode | No | The paper does not contain any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We release the raw annotated version of the conversational sluicing corpus, as well as our cleaned version which we report our results on, including the splits used.3 3https://github.com/vpetren/conv_sluice_resolution |
| Open Datasets | Yes | we crawl existing conversational QA datasets, namely QuAC1 and CoQA,2 for question-answer contexts with one-word follow-up questions. 1https://quac.ai/ 2https://stanfordnlp.github.io/coqa/ |
| Dataset Splits | Yes | In our experiments, we use the splits outlined in Table 1 (also made publicly available). Split (Why / Where / Who / What / When / Total): train 851 / 714 / 513 / 302 / 702 / 3082; val 84 / 71 / 54 / 39 / 52 / 300; test 229 / 183 / 97 / 83 / 201 / 793; Total 1164 / 968 / 664 / 424 / 955 / 4175 |
| Hardware Specification | Yes | Unlike the LSTMseq2seq and Transformer, we do not fine-tune the GPT-2 model until convergence, but instead we ran it for 18 hours on an Nvidia Titan X GPU. |
| Software Dependencies | No | The paper mentions software like GloVE, Adam optimizer, LSTM, Transformer, and GPT-2, and points to PyTorch implementations, but does not provide specific version numbers for these software components or libraries. |
| Experiment Setup | Yes | For both the encoder and decoder, we use a standard two-layer LSTM (Hochreiter and Schmidhuber 1997), with a hidden state size of 512, regularized using a dropout rate of 0.5. We initialize the embedding matrix with 300-dimensional GloVe (Pennington, Socher, and Manning 2014), which remains fixed during training. We optimize the end-to-end network using Adam (Kingma and Ba 2014), with the default learning rate of 0.001. As our conversational sluicing resolution corpus is small in comparison to the corpora used in the experiments by Vaswani et al. (2017), we limit the number of encoder/decoder layers to 3 (compared to 6 in their work), after observing improvements on our validation data. |
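The quoted setup pins down the LSTM baseline's hyperparameters fairly completely. As a minimal PyTorch sketch (not the authors' released code, which is at the GitHub link above), the described configuration — two-layer encoder and decoder LSTMs, hidden size 512, dropout 0.5, frozen 300-dimensional embeddings, Adam at lr 0.001 — could look like this; the random embedding initialization stands in for the GloVe vectors, and the class name is hypothetical:

```python
# Hypothetical sketch of the paper's LSTM seq2seq baseline configuration.
# Random embeddings stand in for the (frozen) 300-d GloVe vectors.
import torch
import torch.nn as nn

class Seq2SeqBaseline(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, hidden=512,
                 layers=2, dropout=0.5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.embed.weight.requires_grad = False  # embeddings stay fixed
        self.encoder = nn.LSTM(emb_dim, hidden, num_layers=layers,
                               dropout=dropout, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hidden, num_layers=layers,
                               dropout=dropout, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, src, tgt):
        # Encode the dialogue context; seed the decoder with the final state.
        _, state = self.encoder(self.embed(src))
        dec_out, _ = self.decoder(self.embed(tgt), state)
        return self.out(dec_out)  # per-token vocabulary logits

model = Seq2SeqBaseline(vocab_size=10000)
# Adam with the default learning rate; frozen embeddings are excluded.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=0.001)

src = torch.randint(0, 10000, (4, 12))  # batch of 4 contexts, length 12
tgt = torch.randint(0, 10000, (4, 8))   # target resolutions, length 8
logits = model(src, tgt)                # shape: (4, 8, 10000)
```

This omits attention, teacher forcing, and decoding details, which the paper does not fully specify in the quoted passage.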