Classifying Emails into Human vs Machine Category

Authors: Changsung Kang, Hongwei Shang, Jean-Marc Langlois (pp. 7069-7077)

AAAI 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results on editorial data show that our full model improves the adjusted-recall from 70.5% to 78.8% and the precision from 94.7% to 96.0% compared to the old production model. Also, our full model significantly outperforms a state-of-the-art BERT model at this task.
Researcher Affiliation | Industry | Changsung Kang¹, Hongwei Shang², Jean-Marc Langlois²; ¹Walmart Global Tech, ²Yahoo Research
Pseudocode | No | The paper describes the model architectures with diagrams and textual explanations of their components and flow, but it does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper states, 'Our new model has been deployed to the current production system (Yahoo Mail 6)', but it does not provide any information or links regarding open-source code availability for the methodology described.
Open Datasets | No | Both the training and testing data are constructed by sampling from the Yahoo Mail corpus. Previous Yahoo researchers (Grbovic et al. 2014) took great effort to collect ground-truth labels for the human and machine categories at the sender level. However, no concrete access information (such as a public link, DOI, or repository) is provided for these datasets, implying they are not publicly available.
Dataset Splits | Yes | Over-sampling was used to create the validation data (4K+ messages) and the test data (4K+ messages) to resolve the small-class issue (details are presented in the Appendix). For all models trained (including BERT), the best checkpoint is selected based on the validation data (see the checkpoint-selection sketch after this table).
Hardware Specification | No | The paper does not provide specific details on the hardware used to run the experiments, such as GPU/CPU models or memory.
Software Dependencies | No | The paper mentions using the Adam optimizer but does not specify any software dependencies with version numbers, such as programming languages, deep learning frameworks, or other libraries.
Experiment Setup | Yes | All these models use the Adam optimizer with learning rate 0.001 and batch size 128. We use dropout rates of 0.4, 0.6, 0.4, and 0.6 for the content, sender, action, and salutation models, respectively. For the maximum sequence length s during training, we use s_subject = 30 and s_content = 1000 for the content and action models; s_address = 1000 (the number of characters) and s_name = 30 for the sender model; and s_salutation = 10 for the salutation model. (A configuration sketch follows this table.)
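For concreteness, the quoted Experiment Setup can be written down as a small configuration. This is a minimal hypothetical sketch in Python/PyTorch, not the authors' code (none was released): only the numeric values come from the paper, while the names MAX_LEN, DROPOUT, BATCH_SIZE, truncate, and make_optimizer are illustrative assumptions.

```python
import torch

# Values quoted in the Experiment Setup row; all container and function
# names below are hypothetical, since the authors released no code.
BATCH_SIZE = 128

# Maximum sequence lengths during training. s_address counts characters;
# the other limits apply to token sequences.
MAX_LEN = {
    "subject": 30,      # s_subject, used by the content and action models
    "content": 1000,    # s_content, used by the content and action models
    "address": 1000,    # s_address (characters), used by the sender model
    "name": 30,         # s_name, used by the sender model
    "salutation": 10,   # s_salutation, used by the salutation model
}

# Dropout rate per sub-model, as quoted in the table row above.
DROPOUT = {"content": 0.4, "sender": 0.6, "action": 0.4, "salutation": 0.6}

def truncate(sequence, field):
    """Clip an input sequence to the paper's maximum length for that field."""
    return sequence[: MAX_LEN[field]]

def make_optimizer(model):
    """Every sub-model is trained with Adam at learning rate 0.001."""
    return torch.optim.Adam(model.parameters(), lr=0.001)
```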
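The checkpoint-selection rule noted in the Dataset Splits row can be sketched the same way: train, score each epoch on the validation set, and keep the best state. The loop below is an assumption about how such selection is typically implemented; the paper states only that the best checkpoint is chosen on validation data, not the metric or schedule used.

```python
import copy

def train_with_best_checkpoint(model, train_step, evaluate, loaders, epochs=10):
    """Return the model state that scored best on the validation split.

    `train_step`, `evaluate`, `loaders`, and `epochs` are hypothetical
    placeholders, and the model is assumed to be a torch.nn.Module;
    the paper does not describe its training loop.
    """
    best_score, best_state = float("-inf"), None
    for epoch in range(epochs):
        for batch in loaders["train"]:           # batches of 128 per the paper
            train_step(model, batch)
        score = evaluate(model, loaders["val"])  # e.g. validation accuracy
        if score > best_score:                   # keep only the best state seen
            best_score = score
            best_state = copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)            # restore the best checkpoint
    return model, best_score
```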