Soft Target-Enhanced Matching Framework for Deep Entity Matching

Authors: Wenzhou Dou, Derong Shen, Xiangmin Zhou, Tiezheng Nie, Yue Kou, Hang Cui, Ge Yu

AAAI 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We conduct extensive experiments over open datasets and the results show that our proposed STEAM outperforms the state-of-the-art EM approaches in terms of effectiveness and label efficiency." "Experiments: In this section, we evaluate our proposed STEAM framework on two open EM benchmarks (eight datasets) to demonstrate its performance against existing SOTA methods."
Researcher Affiliation | Academia | 1 Northeastern University, China; 2 RMIT University, Australia; 3 University of Illinois at Urbana-Champaign, USA
Pseudocode | No | The paper describes the model architecture and training process with equations, but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not state that its source code is released, nor does it provide a link to a code repository for the described methodology.
Open Datasets | Yes | "We evaluate STEAM framework on WDC benchmark (Primpeli, Peeters, and Bizer 2019) and Deep Matcher benchmark (Mudgal et al. 2018). The summaries of them are shown in Table 2 and Table 3."
Dataset Splits | Yes | "For WDC benchmark, we split the training/validation sets with the ratio of 4:1, which is the same as Ditto (Li et al. 2020b). For Deep Matcher benchmark, we split the training/validation/testing sets with the ratio of 3:1:1, which is the same as existing methods like Deep Matcher (Mudgal et al. 2018), Ditto (Li et al. 2020b), and HierGAT (Yao et al. 2022)."
Hardware Specification | No | The paper mentions implementing the framework with PyTorch and Hugging Face and using RoBERTa-base, but it does not specify any particular hardware (e.g., GPU models, CPU types, or cloud computing instance details) used for the experiments.
Software Dependencies | No | The paper mentions using the PyTorch and Hugging Face libraries, and BERT-like PLMs (BERT, RoBERTa, DistilBERT), but it does not specify version numbers for any of these software components.
Experiment Setup | Yes | "The size of mini-batch is 64, and the maximum length of the input is limited to 128 (256 for the A-B dataset); any tokens beyond that limit are truncated. We train STEAM using the Adam optimizer with a learning rate of 3e-5. The maximum training epoch is 50, and we adopt an early-stop strategy with patience varying from 5 to 15 according to the dataset. We adopt data augmentation (e.g., dropping tokens, swapping records, and swapping attribute values) and a dropout strategy with probability 0.5. For the soft supervised training part, we set the temperature τ = 10 to automatically generate the soft labels, and the hyperparameter λ is set to 1."
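The reported split ratios (4:1 for WDC, 3:1:1 for the Deep Matcher benchmark) can be reproduced with a simple proportional partition. The sketch below is illustrative only: the function name, seeding, and shuffling are assumptions, not code from the paper.

```python
import random

def split_ratio(records, ratios=(3, 1, 1), seed=0):
    """Partition `records` into parts proportional to `ratios`.

    A minimal sketch of a train/validation/test split such as the 3:1:1
    ratio reported for the Deep Matcher benchmark; the paper does not
    publish its splitting code, so seeding and shuffling are assumed.
    """
    records = records[:]                     # copy so the caller's list is untouched
    random.Random(seed).shuffle(records)     # deterministic shuffle for reproducibility
    total, n = sum(ratios), len(records)
    parts, start = [], 0
    for i, r in enumerate(ratios):
        # the last part takes the remainder so no record is dropped
        end = n if i == len(ratios) - 1 else start + (n * r) // total
        parts.append(records[start:end])
        start = end
    return parts

train, valid, test = split_ratio(list(range(100)), ratios=(3, 1, 1))
# 100 records at 3:1:1 yield parts of 60 / 20 / 20
```

For the WDC benchmark, the same helper with `ratios=(4, 1)` produces the 4:1 training/validation split.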
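The setup quote mentions two hyperparameters without spelling out the formulas: a temperature τ = 10 for generating soft labels and a weight λ = 1. A common reading, shown in the sketch below, is the standard soft-target objective: temperature-scaled softmax over teacher logits plus a λ-weighted soft cross-entropy term. The exact form of the paper's loss is not given, so every function here is an assumption following the usual knowledge-distillation convention.

```python
import math

def soft_labels(logits, tau=10.0):
    # Temperature-scaled softmax (assumed form; the paper only states tau = 10).
    # Higher tau flattens the distribution, producing "softer" targets.
    scaled = [z / tau for z in logits]
    m = max(scaled)                          # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, pred, eps=1e-12):
    # Cross-entropy between a target distribution and predicted probabilities.
    return -sum(t * math.log(p + eps) for t, p in zip(target, pred))

def combined_loss(hard_target, teacher_logits, student_probs, tau=10.0, lam=1.0):
    # Assumed objective: hard-label cross-entropy plus lambda-weighted
    # soft-label cross-entropy; the paper reports lambda = 1.
    soft = soft_labels(teacher_logits, tau)
    return (cross_entropy(hard_target, student_probs)
            + lam * cross_entropy(soft, student_probs))
```

With τ = 10 the teacher distribution stays close to uniform (e.g. logits `[2.0, -1.0]` give roughly `[0.57, 0.43]` rather than a near-one-hot pair), which is what makes the soft supervision informative about near-miss pairs.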