WIERT: Web Information Extraction via Render Tree

Authors: Zimeng Li, Bo Shao, Linjun Shou, Ming Gong, Gen Li, Daxin Jiang

AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate WIERT on the Klarna product page dataset, a manually labeled dataset of renderable e-commerce web pages, demonstrating its effectiveness and robustness.
Researcher Affiliation | Collaboration | (1) School of Computer Science and Engineering, Beihang University, Beijing, China; (2) Microsoft STCA
Pseudocode | No | The paper describes the model architecture and training process in text and a diagram (Figure 3), but does not include any pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain any explicit statement about open-sourcing the code or provide a link to a code repository.
Open Datasets | Yes | Klarna product page dataset: The Klarna product page dataset contains 51,701 manually labeled product pages from 8,175 real e-commerce websites (Hotti et al. 2021).
Dataset Splits | Yes | The Klarna dataset provides an official train/test split. In our experiments, we keep the official test set to measure generalization performance and split the official train set into a new train set and a validation set, without overlap, according to the ratio of 9:1.
Hardware Specification | Yes | All experiments are conducted on eight V100 GPUs.
Software Dependencies | No | The paper mentions using a "pretrained Big Bird model" and "BERT or RoBERTa as backbones" but does not specify version numbers for these or any other software dependencies such as programming languages or libraries.
Experiment Setup | Yes | For all experiments, we set the batch size to 16 and use an initial learning rate of 5 × 10⁻⁵, which decays to 85% after each epoch. Through coarse hyperparameter tuning, we set the weights of the three losses as λ1 = 1, λ2 = 0.2, λ3 = 0.1.
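
Since the paper does not release code, the following is a minimal sketch of the reported data split and training configuration as we understand it. The dataset, model, and individual losses below are placeholder stand-ins; only the 9:1 train/validation split, the batch size of 16, the initial learning rate of 5e-5 with 85% per-epoch decay, and the loss weights λ1 = 1, λ2 = 0.2, λ3 = 0.1 are taken from the paper.

```python
# Illustrative sketch only: the dataset, model, and losses are synthetic
# stand-ins, since WIERT's code and exact loss definitions are not released.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset, random_split

# Placeholder data standing in for the official Klarna train split.
features = torch.randn(1000, 128)
labels = torch.randint(0, 2, (1000,))
full_train = TensorDataset(features, labels)

# 9:1 train/validation split of the official train set; the official test set is untouched.
n_val = len(full_train) // 10
train_set, val_set = random_split(full_train, [len(full_train) - n_val, n_val])
train_loader = DataLoader(train_set, batch_size=16, shuffle=True)  # batch size 16

model = nn.Linear(128, 2)  # stand-in for the WIERT encoder and prediction heads
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)  # initial learning rate 5e-5
# Learning rate decays to 85% of its previous value after each epoch.
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.85)

# Loss weights from the paper's coarse hyperparameter tuning.
lambda1, lambda2, lambda3 = 1.0, 0.2, 0.1
criterion = nn.CrossEntropyLoss()

for epoch in range(3):  # epoch count is not specified in this excerpt
    for x, y in train_loader:
        logits = model(x)
        # The paper combines three task losses; the same placeholder loss is
        # reused here purely to show the weighted sum.
        loss1 = loss2 = loss3 = criterion(logits, y)
        loss = lambda1 * loss1 + lambda2 * loss2 + lambda3 * loss3
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```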