WebLINX: Real-World Website Navigation with Multi-Turn Dialogue

Authors: Xing Han Lu, Zdeněk Kasner, Siva Reddy

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We use the selected elements, along with screenshots and action history, to assess a variety of models for their ability to replicate human behavior when navigating the web. Our experiments span from small text-only to proprietary multimodal LLMs.
Researcher Affiliation | Collaboration | Xing Han Lu *1,2, Zdeněk Kasner *1,3, Siva Reddy 1,2,4 (1 Mila Quebec AI Institute, 2 McGill University, 3 Institute of Formal and Applied Linguistics, Charles University, 4 Facebook CIFAR AI Chair).
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks clearly labeled as such.
Open Source Code | Yes | Our code, data and models are available for research: https://mcgillnlp.github.io/weblinx.
Open Datasets | Yes | To address this problem, we introduce WEBLINX (§3), a benchmark containing 2337 demonstrations of conversational web navigation produced by human experts across 155 real-world websites. ... Our code, data and models are available for research: https://mcgillnlp.github.io/weblinx.
Dataset Splits | Yes | In addition to a TRAIN split, we create VALID and TEST_IID to assess in-domain generalization, and 4 out-of-domain splits for various scenarios (see Table 2). ... VALID: in-domain demos for hyperparameter selection. (See the data-loading sketch below the table.)
Hardware Specification | Yes | Using the same environment, CPU (AMD EPYC 7453) and GPU (RTX A6000)...
Software Dependencies | No | The paper mentions software components such as the transformers library (Wolf et al., 2019), the AdamW optimizer, FSDP, and bfloat16 precision, but does not provide version numbers for these dependencies (e.g., the exact PyTorch or Transformers release).
Experiment Setup | Yes | Table 14: The training hyperparameters of all models. We give the number of epochs, the batch size (batch), the learning rate (LR), the number of gradient accumulation steps (Accum.), the number of warmup steps (Warm.), and whether the model uses flash attention (FA2; Dao et al. 2022; Dao 2023). (See the training-configuration sketch below the table.)
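
The Open Datasets and Dataset Splits rows point to the project page for the released demonstrations. As a minimal sketch, assuming the data is published on the Hugging Face Hub under an id such as McGill-NLP/weblinx (the actual id and split names should be checked against https://mcgillnlp.github.io/weblinx), the splits could be pulled in as follows:

    from datasets import load_dataset

    # Hypothetical Hub id; verify against the project page before use.
    DATASET_ID = "McGill-NLP/weblinx"

    # Load the in-domain splits named in the paper (TRAIN, VALID, TEST_IID).
    # Split names on the Hub may differ (e.g. "validation" vs. "valid"); adjust as needed.
    train = load_dataset(DATASET_ID, split="train")
    valid = load_dataset(DATASET_ID, split="validation")

    print(f"{len(train)} training examples, {len(valid)} validation examples")
    print(train[0].keys())  # inspect the fields available per example

Keeping VALID strictly for hyperparameter selection, as the paper does, leaves the four out-of-domain splits untouched during model development.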
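
The Software Dependencies and Experiment Setup rows name the training components (transformers, AdamW, FSDP, bfloat16, flash attention) and the hyperparameters reported in Table 14. A minimal sketch of how such a configuration could be expressed with the transformers library is shown below; the numeric values are illustrative placeholders rather than the paper's per-model settings, and the output directory is hypothetical.

    from transformers import TrainingArguments

    # Illustrative values only; the actual per-model settings are listed in Table 14 of the paper.
    training_args = TrainingArguments(
        output_dir="weblinx-finetune",      # hypothetical output path
        num_train_epochs=3,                 # number of epochs
        per_device_train_batch_size=8,      # batch size (batch)
        learning_rate=5e-5,                 # learning rate (LR)
        gradient_accumulation_steps=4,      # gradient accumulation steps (Accum.)
        warmup_steps=100,                   # warmup steps (Warm.)
        optim="adamw_torch",                # AdamW optimizer
        bf16=True,                          # bfloat16 precision (requires bf16-capable hardware)
        fsdp="full_shard",                  # shard model states across GPUs with FSDP
    )

    # Flash attention (FA2) is typically enabled when loading the model, e.g. by passing
    # attn_implementation="flash_attention_2" to from_pretrained in recent transformers versions.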