WebLINX: Real-World Website Navigation with Multi-Turn Dialogue

Authors: Xing Han Lu, Zdeněk Kasner, Siva Reddy

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We use the selected elements, along with screenshots and action history, to assess a variety of models for their ability to replicate human behavior when navigating the web. Our experiments span from small text-only to proprietary multimodal LLMs.
Researcher Affiliation | Collaboration | Xing Han Lu *1,2, Zdeněk Kasner *1,3, Siva Reddy 1,2,4 (1 Mila Quebec AI Institute, 2 McGill University, 3 Institute of Formal and Applied Linguistics, Charles University, 4 Facebook CIFAR AI Chair).
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks clearly labeled as such.
Open Source Code | Yes | Our code, data and models are available for research: https://mcgillnlp.github.io/weblinx.
Open Datasets | Yes | To address this problem, we introduce WEBLINX (§3), a benchmark containing 2337 demonstrations of conversational web navigation produced by human experts across 155 real-world websites. ... Our code, data and models are available for research: https://mcgillnlp.github.io/weblinx.
Dataset Splits | Yes | In addition to a TRAIN split, we create VALID and TEST_IID to assess in-domain generalization, and 4 out-of-domain splits for various scenarios (see Table 2). ... VALID: in-domain demos for hyperparameter selection. (See the data-loading sketch below the table.)
Hardware Specification | Yes | Using the same environment, CPU (AMD EPYC 7453) and GPU (RTX A6000)...
Software Dependencies | No | The paper mentions software components such as the transformers library (Wolf et al., 2019), the AdamW optimizer, FSDP, and bfloat16 precision, but does not provide version numbers for these dependencies (e.g., the exact PyTorch or Transformers release).
Experiment Setup | Yes | Table 14: The training hyperparameters of all models. We give the number of epochs, the batch size (batch), the learning rate (LR), the number of gradient accumulation steps (Accum.), the number of warmup steps (Warm.), and whether the model uses flash attention (FA2; Dao et al. 2022; Dao 2023). (See the training-configuration sketch below the table.)
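
The Open Datasets and Dataset Splits rows point to the project page for the released demonstrations. As a minimal sketch, assuming the data is published on the Hugging Face Hub under an id such as McGill-NLP/weblinx (the actual id and split names should be checked against https://mcgillnlp.github.io/weblinx), the splits could be pulled in as follows:

    from datasets import load_dataset

    # Hypothetical Hub id; verify against the project page before use.
    DATASET_ID = "McGill-NLP/weblinx"

    # Load the in-domain splits named in the paper (TRAIN, VALID, TEST_IID).
    # Split names on the Hub may differ (e.g. "validation" vs. "valid"); adjust as needed.
    train = load_dataset(DATASET_ID, split="train")
    valid = load_dataset(DATASET_ID, split="validation")

    print(f"{len(train)} training examples, {len(valid)} validation examples")
    print(train[0].keys())  # inspect the fields available per example

Keeping VALID strictly for hyperparameter selection, as the paper does, leaves the four out-of-domain splits untouched during model development.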
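
The Software Dependencies and Experiment Setup rows name the training components (transformers, AdamW, FSDP, bfloat16, flash attention) and the hyperparameters reported in Table 14. A minimal sketch of how such a configuration could be expressed with the transformers library is shown below; the numeric values are illustrative placeholders rather than the paper's per-model settings, and the output directory is hypothetical.

    from transformers import TrainingArguments

    # Illustrative values only; the actual per-model settings are listed in Table 14 of the paper.
    training_args = TrainingArguments(
        output_dir="weblinx-finetune",      # hypothetical output path
        num_train_epochs=3,                 # number of epochs
        per_device_train_batch_size=8,      # batch size (batch)
        learning_rate=5e-5,                 # learning rate (LR)
        gradient_accumulation_steps=4,      # gradient accumulation steps (Accum.)
        warmup_steps=100,                   # warmup steps (Warm.)
        optim="adamw_torch",                # AdamW optimizer
        bf16=True,                          # bfloat16 precision (requires bf16-capable hardware)
        fsdp="full_shard",                  # shard model states across GPUs with FSDP
    )

    # Flash attention (FA2) is typically enabled when loading the model, e.g. by passing
    # attn_implementation="flash_attention_2" to from_pretrained in recent transformers versions.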