Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Towards Generalist Robot Learning from Internet Video: A Survey
Authors: Robert McCarthy, Daniel C.H. Tan, Dominik Schmidt, Fernando Acero, Nathan Herr, Yilun Du, Thomas G. Thuruthel, Zhibin Li
JAIR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | This survey systematically examines the emerging field of LfV. We first outline essential concepts, including detailing fundamental LfV challenges such as distribution shift and missing action labels in video data. Next, we comprehensively review current methods for extracting knowledge from large-scale internet video, overcoming LfV challenges, and improving robot learning through video-informed training. The survey concludes with a critical discussion of future opportunities. |
| Researcher Affiliation | Collaboration | Robert McCarthy, University College London, United Kingdom; Daniel C.H. Tan, University College London, United Kingdom; Dominik Schmidt, Weco AI, United Kingdom; Fernando Acero, University College London, United Kingdom; Nathan Herr, University College London, United Kingdom; Yilun Du, Massachusetts Institute of Technology, United States of America; Thomas G. Thuruthel, University College London, United Kingdom; Zhibin Li, University College London, United Kingdom |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. It is a survey paper that reviews existing literature, rather than presenting a new algorithm. |
| Open Source Code | No | The paper is a survey and does not describe a new methodology for which source code would be released. There is no explicit statement or link indicating the release of open-source code for the work presented in this paper. |
| Open Datasets | Yes | Table 1. Existing video datasets. Listed are: (top) large-scale, internet-scraped video datasets, and (bottom) robotics-relevant, manually-recorded video datasets, ordered by decreasing total video duration. Details regarding Caption Type can be found in Section 5.1.2. Columns: Dataset, Content, Size (hours), # Clips, Caption Type, Collection Method. InternVid [284]: YouTube, 760,000, 230M, Generated, Internet. HD-VILA-100M [300]: YouTube, 370,000, 103M, ASR, Internet. YT-Temporal-180M [322]: YouTube, –, 180M, ASR, Internet. WTS-70M [255]: YouTube, 190,000, 70M, Metadata, Internet. HowTo100M [181]: Instruction, 134,000, 136M, ASR, Internet. WebVid-10M [17]: YouTube, 52,000, 10M, Alt-text, Internet. VideoCC3M [187]: YouTube, 18,000, 6M, Transfer, Internet. 100 Days of Hands [239]: Actions, 3,100, 27k, Metadata, Internet. Ego4D [87]: Everyday, 3,600, 28k, Manual, Manual. Ego-Exo4D [86]: Skilled, 1,400, 6k, Manual, Manual. SS-v2 [85]: Actions, 245, 221k, Manual, Manual. RoboVQA [235]: Everyday, 230, 98k, Manual, Manual. Epic-Kitchens-100 [58]: Cooking, 100, 700, Manual, Manual. |
| Dataset Splits | No | The paper is a survey of existing research and does not present new experimental results that would require specifying training/test/validation dataset splits. It discusses datasets in general terms but does not define splits for its own work. |
| Hardware Specification | No | The paper is a survey and does not conduct its own experiments. Therefore, no hardware specifications for running experiments are provided. |
| Software Dependencies | No | The paper is a survey and does not conduct its own experiments. It does not list specific software dependencies with version numbers required for replication of its own work. |
| Experiment Setup | No | The paper is a survey of existing research and does not describe an experimental setup with hyperparameters or training configurations for its own methodology. |