Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Reading Recognition in the Wild

Authors: Charig Yang, Samiul Alam, Shakhrul Iman Siam, Michael Proulx, Lambert Mathias, Kiran Somasundaram, Luis Pesqueira, James Fort, Sheroze Sheriffdeen, Omkar Parkhi, Yuheng Ren, Mi Zhang, Yuning Chai, Richard Newcombe, Hyo Jin Kim

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We first introduce the first-of-its-kind large-scale multimodal Reading in the Wild dataset, containing 100 hours of reading and non-reading videos in diverse and realistic scenarios. We then identify three modalities (egocentric RGB, eye gaze, head pose) that can be used to solve the task, and present a flexible transformer model that performs the task using these modalities, either individually or combined. We show that these modalities are relevant and complementary to the task, and investigate how to efficiently and effectively encode each modality. Additionally, we show the usefulness of this dataset towards classifying types of reading, extending current reading understanding studies conducted in constrained settings to larger scale, diversity and realism. Code, model, and data will be public.
Researcher Affiliation | Collaboration | 1 Meta Reality Labs Research; 2 VGG, University of Oxford; 3 The Ohio State University
Pseudocode | No | The paper describes the model architecture and steps in natural language and diagrams (Figure 4) but does not provide structured pseudocode or algorithm blocks.
Open Source Code | No | Code, model, and data will be public. The code and models will be released alongside the dataset.
Open Datasets | Yes | https://www.projectaria.com/datasets/reading-in-the-wild/
Dataset Splits | No | We split the Seattle subset into training, validation, and test sets, and train the model on the training set. We evaluate on (i) the test set of the Seattle subset, and (ii) the entire Columbus subset. Table 1: Seattle subset ... Alternating (train/val/test) ... Columbus subset ... Mirror setups (test).
Hardware Specification | No | All models are trained using a single GPU. The model can indeed comfortably run real-time on Aria Gen 2 glasses on-device.
Software Dependencies | No | Audio transcribed using Whisper X [2]. We use Adam optimizer.
Experiment Setup | Yes | Model. For the encoders, we use three layers of 1D convolution (kernel size 9, 32 dims) for gaze and IMU, and three layers of 2D convolution (kernel size 5, 32 dims) for RGB. We then feed the tokens as input to three layers of transformer encoder (32 dims, 2 heads) before linearly projecting the [CLS] token to two classes. The combined model is lightweight, with 137k parameters. Training. We impose modality dropout such that there is an equal probability of using one, two, or three modalities at the same time, as well as perform rotation augmentation. We use Adam optimizer with learning rate 1e-3 for ten epochs. All models are trained using a single GPU.
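
The modality dropout described in the Experiment Setup entry can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the number of active modalities is drawn uniformly from {1, 2, 3} and that the specific subset is then drawn uniformly among subsets of that size (the quoted text specifies only the first part). The modality names and the function names `sample_active_modalities` and `apply_modality_dropout` are hypothetical.

```python
import random

# Hypothetical modality names; the quoted setup mentions RGB, gaze, and IMU encoders.
MODALITIES = ("rgb", "gaze", "imu")


def sample_active_modalities(rng):
    """Pick which modalities stay active for one training sample.

    Assumption: the count k is uniform over {1, 2, 3}, and the k modalities
    are then chosen uniformly at random without replacement.
    """
    k = rng.choice((1, 2, 3))
    return frozenset(rng.sample(MODALITIES, k))


def apply_modality_dropout(inputs, rng):
    """Return a copy of `inputs` with dropped modalities replaced by None.

    `inputs` maps a modality name to its features (any Python object here).
    """
    active = sample_active_modalities(rng)
    return {name: (feat if name in active else None) for name, feat in inputs.items()}


if __name__ == "__main__":
    rng = random.Random(0)
    sample = {"rgb": "rgb_features", "gaze": "gaze_features", "imu": "imu_features"}
    print(apply_modality_dropout(sample, rng))
```

Under these assumptions at least one modality is always kept, so a model trained this way sees every single-modality, two-modality, and three-modality combination during training.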