Towards In-context Scene Understanding

Authors: Ivana Balažević, David Steiner, Nikhil Parthasarathy, Relja Arandjelović, Olivier J. Hénaff

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental
We demonstrate the generality of Hummingbird representations through retrieval-based scene understanding on several downstream tasks (Section 4.1): semantic segmentation on PASCAL VOC [25] and ADE20K [83] with mean IoU (mIoU) as metric, and monocular depth estimation on NYUv2 [60] with root-mean-square error (RMSE) as metric. We further show that, in the low-data regime (Section 4.2) and when looking at adaptation speed (Section 4.3), Hummingbird with NN retrieval outperforms other pretraining techniques and decoding mechanisms, including end-to-end finetuning.
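For readers unfamiliar with the two metrics named above, a minimal NumPy sketch of mIoU and RMSE (assuming integer class maps for segmentation and metric depth maps for NYUv2; this is illustrative, not code from the paper):

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    # Mean intersection-over-union across classes, the standard
    # semantic-segmentation metric (PASCAL VOC, ADE20K).
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

def rmse(pred_depth, gt_depth):
    # Root-mean-square error, the depth metric used for NYUv2.
    return float(np.sqrt(np.mean((pred_depth - gt_depth) ** 2)))
```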
Researcher Affiliation | Collaboration
Ivana Balažević, David Steiner, Nikhil Parthasarathy, Relja Arandjelović, Olivier J. Hénaff (Google DeepMind). *Equal contribution. Current affiliation: NYU CNS; work done while interning at Google DeepMind. Correspondence to {balazevic, davidsteiner, henaff}@google.com.
Pseudocode | No
The paper describes its methods using textual descriptions and mathematical equations (e.g., Equation 1 defines the cross-attention operation), but it does not include any explicitly labeled pseudocode or algorithm blocks.
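The row above references the paper's Equation 1 without reproducing it. As a rough, non-authoritative sketch of what a cross-attention decoder over retrieved (feature, label) pairs can look like (the temperature value reused from the Experiment Setup row below is an assumption, and this is not the paper's exact Equation 1):

```python
import numpy as np

def cross_attention_decode(query_feats, memory_feats, memory_labels, temperature=0.2):
    """Predict dense labels for query patches by attending over a
    memory bank of (feature, label) pairs taken from the prompt images.

    query_feats:   (Q, D) L2-normalized query patch features
    memory_feats:  (M, D) L2-normalized memory patch features
    memory_labels: (M, K) one-hot (or soft) labels per memory patch
    """
    sims = query_feats @ memory_feats.T        # (Q, M) cosine similarities
    attn = np.exp(sims / temperature)
    attn /= attn.sum(axis=1, keepdims=True)    # softmax over memory entries
    return attn @ memory_labels                # (Q, K) label predictions
```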
Open Source Code | No
The paper mentions using the 'open-source ScaNN library [29]' but does not state that the authors' own code for the Hummingbird model or their methodology is open source, nor does it provide a link to it. It refers to 'publicly available checkpoints' of other models in Table 1.
Open Datasets | Yes
We demonstrate the generality of Hummingbird representations through retrieval-based scene understanding on several downstream tasks (Section 4.1): semantic segmentation on PASCAL VOC [25] and ADE20K [83] with mean IoU (mIoU) as metric, and monocular depth estimation on NYUv2 [60] with root-mean-square error (RMSE) as metric. We pretrain the model for 300 epochs on ImageNet-1k or 100 epochs on ImageNet-22k using AdamW [47] with a batch size of 4096.
Dataset Splits | Yes
Given a prompt composed of training images from the downstream task and their corresponding labels {(x_i, y_i), i = 1, ..., N}, x_i ∈ R^{H×W×C}, our aim is to enable a pretrained image encoder f_θ to make predictions about a new image x from the test set. The paper uses standard, well-defined datasets like PASCAL VOC and ADE20K, which have established train/test/validation splits.
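As an illustration of how those established splits are commonly accessed (torchvision's VOCSegmentation is an assumption for this sketch; the paper does not name its data-loading stack):

```python
from torchvision import datasets

# Standard PASCAL VOC 2012 segmentation splits: prompts are built from
# training images, and predictions are made on held-out images.
train_set = datasets.VOCSegmentation(root="data", year="2012",
                                     image_set="train", download=True)
val_set = datasets.VOCSegmentation(root="data", year="2012",
                                   image_set="val", download=True)
```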
Hardware Specification | Yes
Evaluation was done on a single Nvidia A100 GPU per downstream task and takes approximately 15 minutes for PASCAL VOC, 25 minutes for ADE20K, and 30 minutes for NYUv2. We pretrain the model for 300 epochs on ImageNet-1k or 100 epochs on ImageNet-22k using AdamW [47] with a batch size of 4096, split across 128 Cloud TPU v3 workers for ImageNet-1k and 256 Cloud TPU v3 workers for ImageNet-22k.
Software Dependencies | No
The paper mentions using the 'open-source ScaNN library [29]' for efficient approximate nearest neighbor search, and provides a link to its documentation. However, it does not specify a version number for ScaNN itself, nor does it list versions for other software dependencies like programming languages (e.g., Python), deep learning frameworks (e.g., TensorFlow, PyTorch), or CUDA.
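For context, a minimal ScaNN usage sketch; the index parameters below are illustrative values in the style of the library's documentation, not settings from the paper:

```python
import numpy as np
import scann

# Build an approximate nearest-neighbor index over memory-bank features.
features = np.random.rand(100_000, 512).astype(np.float32)
features /= np.linalg.norm(features, axis=1, keepdims=True)

searcher = (
    scann.scann_ops_pybind.builder(features, 10, "dot_product")
    .tree(num_leaves=2000, num_leaves_to_search=100, training_sample_size=25_000)
    .score_ah(2, anisotropic_quantization_threshold=0.2)
    .reorder(100)
    .build()
)

queries = features[:5]  # query with a few stored vectors as a smoke test
neighbors, distances = searcher.search_batched(queries)
```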
Experiment Setup | Yes
We pretrain the model for 300 epochs on ImageNet-1k or 100 epochs on ImageNet-22k using AdamW [47] with a batch size of 4096... We update the online parameters θ with a cosine learning rate schedule with a base learning rate of 0.001, weight decay of 0.1 and gradient clipping with a maximum norm of 1. We update the target parameters ξ as an exponential moving average of the online parameters with a decay rate of 0.99. Preliminary analysis showed λ = 0.2 to work well across datasets, so we use it for all our experiments. Table 6 summarizes the hyperparameters used for NN evaluation throughout this work.
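A hedged PyTorch sketch of the optimization setup quoted above (AdamW, cosine schedule with base learning rate 0.001, weight decay 0.1, gradient clipping at norm 1, EMA target with decay 0.99); the model, loss, and step count are placeholders, not the paper's architecture or objective:

```python
import copy
import torch

model = torch.nn.Linear(512, 512)   # placeholder for the online encoder (params θ)
target = copy.deepcopy(model)       # target network (params ξ), updated as an EMA of θ
for p in target.parameters():
    p.requires_grad_(False)

opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.1)
total_steps = 10_000                # placeholder; the paper trains 100-300 epochs
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=total_steps)

for step in range(total_steps):
    loss = model(torch.randn(4, 512)).pow(2).mean()  # stand-in for the pretraining loss
    opt.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    opt.step()
    sched.step()
    with torch.no_grad():           # EMA update: ξ ← 0.99·ξ + 0.01·θ
        for tp, op in zip(target.parameters(), model.parameters()):
            tp.mul_(0.99).add_(op, alpha=0.01)
```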