Siamese Masked Autoencoders
Authors: Agrim Gupta, Jiajun Wu, Jia Deng, Fei-Fei Li
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Despite its conceptual simplicity, features learned via SiamMAE outperform state-of-the-art self-supervised methods on video object segmentation, pose keypoint propagation, and semantic part propagation tasks. |
| Researcher Affiliation | Academia | Agrim Gupta¹, Jiajun Wu¹, Jia Deng², Li Fei-Fei¹ (¹Stanford University, ²Princeton University) |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions building on 'the open-source implementation of MAEs (https://github.com/facebookresearch/mae)' but does not explicitly state that the source code for their own method, Siam MAE, is being released or provide a direct link to it. |
| Open Datasets | Yes | Models are pre-trained using Kinetics-400 [93] for self-supervised learning. [93] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017. |
| Dataset Splits | Yes | We evaluate the quality of learned representations for dense correspondence task using k-nearest neighbor inference on three downstream tasks: video object segmentation (DAVIS-2017 [95]), human pose propagation (JHMDB [96]) and semantic part propagation (VIP [97]). Following prior work [14–16], all tasks are formulated as video label propagation: given the ground-truth label for the initial frame, the goal is to predict the label for each pixel in future frames of a video. (A hedged sketch of this propagation step appears after the table.) |
| Hardware Specification | Yes | All our experiments are performed on 4 Nvidia Titan RTX GPUs for ViT-S/16 models, and on 8 Nvidia Titan RTX GPUs for ViT-S/8 models and ViT-B models. |
| Software Dependencies | No | The paper mentions using the 'AdamW' optimizer and building on 'the open-source implementation of MAEs (https://github.com/facebookresearch/mae)', but does not provide specific software versions for libraries like PyTorch, TensorFlow, or Python. |
| Experiment Setup | Yes | Models are pre-trained using Kinetics-400... SiamMAE takes as input pairs of randomly sampled frames (224 × 224) with a frame gap ranging from 4 to 48 frames... Training is done for 400 epochs for the ablation studies and for 2000 epochs for the results... We use the AdamW optimizer with a batch size of 2048. Additional details are provided in Table 4a, including learning rate 1.5e-4, weight decay 0.05, a cosine decay schedule, and 40 warmup epochs. (A minimal sketch of this recipe follows the table.) |
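
The "Experiment Setup" row quotes a concrete pre-training recipe. The following is a minimal PyTorch sketch of those quoted settings: frame pairs sampled with a random gap of 4 to 48 frames, AdamW with learning rate 1.5e-4 and weight decay 0.05, and 40 warmup epochs followed by cosine decay. The helper `sample_frame_pair` and the stand-in model are hypothetical illustrations; the paper does not release its code, so this is not the authors' implementation.

```python
import math
import random
import torch

def sample_frame_pair(video, min_gap=4, max_gap=48):
    """Pick two frames from a clip tensor [T, C, H, W] with a random temporal gap."""
    num_frames = video.shape[0]
    gap = random.randint(min_gap, min(max_gap, num_frames - 1))
    start = random.randint(0, num_frames - 1 - gap)
    return video[start], video[start + gap]

# Hyperparameters quoted in the "Experiment Setup" row (Table 4a of the paper).
base_lr = 1.5e-4
weight_decay = 0.05
warmup_epochs = 40
total_epochs = 2000  # 400 for the ablation studies

def lr_at_epoch(epoch):
    """Linear warmup for the first 40 epochs, then cosine decay to zero."""
    if epoch < warmup_epochs:
        return base_lr * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# Stand-in module: the real model is a ViT encoder with a cross-attention
# decoder, for which no official code is released.
model = torch.nn.Linear(768, 768)
optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr, weight_decay=weight_decay)

clip = torch.randn(16, 3, 224, 224)       # dummy 16-frame clip at 224 x 224
frame1, frame2 = sample_frame_pair(clip)  # gap capped at 15 for this short clip
```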
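
The "Dataset Splits" row describes evaluation as video label propagation with k-nearest-neighbor inference. Below is a hedged sketch of one propagation step between two frames under simplifying assumptions: the function name `propagate_labels`, k=7, and the softmax temperature are illustrative choices rather than the paper's reported settings, and the full protocol from prior work (multiple context frames, a restricted spatial window for matching) is omitted here.

```python
import torch
import torch.nn.functional as F

def propagate_labels(src_feats, src_labels, tgt_feats, k=7, temperature=0.07):
    """Propagate per-patch labels from a source frame to a target frame.

    src_feats:  [N, D] L2-normalized patch features of the labeled frame
    src_labels: [N, C] one-hot (or soft) per-patch labels
    tgt_feats:  [M, D] L2-normalized patch features of the frame to label
    Returns:    [M, C] predicted soft labels for the target frame
    """
    affinity = tgt_feats @ src_feats.T                       # [M, N] cosine similarities
    topk_vals, topk_idx = affinity.topk(k, dim=1)            # keep the k nearest neighbors
    weights = torch.softmax(topk_vals / temperature, dim=1)  # [M, k] weights over neighbors
    neighbor_labels = src_labels[topk_idx]                   # [M, k, C]
    return (weights.unsqueeze(-1) * neighbor_labels).sum(dim=1)

# Toy usage with random features (196 patches, 5 classes).
src = F.normalize(torch.randn(196, 384), dim=1)
tgt = F.normalize(torch.randn(196, 384), dim=1)
labels = F.one_hot(torch.randint(0, 5, (196,)), 5).float()
pred = propagate_labels(src, labels, tgt)  # [196, 5] soft label map
```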