PatchGame: Learning to Signal Mid-level Patches in Referential Games

Authors: Kamal Gupta, Gowthami Somepalli, Anubhav Gupta, Vinoj Yasanga Jayasundara Magalle Hewa, Matthias Zwicker, Abhinav Shrivastava

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We assess the success of PatchGame via qualitative and quantitative evaluations of each of the proposed components, and by demonstrating some practical applications. We use the top patches indicated by this model to classify ImageNet [16] images using a pre-trained Vision Transformer [19] and show that we can retain over 60% top-1 accuracy with just half of the image patches (see the patch-selection sketch after this table). Table 1 shows the Top-1 and Top-5 accuracy (%) obtained on ImageNet using the listener's vision module and the baseline approaches. Our results are shown in Table 2 (Section 4.5, Ablation study).
Researcher Affiliation | Academia | Kamal Gupta (kampta@umd.edu), Gowthami Somepalli (gowthami@umd.edu), Anubhav Gupta (anubhav@umd.edu), Vinoj Jayasundara (vinoj@umd.edu), Matthias Zwicker (zwicker@umd.edu), Abhinav Shrivastava (abhinav@cs.umd.edu), University of Maryland, College Park
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at https://kampta.github.io/patch-game.
Open Datasets | Yes | All the experiments are conducted on the training set of ImageNet [16], which has approximately 1.28 million images from 1000 classes. Further, we use the listener's vision module as a pre-trained network for the Pascal VOC dataset [21].
Dataset Splits | Yes | We create a training and validation split from the training set by leaving aside 5% of the images for validation (see the split sketch after this table).
Hardware Specification | No | The paper mentions "split over 4 GPUs" but does not specify the GPU model or any other hardware details.
Software Dependencies | No | The paper does not provide version numbers for software dependencies such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | We use a 2-layer MLP for φ_symb, a ResNet-9 [31] for φ_rank, a ResNet-18 [31] for φ_vision, and a small transformer encoder (hidden size = 192, 3 heads, 12 layers) for φ_text. All the experiments are conducted on the training set of ImageNet [16]... We create a training and validation split from the training set by leaving aside 5% of the images for validation. After obtaining the final set of hyper-parameters, we retrain on the entire training set for 100 epochs. We use Stochastic Gradient Descent (SGD) with momentum and cosine learning rate scheduling. In our experiments, we use an effective batch size of 512 (split over 4 GPUs)... We fix the learning rate at 0.0001... we are using a vocabulary V = 128 and patch size S = 32 in our base model. (A hedged sketch of this setup follows the table.)
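
The "half of the image patches" result above can be illustrated with a short sketch. The Python code below is a minimal illustration, not the authors' code: the patch scores are random placeholders standing in for the speaker's ranking module (φ_rank), and the final classification step with a pre-trained Vision Transformer is only indicated in a comment.

    # Minimal sketch (assumption: random scores stand in for PatchRank output).
    import torch

    S = 32                       # patch size from the paper's base model
    N = (224 // S) ** 2          # 49 patches per 224x224 image

    scores = torch.rand(1, N)                    # placeholder patch importances
    keep = scores.topk(N // 2, dim=1).indices    # indices of the top half

    img = torch.rand(1, 3, 224, 224)
    patches = img.unfold(2, S, S).unfold(3, S, S)        # (1, 3, 7, 7, S, S)
    patches = patches.reshape(1, 3, N, S * S)            # (1, 3, 49, S*S)
    patches = patches.permute(0, 2, 1, 3).reshape(1, N, -1)
    kept = torch.gather(
        patches, 1, keep.unsqueeze(-1).expand(-1, -1, patches.size(-1))
    )
    # `kept` (1, N//2, 3*S*S) would then be embedded and fed to a pre-trained
    # Vision Transformer in place of the full patch sequence.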
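
For the 95/5 split noted under Dataset Splits, here is a minimal sketch assuming a torchvision ImageFolder layout for the ImageNet training set; the paper does not say how the 5% is sampled, so a seeded random split is assumed.

    # Hypothetical split procedure; the directory path and seed are assumptions.
    import torch
    from torch.utils.data import random_split
    from torchvision import datasets, transforms

    train_set = datasets.ImageFolder("imagenet/train", transform=transforms.ToTensor())
    n_val = int(0.05 * len(train_set))               # 5% held out for validation
    train_subset, val_subset = random_split(
        train_set,
        [len(train_set) - n_val, n_val],
        generator=torch.Generator().manual_seed(0),  # fixed seed for reproducibility
    )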
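
Finally, a hedged sketch of the optimization setup quoted under Experiment Setup: SGD with momentum, a cosine learning-rate schedule, learning rate 0.0001, and 100 epochs. The momentum value and the stand-in model (a transformer encoder matching the reported φ_text dimensions) are assumptions, since the quoted excerpt does not state them.

    # Sketch of the training-loop skeleton; not the released implementation.
    import torch
    import torch.nn as nn

    # Stand-in for the listener's text module: hidden size 192, 3 heads, 12 layers.
    model = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=192, nhead=3, batch_first=True),
        num_layers=12,
    )

    optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)  # momentum assumed
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

    for epoch in range(100):
        # ... one pass over the ImageNet training split, effective batch size 512 ...
        optimizer.step()      # placeholder for the per-batch update
        scheduler.step()      # cosine schedule stepped once per epoch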