RLIP: Relational Language-Image Pre-training for Human-Object Interaction Detection
Authors: Hangjie Yuan, Jianwen Jiang, Samuel Albanie, Tao Feng, Ziyuan Huang, Dong Ni, Mingqian Tang
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4 Experiments, 4.1 Results and Analysis, 4.2 Ablation studies and analysis. Through extensive experiments, we demonstrate the benefits of these contributions, collectively termed RLIP-ParSe, for improved zero-shot, few-shot and fine-tuning HOI detection performance as well as increased robustness to learning from noisy annotations. |
| Researcher Affiliation | Collaboration | Zhejiang University, Alibaba Group, University of Cambridge, National University of Singapore |
| Pseudocode | No | The paper describes methods in text, but no explicitly labeled 'Pseudocode' or 'Algorithm' blocks are present. |
| Open Source Code | No | Code will be available at https://github.com/JacobYuan7/RLIP. |
| Open Datasets | Yes | Datasets. We use the Visual Genome (VG) [32] dataset for RLIP. ... For downstream tasks, we conduct experiments on HICO-DET [5] and V-COCO [14]. |
| Dataset Splits | Yes | HICO-DET contains 37,536 training images and 9,515 testing images, annotated with 600 HOI triplets derived from combinations of 117 verbs and 80 objects. We evaluate under the Default setting. V-COCO comprises 2,533 training images, 2,876 validation images and 4,946 testing images annotated with 24 interactions and 80 objects. |
| Hardware Specification | Yes | Experiments are conducted on 8 Tesla V100 GPU cards with a minibatch size of 32. |
| Software Dependencies | No | The paper mentions software components such as RoBERTa and a Transformer encoder, and architectures such as DETR and DDETR, but does not provide version numbers for these or for underlying software dependencies (e.g., Python, PyTorch, or TensorFlow versions). |
| Experiment Setup | Yes | Implementation details. ... For Parallel Entity Detection and Sequential Relation Inference, 3 decoding layers are used. The number of queries N_Q is set to 100 during pre-training and 64 during fine-tuning (following [68]). γ in the Focal loss is set to 2 following [57, 68]. N_L in LSE is set to 500 to ensure computational tractability. η in RPL is set to 0.3. For pre-training and fine-tuning, the initial learning rate (LR) of the image and text encoders is set to 1e-5, while all other modules are set to 1e-4. ... Experiments are conducted on 8 Tesla V100 GPU cards with a minibatch size of 32. (A hedged configuration sketch of these settings follows the table.) |
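
The Experiment Setup row reports two learning-rate groups (1e-5 for the image and text encoders, 1e-4 for all other modules) and a focal loss with γ = 2. The sketch below illustrates how such a setup is typically wired in PyTorch. It is a minimal sketch, not the authors' released code: the module names `image_encoder` and `text_encoder`, the choice of AdamW, and the focal-loss α weighting are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def build_optimizer(model, encoder_lr=1e-5, other_lr=1e-4):
    """Group parameters so the image and text encoders train at 1e-5
    while every other module trains at 1e-4, per the reported setup.
    Assumes the encoders are registered under the (hypothetical) names
    `image_encoder` and `text_encoder`."""
    encoder_params, other_params = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        if name.startswith(("image_encoder", "text_encoder")):
            encoder_params.append(param)
        else:
            other_params.append(param)
    return torch.optim.AdamW([
        {"params": encoder_params, "lr": encoder_lr},
        {"params": other_params, "lr": other_lr},
    ])

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Binary focal loss with gamma = 2, the value quoted from the paper.
    alpha = 0.25 is an assumed default; the report does not state it."""
    prob = logits.sigmoid()
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = prob * targets + (1 - prob) * (1 - targets)        # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class-balance weight
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```

Note that with 8 V100 GPUs and a global minibatch of 32, each GPU would process 4 images per step under standard data parallelism.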