Disentangled Counterfactual Learning for Physical Audiovisual Commonsense Reasoning
Authors: Changsheng Lv, Shuai Zhang, Yapeng Tian, Mengshi Qi, Huadong Ma
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In experiments, we show that our proposed method improves baseline methods and achieves state-of-the-art performance. |
| Researcher Affiliation | Academia | Changsheng Lv (1,2), Shuai Zhang (1,2), Yapeng Tian (3), Mengshi Qi (1,2; corresponding author), and Huadong Ma (1,2). 1: Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia; 2: Beijing University of Posts and Telecommunications; 3: Department of Computer Science, The University of Texas at Dallas. {lvchangsheng, zshuai, qms, mhd}@bupt.edu.cn, yapeng.tian@utdallas.edu |
| Pseudocode | No | The paper describes the proposed approach in text and diagrams but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our source code is available at https://github.com/Andy20178/DCL. |
| Open Datasets | Yes | The Physical Audiovisual Common Sense Reasoning Dataset (PACS) [2] is a collection of 13,400 question-answer pairs designed for testing physical commonsense reasoning capabilities. |
| Dataset Splits | Yes | Following [2], we divide PACS into 11,044/1,192/1,164 as train/val/test sets, which contain 1,224/150/152 objects respectively. We partitioned the PACS-Material subset into 3,460/444/445 for train/val/test under the same object distribution as PACS. |
| Hardware Specification | Yes | We implement our proposed model with PyTorch on two NVIDIA RTX 3090 GPUs. |
| Software Dependencies | No | The paper mentions PyTorch but does not specify its version number or other software dependencies with versions. |
| Experiment Setup | Yes | Specifically, we downsampled each video to T = 8 frames during pre-processing and set the feature dimension as d = 256. In the Disentangled Sequence Encoder, we used a hidden layer size of 256 for Bi-LSTM. During optimization, we set the batch size as 64, which consisted of 64 video pairs and the corresponding questions. In the Counterfactual Learning Module, τ = 2 and k = 5 were used when calculating similarities and constructing the physical knowledge relationships. |
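
The experiment-setup row above reports the concrete hyperparameters (T = 8 frames, d = 256, Bi-LSTM hidden size 256, batch size 64, τ = 2, k = 5). The following is a minimal PyTorch sketch that wires those reported values into a bidirectional LSTM sequence encoder; the constants come from the quoted paper text, while the module name `SequenceEncoder` and the overall structure are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

# Values quoted from the paper's experiment setup; names below are illustrative.
NUM_FRAMES = 8     # T: frames sampled per video during pre-processing
FEATURE_DIM = 256  # d: feature dimension
HIDDEN_SIZE = 256  # Bi-LSTM hidden layer size in the sequence encoder
BATCH_SIZE = 64    # each batch holds 64 video pairs and their questions
TAU = 2            # temperature used when calculating similarities
TOP_K = 5          # neighbours kept when building physical knowledge relationships


class SequenceEncoder(nn.Module):
    """Hypothetical Bi-LSTM over per-frame features of shape (B, T, d)."""

    def __init__(self, feature_dim: int = FEATURE_DIM, hidden_size: int = HIDDEN_SIZE):
        super().__init__()
        self.bilstm = nn.LSTM(
            input_size=feature_dim,
            hidden_size=hidden_size,
            batch_first=True,
            bidirectional=True,
        )

    def forward(self, frame_features: torch.Tensor) -> torch.Tensor:
        # Output shape is (B, T, 2 * hidden_size) due to bidirectionality.
        output, _ = self.bilstm(frame_features)
        return output


if __name__ == "__main__":
    encoder = SequenceEncoder()
    dummy_frames = torch.randn(BATCH_SIZE, NUM_FRAMES, FEATURE_DIM)
    print(encoder(dummy_frames).shape)  # torch.Size([64, 8, 512])
```

For the authors' actual disentangled encoder and counterfactual learning module, see the released repository linked in the Open Source Code row.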