Disentangled Counterfactual Learning for Physical Audiovisual Commonsense Reasoning

Authors: Changsheng Lv, Shuai Zhang, Yapeng Tian, Mengshi Qi, Huadong Ma

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In experiments, we show that our proposed method improves baseline methods and achieves state-of-the-art performance.
Researcher Affiliation | Academia | Changsheng Lv (1,2), Shuai Zhang (1,2), Yapeng Tian (3), Mengshi Qi (1,2, corresponding author), and Huadong Ma (1,2). (1) Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia; (2) Beijing University of Posts and Telecommunications; (3) Department of Computer Science, The University of Texas at Dallas. {lvchangsheng, zshuai, qms, mhd}@bupt.edu.cn, yapeng.tian@utdallas.edu
Pseudocode | No | The paper describes the proposed approach in text and diagrams but does not include structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our source code is available at https://github.com/Andy20178/DCL.
Open Datasets | Yes | The Physical Audiovisual Common Sense Reasoning Dataset (PACS) [2] is a collection of 13,400 question-answer pairs designed for testing physical commonsense reasoning capabilities.
Dataset Splits | Yes | Following [2], we divide PACS into 11,044/1,192/1,164 as train/val/test sets, which contain 1,224/150/152 objects respectively. We partitioned the PACS-Material subset into 3,460/444/445 for train/val/test under the same object distribution as PACS.
Hardware Specification | Yes | We implement our proposed model with PyTorch on two NVIDIA RTX 3090 GPUs.
Software Dependencies | No | The paper mentions 'PyTorch' but does not specify its version number or other software dependencies with versions.
Experiment Setup | Yes | Specifically, we downsampled each video to T = 8 frames during pre-processing and set the feature dimension as d = 256. In the Disentangled Sequence Encoder, we used a hidden layer size of 256 for the Bi-LSTM. During optimization, we set the batch size as 64, which consisted of 64 video pairs and the corresponding questions. In the Counterfactual Learning Module, τ = 2 and k = 5 were used when calculating similarities and constructing the physical knowledge relationships.
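The split counts in the Dataset Splits row can be cross-checked against the 13,400 question-answer pairs reported in the Open Datasets row. Below is a minimal sanity-check sketch; the constant names are hypothetical and not taken from the authors' repository.

```python
# Hypothetical constants holding the split sizes quoted in the Dataset Splits row.
PACS_QA_SPLITS = {"train": 11_044, "val": 1_192, "test": 1_164}      # question-answer pairs
PACS_OBJECT_SPLITS = {"train": 1_224, "val": 150, "test": 152}       # objects per split
PACS_MATERIAL_SPLITS = {"train": 3_460, "val": 444, "test": 445}     # PACS-Material QA pairs

# The QA-pair splits sum to the 13,400 pairs quoted in the Open Datasets row.
assert sum(PACS_QA_SPLITS.values()) == 13_400
print(sum(PACS_MATERIAL_SPLITS.values()))  # 4349 pairs in the PACS-Material subset
```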
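The Experiment Setup row fixes the main hyperparameters (T = 8 frames, d = 256, Bi-LSTM hidden size 256, batches of 64 video pairs, τ = 2, k = 5). The sketch below shows one way these values could fit together in PyTorch; the class and function names (SequenceEncoder, topk_similarities) and the temperature-scaled cosine similarity are illustrative assumptions, not the authors' implementation, which lives in the repository linked above.

```python
# Minimal sketch (not the authors' code; see https://github.com/Andy20178/DCL for the
# real implementation) of how the quoted hyperparameters could fit together.
import torch
import torch.nn as nn
import torch.nn.functional as F

T, D = 8, 256          # frames per video, feature dimension
HIDDEN = 256           # Bi-LSTM hidden size
TAU, K = 2.0, 5        # temperature and neighbour count quoted above
BATCH = 64             # 64 video pairs per batch

class SequenceEncoder(nn.Module):
    """Toy stand-in for a sequence encoder with the quoted Bi-LSTM size (hypothetical)."""
    def __init__(self):
        super().__init__()
        self.bilstm = nn.LSTM(D, HIDDEN, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * HIDDEN, D)   # fold both directions back to d = 256

    def forward(self, frames):                 # frames: (batch, T, D)
        out, _ = self.bilstm(frames)
        return self.proj(out).mean(dim=1)      # one d-dimensional embedding per video

def topk_similarities(video_emb):
    """Temperature-scaled cosine similarities and the k nearest neighbours,
    a generic way to sparsify relationships between videos (hypothetical)."""
    z = F.normalize(video_emb, dim=-1)
    sim = torch.softmax(z @ z.t() / TAU, dim=-1)          # (batch, batch)
    neighbours = sim.topk(K + 1, dim=-1).indices[:, 1:]   # drop the self-match
    return sim, neighbours

encoder = SequenceEncoder()
frames = torch.randn(BATCH, T, D)              # stand-in for pre-extracted frame features
emb = encoder(frames)
sim, neighbours = topk_similarities(emb)
print(emb.shape, neighbours.shape)             # torch.Size([64, 256]) torch.Size([64, 5])
```

Under these assumptions, dividing the similarities by τ = 2 before the softmax flattens the distribution, and keeping only the k = 5 most similar videos is one plausible way to construct the sparse physical knowledge relationships mentioned in the quote.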