Beyond Accuracy: Ensuring Correct Predictions With Correct Rationales
Authors: Tang Li, Mengmeng Ma, Xi Peng
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments and ablation studies demonstrate that our model outperforms state-of-the-art models by up to 10.1% in prediction accuracy across a wide range of tasks. Furthermore, our method significantly improves the model's rationale correctness, improving localization by 7.5% and disentanglement by 36.5%. Our dataset, source code, and pretrained weights: https://github.com/deep-real/DCP |
| Researcher Affiliation | Academia | Department of Computer & Information Science, University of Delaware |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our dataset, source code, and pretrained weights: https://github.com/deep-real/DCP |
| Open Datasets | Yes | Evaluation datasets: We validate the prediction correctness of the models on image classification and image-text retrieval tasks. For image classification (zero-shot, linear probe), experiments are carried out on nine benchmark datasets, including CUB [17], Caltech101 [58], Oxford Pets [59], Food101 [60], SUN397 [61], Stanford Cars [62], DTD [63], CIFAR-10 [64], and CIFAR-100 [64]. For retrieval, we conduct experiments on Flickr30K [65] and MSCOCO [66]. To evaluate the correctness of rationales, we evaluate the models' rationale localizability on CUB-Part [67] and PartImageNet [68], which provide ground-truth segmentation masks of object parts... |
| Dataset Splits | Yes | More details can be found in Appendix D. (Appendix D, Table 11 lists dataset sizes and the NeurIPS checklist Q6 states: "The paper details all aspects of the experimental settings, including data splits.") |
| Hardware Specification | Yes | The NeurIPS checklist Q8 states: "The paper adequately details the computational resources required for each experiment, including the types of compute workers (CPU or GPU), memory specifications, and execution times." |
| Software Dependencies | No | The paper mentions specific models and optimizers like "CLIP-ViT architectures [37]" and "AdamW [69] optimizer" and that GPT-4 was used for data generation, but it does not specify version numbers for general software libraries or frameworks (e.g., PyTorch version, Python version). |
| Experiment Setup | Yes | We follow the same architecture design as CLIP [1] for ViT-B/32. The input resolution of the image encoder is 224×224 and the maximum context length of the text encoder is 77. We train our model using an AdamW [69] optimizer and a cosine learning rate scheduler with a linear warmup. Specifically, the learning rate linearly increases from 0 to the peak value within 10% of the total steps, and then decreases with a cosine annealing strategy. The learning rate is set to 5e-7 and we train the model for eight epochs. (A sketch of this schedule appears below.) |
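
The warmup-then-cosine schedule quoted above maps directly onto a standard PyTorch `LambdaLR`. Below is a minimal sketch, assuming PyTorch: the model and `total_steps` are hypothetical placeholders (the paper trains for eight epochs, so the real step count depends on dataset and batch size), while the peak learning rate of 5e-7 and the 10% warmup fraction are taken from the setup quoted in the table.

```python
import math
import torch

# Hypothetical placeholders: the paper's actual model and training length
# are not given here, so a generic module and step count stand in.
model = torch.nn.Linear(512, 512)
total_steps = 10_000
warmup_steps = int(0.10 * total_steps)  # warmup spans 10% of total steps

# AdamW with the peak learning rate of 5e-7 reported in the paper.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-7)

def lr_lambda(step: int) -> float:
    """Linear warmup from 0 to the peak LR, then cosine annealing to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(total_steps):
    # ... forward pass and loss.backward() would go here ...
    optimizer.step()
    scheduler.step()
```

Stepping the scheduler once per optimizer step, rather than once per epoch, matches the "within 10% of the total steps" wording of the warmup description.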