Tabular Insights, Visual Impacts: Transferring Expertise from Tables to Images
Authors: Jun-Peng Jiang, Han-Jia Ye, Leye Wang, Yang Yang, Yuan Jiang, De-Chuan Zhan
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we compare CHARMS with cross-modal transfer methods on several datasets. The analysis experiments and ablations verify the effectiveness of our method. |
| Researcher Affiliation | Academia | 1School of Artificial Intelligence, Nanjing University, Nanjing, China. 2National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China. 3Key Lab of High Confidence Software Technologies (Peking University), Ministry of Education & School of Computer Science, Peking University, Beijing, China. 4School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China. |
| Pseudocode | No | The paper does not contain any pseudocode blocks or clearly labeled algorithm sections. |
| Open Source Code | No | The paper does not provide any explicit statements about open-sourcing the code for the described methodology, nor does it include a link to a code repository. |
| Open Datasets | Yes | Six datasets are used in the experiments: Data Visual Marketing (DVM) (Huang et al., 2022); SUN Attribute (Patterson et al., 2014), whose table modality is used to help images more accurately predict whether a scene is an open space, a binary classification task; CelebA (Liu et al., 2015), short for CelebFaces Attributes, a celebrity face attribute dataset; PetFinder-adoption, from a Kaggle competition whose task is to predict the speed at which a pet is adopted, a five-class classification task; PetFinder-pawpularity, also from a Kaggle competition, whose task is to predict the popularity of a pet from its profile and photo; and Avito, a challenge to predict demand for an online advertisement based on its full description, its context, and historical demand for similar ads in similar contexts. |
| Dataset Splits | Yes | We pair this tabular data with a single random image from each advertisement, yielding a dataset of 70,580 train pairs, 17,645 validation pairs, and 88,226 test pairs. [...] We use 8:1:1 to divide the training set, validation set, and testing set. (A minimal split sketch appears after the table.) |
| Hardware Specification | No | The paper mentions conducting experiments "with a single GPU" and that "Our method uses 8 GB of memory," but it does not specify the exact model of the GPU or other detailed hardware specifications. |
| Software Dependencies | No | The paper mentions using "PyTorch" and "FT-Transformer" but does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | Specifically, the batch size k is searched in {32, 64, 128} and the learning rate is searched in {1e-5, 5e-5, 1e-4, 5e-4, 1e-3, 5e-3}. [...] For FT-Transformer, the number of Transformer blocks is set to 2. We use the K-Means method to cluster the representations obtained by ResNet50, and n_cluster is 40. Embedding dimension E is set according to the data distribution. The Adam optimizer with weight decay is used to train the models. We choose to update the cost matrix every 5 epochs, which ensures that the model learns increasingly accurate channel-attribute correspondences, allowing the tabular data to guide the image data with increasing precision. (A hedged training-setup sketch appears after the table.) |
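For the Dataset Splits row, here is a minimal sketch of an 8:1:1 shuffle-and-split. The function name, the `(image, table_row, label)` pair structure, and the seeded shuffle are illustrative assumptions, not the authors' released code; the Avito pair counts quoted above come from the paper's own pairing procedure.

```python
import random

def split_8_1_1(pairs, seed=0):
    """Shuffle (image, table_row, label) pairs and split them 8:1:1
    into train / validation / test sets, as the paper describes."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    pairs = list(pairs)
    rng.shuffle(pairs)
    n_train = int(0.8 * len(pairs))
    n_val = int(0.1 * len(pairs))
    return (pairs[:n_train],                      # 80% train
            pairs[n_train:n_train + n_val],       # 10% validation
            pairs[n_train + n_val:])              # 10% test
```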
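For the Experiment Setup row, here is a hedged sketch of the reported configuration: the {32, 64, 128} × {1e-5, ..., 5e-3} hyperparameter grid, K-Means with 40 clusters over ResNet50 channel representations, Adam with weight decay, and a cost-matrix refresh every 5 epochs. The model interface, the `update_cost_matrix` callback, the weight-decay value, and the epoch count are assumptions, since the paper releases no code.

```python
import itertools
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

# Search grid quoted in the Experiment Setup row.
BATCH_SIZES = [32, 64, 128]
LEARNING_RATES = [1e-5, 5e-5, 1e-4, 5e-4, 1e-3, 5e-3]

def cluster_channels(channel_feats, n_clusters=40, seed=0):
    """K-Means over ResNet50 channel representations (n_cluster = 40,
    per the paper). `channel_feats` is an (n_channels, dim) array."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return km.fit_predict(channel_feats)

def train_one_config(model, loader, lr, epochs, update_cost_matrix):
    """One grid point: Adam with weight decay, refreshing the
    channel-attribute cost matrix every 5 epochs as reported.
    The weight-decay value and callback are placeholders."""
    opt = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=1e-4)
    criterion = nn.CrossEntropyLoss()
    for epoch in range(epochs):
        if epoch % 5 == 0:
            update_cost_matrix(model)   # paper's matching step (not public)
        for images, tabular, labels in loader:
            opt.zero_grad()
            criterion(model(images, tabular), labels).backward()
            opt.step()

# Grid search: build a loader with batch size bs, call train_one_config,
# and keep whichever configuration scores best on the validation split.
for bs, lr in itertools.product(BATCH_SIZES, LEARNING_RATES):
    print(f"would train with batch_size={bs}, lr={lr}")
```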