CHEF: Cross-modal Hierarchical Embeddings for Food Domain Retrieval
Authors: Hai X. Pham, Ricardo Guerrero, Vladimir Pavlovic, Jiatong Li
AAAI 2021, pp. 2423-2430
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Our experiments show that by making use of efficient tree-structured Long Short-Term Memory as the text encoder in our computational cross-modal retrieval framework, we are not only able to identify the main ingredients and cooking actions in the recipe descriptions without explicit supervision, but we can also learn more meaningful feature representations of food recipes, appropriate for challenging cross-modal retrieval and recipe adaption tasks." and "Experiments: In this section we will use L, T, G, and S as shorthand for LSTM, Tree-LSTM, GRU and Set (Zaheer et al. 2017), respectively." (a minimal Tree-LSTM cell sketch follows the table) |
| Researcher Affiliation | Collaboration | 1 Samsung AI Center, Cambridge; 2 Department of Computer Science, Rutgers University. {hai.xuanpham, r.guerrero, v.pavlovic}@samsung.com, jl2312@rutgers.edu |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Source code of our proposed method is available at https://github.com/haixpham/CHEF. |
| Open Datasets | Yes | During the preparation of the work presented here, all experiments were conducted using data from Recipe1M (R1M) (Salvador et al. 2017; Marín et al. 2019). |
| Dataset Splits | Yes | Data is split into 70% train, 15% validation and 15% test sets. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running experiments, such as GPU or CPU models, or memory specifications. |
| Software Dependencies | No | The paper mentions various models and architectures like 'word2vec model' and 'ResNet50', but does not specify software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | "We empirically set ϵ to 0.3 by cross-validation." and "Finally, h is projected to the shared space by three fully connected (FC) layers, each of dimensionality 1024, to yield the latent text features p ∈ R^1024." and "ResNet50 (He et al. 2016) pre-trained on ImageNet is used as the backbone for feature extraction, where the last FC layer is replaced with three consecutive FC layers (similar to the recipe encoder) to project the extracted features into the shared latent space to get q ∈ R^1024." (a sketch of this image-encoder head follows the table) |
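
The text encoder quoted under Research Type is a tree-structured LSTM. As a reference point, below is a minimal child-sum Tree-LSTM cell in the style of Tai et al. (2015), written in PyTorch. The class name, dimensions, and the choice of the child-sum variant are assumptions for illustration; the paper's exact encoder configuration is not reproduced in the table (full details are in the released code at https://github.com/haixpham/CHEF).

```python
import torch
import torch.nn as nn

class ChildSumTreeLSTMCell(nn.Module):
    """Minimal child-sum Tree-LSTM cell (Tai et al. 2015).
    Sketch only; CHEF's actual encoder may differ in detail."""

    def __init__(self, in_dim: int, hid_dim: int):
        super().__init__()
        # Input, output, and cell-update gates share one projection each.
        self.W_iou = nn.Linear(in_dim, 3 * hid_dim)
        self.U_iou = nn.Linear(hid_dim, 3 * hid_dim, bias=False)
        # A separate forget gate is computed per child.
        self.W_f = nn.Linear(in_dim, hid_dim)
        self.U_f = nn.Linear(hid_dim, hid_dim, bias=False)

    def forward(self, x, child_h, child_c):
        # x: (in_dim,) node input; child_h, child_c: (num_children, hid_dim).
        h_sum = child_h.sum(dim=0)  # child-sum aggregation over children
        i, o, u = torch.chunk(self.W_iou(x) + self.U_iou(h_sum), 3, dim=-1)
        i, o, u = torch.sigmoid(i), torch.sigmoid(o), torch.tanh(u)
        f = torch.sigmoid(self.W_f(x) + self.U_f(child_h))  # per-child gates
        c = i * u + (f * child_c).sum(dim=0)
        h = o * torch.tanh(c)
        return h, c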
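For a leaf node, `child_h` and `child_c` can be empty `(0, hid_dim)` tensors, in which case the sums reduce to zero and the cell behaves like a standard LSTM step with no prior state.

The Experiment Setup row describes the image encoder: an ImageNet-pretrained ResNet50 whose final FC layer is replaced by three consecutive FC layers projecting into the 1024-d shared space. Below is a sketch of that head under stated assumptions: the ReLU activations between the FC layers and the 2048-to-1024 width of the first layer are inferred, not quoted from the paper, and the `weights="IMAGENET1K_V1"` argument assumes torchvision >= 0.13.

```python
import torch.nn as nn
from torchvision.models import resnet50

class ImageEncoder(nn.Module):
    """ResNet50 backbone with the classifier head replaced by three FC
    layers that project into the 1024-d shared latent space (sketch)."""

    def __init__(self, embed_dim: int = 1024):
        super().__init__()
        backbone = resnet50(weights="IMAGENET1K_V1")  # ImageNet-pretrained
        backbone.fc = nn.Identity()   # drop the original 1000-way classifier
        self.backbone = backbone      # now outputs 2048-d pooled features
        self.proj = nn.Sequential(    # three consecutive FC layers
            nn.Linear(2048, embed_dim), nn.ReLU(inplace=True),
            nn.Linear(embed_dim, embed_dim), nn.ReLU(inplace=True),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, images):        # images: (B, 3, H, W)
        return self.proj(self.backbone(images))  # q in R^1024
```

Per the quoted setup, the recipe encoder ends in an analogous three-FC-layer projection, so both modalities land in the same 1024-d space where retrieval is performed.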