An Extensible Multi-modal Multi-task Object Dataset with Materials

Authors: Trevor Scott Standley, Ruohan Gao, Dawn Chen, Jiajun Wu, Silvio Savarese

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present EMMa, an Extensible, Multimodal dataset of Amazon product listings that contains rich Material annotations. ... EMMa offers a new benchmark for multi-task learning in computer vision and NLP, and allows practitioners to efficiently add new tasks and object attributes at scale. ... Table 1 shows the performance of all of our models on the test set.
Researcher Affiliation | Collaboration | Trevor Standley (1), Ruohan Gao (1), Dawn Chen (2), Jiajun Wu (1), Silvio Savarese (1,3); tstand@cs.stanford.edu; (1) Stanford University, (2) Google Inc., (3) Salesforce Research
Pseudocode | No | The paper describes procedures and processes, but does not include any clearly labeled pseudocode or algorithm blocks. It refers to 'See the code for details' regarding the decoder architecture, implying such details are external to the paper.
Open Source Code | No | The paper states 'See the code for details.' for some aspects (e.g., decoder architecture), but it does not explicitly state that the code for the described methodology is open-sourced, nor does it provide a direct link to a repository.
Open Datasets | Yes | We present EMMa, an Extensible, Multimodal dataset of Amazon product listings that contains rich Material annotations. It contains more than 2.8 million objects... We will host not only the core dataset, including our manually added properties, but also any properties that the community develops for EMMa and wants to share.
Dataset Splits | Yes | The entire filtering process resulted in a dataset with 2,883,698 instances. This was partitioned into 2,806,806 training instances, 26,535 validation instances, 26,941 test instances, and 23,416 calibration instances.
Hardware Specification | Yes | Text and vision models were trained on 8x Titan RTX. Each took about three days to train. Everything models were trained on a 1x Titan RTX workstation in about 12 hours.
Software Dependencies | No | All models were trained using PyTorch and the AdamW optimizer. While the software used is mentioned, specific version numbers for PyTorch, AdamW, or any other libraries are not provided, preventing full reproducibility of the software environment.
Experiment Setup | Yes | All models used a batch size of 512. All gradients were clipped to 32. Task losses were combined using a weighted average; the weights are in Table 4. Text and vision models use an encoder-decoder architecture. For both, the encoder is frozen during the first epoch. ... Each model type has its own learning rate schedule and weight decay, which are listed in Table 5.
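
For illustration, below is a minimal, hedged PyTorch sketch of the training configuration summarized in the Experiment Setup row: batch size 512, AdamW, gradient clipping at 32, task losses combined with a weighted average, and an encoder frozen during the first epoch. The model, data, task names, loss weights, learning rate, and weight decay are placeholders introduced here for the example; the paper's actual architectures, loss weights (Table 4), and schedules (Table 5) differ.

```python
# Hedged sketch of the reported training setup. All names and hyperparameter
# values marked "placeholder" are assumptions, not the paper's actual choices.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

BATCH_SIZE = 512          # reported batch size
GRAD_CLIP = 32.0          # reported gradient clipping value
TASK_WEIGHTS = {"material": 1.0, "category": 1.0}  # placeholder weights (see Table 4)

class EncoderDecoderModel(nn.Module):
    """Toy stand-in for the paper's encoder-decoder models (placeholder)."""
    def __init__(self, in_dim=128, hidden=64, n_materials=10, n_categories=5):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.decoders = nn.ModuleDict({
            "material": nn.Linear(hidden, n_materials),
            "category": nn.Linear(hidden, n_categories),
        })

    def forward(self, x):
        h = self.encoder(x)
        return {task: head(h) for task, head in self.decoders.items()}

# Dummy tensors standing in for EMMa features and per-task labels (placeholder).
features = torch.randn(2048, 128)
material_labels = torch.randint(0, 10, (2048,))
category_labels = torch.randint(0, 5, (2048,))
loader = DataLoader(TensorDataset(features, material_labels, category_labels),
                    batch_size=BATCH_SIZE, shuffle=True)

model = EncoderDecoderModel()
# Learning rate and weight decay are placeholders; per-model values are in Table 5.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-2)
criterion = nn.CrossEntropyLoss()

for epoch in range(2):
    # The encoder is frozen during the first epoch, then unfrozen.
    for p in model.encoder.parameters():
        p.requires_grad = epoch > 0

    for x, y_material, y_category in loader:
        optimizer.zero_grad()
        outputs = model(x)
        losses = {"material": criterion(outputs["material"], y_material),
                  "category": criterion(outputs["category"], y_category)}
        # Combine task losses with a weighted average, as described in the paper.
        total = sum(TASK_WEIGHTS[t] * losses[t] for t in losses) / sum(TASK_WEIGHTS.values())
        total.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), GRAD_CLIP)
        optimizer.step()
```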