Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
PatentLMM: Large Multimodal Model for Generating Descriptions for Patent Figures
Authors: Shreya Shukla, Nakul Sharma, Manish Gupta, Anand Mishra
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive experiments demonstrate that training a vision encoder specifically designed for patent figures significantly boosts the performance, generating coherent descriptions compared to fine-tuning similar-sized off-the-shelf multimodal models. PATENTDESC-355K and PATENTLMM pave the way for automating the understanding of patent figures, enabling efficient knowledge sharing and faster drafting of patent documents. |
| Researcher Affiliation | Collaboration | ¹Indian Institute of Technology Jodhpur, India; ²Microsoft, India |
| Pseudocode | No | The paper describes methods and loss formulations in text and provides an architecture diagram (Figure 2), but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Project Page: https://vl2g.github.io/projects/PatentLMM/ |
| Open Datasets | Yes | We introduce PATENTDESC-355K, a novel large-scale dataset tailored for generating descriptions for patent figures. Our proposed dataset comprises 355K patent figures sourced from Google Patents, with each image accompanied by its brief and detailed descriptions extracted from the corresponding patent documents. The dataset is available for download on our project website: https://vl2g.github.io/projects/PatentLMM/ |
| Dataset Splits | Yes | Table 1: PATENTDESC-355K: Dataset Statistics. Number of Images: Train 320,717; Validation 17,286; Test 17,336. Number of Unique Patents: Train 50,448; Validation 8,027; Test 7,964. During the creation of training, validation and test set splits, we ensure absolute exclusivity between patents in the train set and those in the combined validation and test sets, to enable robust out-of-sample evaluation. To achieve this, we randomly sampled 12.6K patents from 60K, representing 82.5K images. From this isolated subset of images, we sample 17K images each for the val and test set, and discard the remaining images. |
| Hardware Specification | Yes | The PATENTMME model is trained on 8 V100 GPUs, with an effective batch size of 64 and Adam (Kingma and Ba 2014) optimizer. We train our PATENTLMM with an effective batch size of 192 on 3 A100 GPUs (40 GB). |
| Software Dependencies | No | The paper mentions several tools and models, such as 'Tesseract OCR engine (Kay 2007)', 'BPE tokenizer (Shibata et al. 1999)', 'LayoutLMv3 (Huang et al. 2022)', 'Faster-RCNN (Ren et al. 2015)', 'OCR-VQGAN (Rodriguez et al. 2023)', 'LLaMA-2 7B model', 'LoRA (Hu et al. 2022)', and 'Adam (Kingma and Ba 2014) optimizer'. However, none of these is accompanied by a specific version number. |
| Experiment Setup | Yes | PATENTMME: ...the weights of the multimodal transformer remain frozen and only the loss heads are trained for 1 epoch with a higher learning rate of 1e-3 and 1K warm-up steps... During Step 2, the entire model is trained end-to-end for 8 epochs with a lower learning rate of 5e-5 and with 10K warm-up steps. The PATENTMME model is trained on 8 V100 GPUs, with an effective batch size of 64 and Adam (Kingma and Ba 2014) optimizer. PATENTLMM: ...train our PATENTLMM with an effective batch size of 192 on 3 A100 GPUs (40 GB). Stage 1 training progresses at a higher learning rate of 1e-3, and stage 2 training takes place at a learning rate of 2e-4 with a cosine schedule, for 12K steps using Adam optimizer. |
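
The patent-exclusive split quoted in the Dataset Splits row can be illustrated with a minimal Python sketch. This is not the authors' code: the function name `patent_disjoint_split`, its parameters, and the `image_to_patent` mapping are hypothetical, and the sketch only assumes what the quoted text states (hold out a random set of patents, draw validation and test images from the held-out pool, discard the rest).

```python
import random
from collections import defaultdict


def patent_disjoint_split(image_to_patent, n_heldout_patents, n_val, n_test, seed=0):
    """Split image IDs so that no training patent appears in val or test.

    image_to_patent: dict mapping image ID -> patent ID (hypothetical input format).
    Returns (train, val, test) lists of image IDs.
    """
    rng = random.Random(seed)

    # Group images by the patent they come from.
    patent_to_images = defaultdict(list)
    for image_id, patent_id in image_to_patent.items():
        patent_to_images[patent_id].append(image_id)

    # Hold out a random subset of patents; all of their images leave the train set.
    patents = sorted(patent_to_images)
    heldout = set(rng.sample(patents, n_heldout_patents))

    train = [img for p in patents if p not in heldout for img in patent_to_images[p]]
    pool = [img for p in patents if p in heldout for img in patent_to_images[p]]

    # Sample val and test images from the held-out pool; leftover images are discarded.
    rng.shuffle(pool)
    val = pool[:n_val]
    test = pool[n_val:n_val + n_test]
    return train, val, test
```

In this sketch, exclusivity holds only between train and the combined validation and test sets, as in the quoted description; validation and test images may come from the same patent. That reading appears consistent with Table 1, where the validation and test patent counts (8,027 and 7,964) together exceed the 12.6K held-out patents.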