DocEdit: Language-Guided Document Editing

Authors: Puneet Mathur, Rajiv Jain, Jiuxiang Gu, Franck Dernoncourt, Dinesh Manocha, Vlad I. Morariu

AAAI 2023

Reproducibility assessment. Each variable below lists the result and the supporting LLM response or paper excerpt.
Research Type: Experimental. "We also propose DocEditor, a Transformer-based localization-aware multimodal (textual, spatial, and visual) model that performs the new task. The model attends to both document objects and related text contents which may be referred to in a user edit request, generating a multimodal embedding that is used to predict an edit command and associated bounding box localizing it. Our proposed model empirically outperforms other baseline deep learning approaches by 15-18%, providing a strong starting point for future work."
Researcher Affiliation: Collaboration. Puneet Mathur¹*, Rajiv Jain², Jiuxiang Gu², Franck Dernoncourt², Dinesh Manocha¹, Vlad I. Morariu²; ¹University of Maryland, College Park ({puneetm, dmanocha}@umd.edu); ²Adobe Research ({rajijain, jigu, dernoncourt, morariu}@adobe.com).
Pseudocode: No. No pseudocode or clearly labeled algorithm blocks were found.
Open Source Code: No. No explicit statement about releasing source code for the methodology was found, nor a link to a code repository.
Open Datasets: Yes. "In support of this task, we curate the DocEdit dataset, a collection of approximately 28K instances of user edit requests over PDF and design templates along with their corresponding ground truth software executable commands. We extracted 20K anonymized PDF documents from the publicly available Enron corpus and Hierarchical forms (Aggarwal et al. 2020) datasets with all personally identifiable information (PII) removed for DocEdit-PDF. We downloaded 12K publicly available and freely distributed design templates from the Adobe Express platform for DocEdit-Design."
Dataset Splits: Yes. "Data Splits: We split both DocEdit-PDF and DocEdit-Design into train, validation, and test in the ratio of 70:10:20." (A minimal split sketch appears after this assessment.)
Hardware Specification: No. No specific hardware details (such as GPU/CPU models, memory, or cloud instance types) used for running the experiments were provided.
Software Dependencies: No. No specific version numbers for software dependencies (e.g., libraries, frameworks, or programming languages) were provided; the paper only cites the papers describing the tools or models it uses (e.g., T5, SentencePiece, fastText).
Experiment Setup: Yes. "Command Generation Loss: For generating the textual part of the desired output command, we utilize the pre-trained weights of T5... We finetune the backbone Transformer architecture using standard maximum likelihood, i.e. using teacher forcing (Williams and Zipser 1989) and a cross-entropy loss... Bounding Box Regression Loss: ...We utilize a weighted sum of the scale-invariant generalized IoU loss (GIoU) (Rezatofighi et al. 2019) and the smooth L1 loss... λ is a hyperparameter. Anchor Box Prediction Loss: ...We optimize the anchor prediction as an auxiliary task through the binary cross-entropy loss... Multitask Training: ...The final optimization uses a weighted sum of Lgen, Lreg, Lanchor such that the total loss L = λ1·Lgen + λ2·Lreg + (1 − λ1 − λ2)·Lanchor, where the weighting factors λ1, λ2 are hyperparameters."
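To make the Experiment Setup excerpt above concrete, here is a minimal PyTorch-style sketch of the stated multitask objective: token-level cross-entropy for command generation under teacher forcing, a weighted GIoU plus smooth-L1 term for box regression, binary cross-entropy for anchor prediction, and the weighted sum L = λ1·Lgen + λ2·Lreg + (1 − λ1 − λ2)·Lanchor. Since no code is released, the tensor shapes, the box format, the λ values, the exact GIoU/L1 mixing, and the use of torchvision's generalized_box_iou_loss are all assumptions, not the authors' implementation.

import torch
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou_loss

def doc_edit_loss(gen_logits, gen_targets,
                  pred_boxes, gt_boxes,
                  anchor_logits, anchor_targets,
                  lam=0.5, lam1=0.4, lam2=0.4):
    # Command generation: cross-entropy over output tokens (teacher forcing).
    # Assumed shapes: gen_logits (batch, seq, vocab), gen_targets (batch, seq).
    l_gen = F.cross_entropy(gen_logits.flatten(0, 1), gen_targets.flatten())

    # Box regression: weighted sum of GIoU and smooth L1; the paper only says
    # the sum is weighted by a hyperparameter λ, so the convex form is a guess.
    # Boxes are assumed to be (batch, 4) in (x1, y1, x2, y2) format.
    l_giou = generalized_box_iou_loss(pred_boxes, gt_boxes, reduction="mean")
    l_l1 = F.smooth_l1_loss(pred_boxes, gt_boxes)
    l_reg = lam * l_giou + (1 - lam) * l_l1

    # Anchor prediction: BCE over candidate anchors (targets are 0/1 floats).
    l_anchor = F.binary_cross_entropy_with_logits(anchor_logits, anchor_targets)

    # Multitask total: the three weights sum to 1, matching the excerpt.
    return lam1 * l_gen + lam2 * l_reg + (1 - lam1 - lam2) * l_anchor

Constraining the three task weights to sum to one, as the excerpt specifies, keeps the total loss on a fixed scale while trading the tasks off against each other.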
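For the Dataset Splits row, a minimal sketch of the reported 70:10:20 split. The paper does not say whether the split is random, stratified, or grouped by document, so the shuffle, seed, and record handling below are assumptions for illustration.

import random

def split_dataset(records, seed=42):
    """Split a list of examples into train/val/test at a 70:10:20 ratio."""
    records = list(records)
    random.Random(seed).shuffle(records)  # assumed: a simple random split
    n = len(records)
    n_train = int(0.7 * n)
    n_val = int(0.1 * n)
    train = records[:n_train]
    val = records[n_train:n_train + n_val]
    test = records[n_train + n_val:]
    return train, val, test

train, val, test = split_dataset(range(28_000))  # ~28K DocEdit instances
print(len(train), len(val), len(test))  # 19600 2800 5600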