AST-T5: Structure-Aware Pretraining for Code Generation and Understanding

Authors: Linyuan Gong, Mostafa Elhoushi, Alvin Cheung

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Evaluations show that AST-T5 consistently outperforms similar-sized LMs across various code-related tasks including HumanEval and MBPP. In our experiments, AST-T5 consistently outperforms baselines in code generation, transpilation, and understanding tasks. Through controlled experiments, we empirically demonstrate that these advancements are attributed to our AST-aware pretraining techniques.
Researcher Affiliation | Collaboration | (1) Department of EECS, University of California at Berkeley, Berkeley, California, USA; (2) AI at Meta, USA. Correspondence to: Linyuan Gong <gly@berkeley.edu>.
Pseudocode | Yes | Algorithm 1 (Dynamic Programming in AST-Aware Segmentation) and Algorithm 2 (Subtree Selection in AST-Aware Subtree Corruption); an illustrative segmentation sketch follows the table.
Open Source Code | Yes | Our code and model are publicly available at https://github.com/gonglinyuan/ast_t5.
Open Datasets | Yes | AST-T5 is pretrained on a subset of The Stack Dedup corpus (Kocetkov et al., 2022).
Dataset Splits | No | The paper states that it finetunes on training datasets and evaluates on test sets but does not provide specific train/validation/test split percentages or sample counts for each dataset. For example: 'We finetune AST-T5 on the training datasets of all downstream tasks' and 'For the HumanEval task, which lacks its own training dataset, we use CodeSearchNet (Husain et al., 2020)', without specifying split details.
Hardware Specification | Yes | Pretraining uses PyTorch, Fairseq, and FlashAttention (Dao et al., 2022) and is conducted on 8 nodes, each with 8x NVIDIA A100 40GB GPUs.
Software Dependencies | No | The paper mentions 'PyTorch, Fairseq and FlashAttention (Dao et al., 2022)' and the 'cl100k_base byte-level BPE vocabulary from tiktoken', but it does not give version numbers for these dependencies. URLs are provided for Fairseq and tiktoken, but no specific versions.
Experiment Setup | Yes | Table 1: Pretraining hyperparameters for AST-T5: Encoder Layers 12, Decoder Layers 12, Hidden Dimension 768, Peak Learning Rate 2e-4, Batch Size 1,024, Warm-Up Steps 10,000, Total Steps 500,000, Sequence Length 1,024, Mask Ratio 25%, Min Subtree Corruption Threshold θ 5, Max Subtree Corruption Threshold θ 100, Min Text Corruption Span Length 1, Max Text Corruption Span Length 10, Relative Position Encoding Buckets 32, Relative Position Encoding Max Distance 128, Adam ϵ 1e-6, Adam (β1, β2) (0.9, 0.98), Clip Norm 2.0, Dropout 0.1, Weight Decay 0.01. (These values are restated as a configuration dict below the table.)
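
For convenience, the Table 1 hyperparameters quoted in the Experiment Setup row can be collected into a single configuration object. The sketch below only restates the reported values; the dict and its key names are our own and do not mirror the keys used in the released code.

```python
# Pretraining hyperparameters for AST-T5, as reported in Table 1 of the paper.
# Key names are illustrative, not taken from the released repository.
AST_T5_PRETRAIN_CONFIG = {
    "encoder_layers": 12,
    "decoder_layers": 12,
    "hidden_dim": 768,
    "peak_learning_rate": 2e-4,
    "batch_size": 1024,
    "warmup_steps": 10_000,
    "total_steps": 500_000,
    "sequence_length": 1024,
    "mask_ratio": 0.25,
    "min_subtree_corruption_threshold": 5,    # theta (min)
    "max_subtree_corruption_threshold": 100,  # theta (max)
    "min_text_corruption_span_length": 1,
    "max_text_corruption_span_length": 10,
    "relative_position_encoding_buckets": 32,
    "relative_position_encoding_max_distance": 128,
    "adam_eps": 1e-6,
    "adam_betas": (0.9, 0.98),
    "clip_norm": 2.0,
    "dropout": 0.1,
    "weight_decay": 0.01,
}
```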
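
The Pseudocode row refers to Algorithm 1, a dynamic program for AST-aware segmentation. Below is a minimal, illustrative sketch of that style of segmentation DP, assuming a precomputed per-position break cost (for example, the number of AST node spans a boundary at that position would cut). The function name, signature, and cost model are our own simplification, not the paper's exact formulation.

```python
def ast_aware_segment(break_costs, max_len):
    """Choose segment boundaries over a token sequence so that every segment
    has at most `max_len` tokens while minimizing the total break cost.

    break_costs[k] is the penalty for placing a boundary right after token k
    (index 0 = before the first token, index n = after the last token), e.g.
    how many AST node spans a boundary at position k would cut.
    Returns the list of chosen boundary positions, ending at n.
    """
    n = len(break_costs) - 1
    INF = float("inf")
    dp = [INF] * (n + 1)   # dp[k] = min cost to segment the first k tokens
    back = [0] * (n + 1)   # back[k] = previous boundary, for reconstruction
    dp[0] = 0
    for k in range(1, n + 1):
        # The previous boundary j can be at most max_len tokens back.
        for j in range(max(0, k - max_len), k):
            cand = dp[j] + break_costs[k]
            if cand < dp[k]:
                dp[k], back[k] = cand, j
    # Walk the back pointers to recover the chosen boundaries.
    bounds, k = [], n
    while k > 0:
        bounds.append(k)
        k = back[k]
    return bounds[::-1]


# Toy usage: 10 tokens, zero-cost boundaries at positions 3, 6, 9, and 10.
costs = [0, 3, 3, 0, 2, 2, 0, 1, 1, 0, 0]
print(ast_aware_segment(costs, max_len=4))  # -> [3, 6, 10]
```

The paper's actual algorithm works on AST node boundaries produced by parsing the source file; the sketch above only conveys the recurrence style (each boundary choice depends on a previous boundary at most max_len tokens back), with O(n x max_len) time.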