AST-T5: Structure-Aware Pretraining for Code Generation and Understanding
Authors: Linyuan Gong, Mostafa Elhoushi, Alvin Cheung
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Evaluations show that AST-T5 consistently outperforms similar-sized LMs across various code-related tasks including HumanEval and MBPP. In our experiments, AST-T5 consistently outperforms baselines in code generation, transpilation, and understanding tasks. Through controlled experiments, we empirically demonstrate that these advancements are attributed to our AST-aware pretraining techniques. |
| Researcher Affiliation | Collaboration | ¹Department of EECS, University of California at Berkeley, Berkeley, California, USA; ²AI at Meta, USA. Correspondence to: Linyuan Gong <gly@berkeley.edu>. |
| Pseudocode | Yes | Algorithm 1 (Dynamic Programming in AST-Aware Segmentation) and Algorithm 2 (Subtree Selection in AST-Aware Subtree Corruption); illustrative sketches of both appear after this table. |
| Open Source Code | Yes | Our code and model are publicly available at https://github.com/gonglinyuan/ast_t5. |
| Open Datasets | Yes | AST-T5 is pretrained on a subset of The Stack Dedup corpus (Kocetkov et al., 2022) |
| Dataset Splits | No | The paper states it finetunes on training datasets and evaluates on test sets, but does not provide specific train/validation/test split percentages or sample counts for each dataset. For example, 'We finetune AST-T5 on the training datasets of all downstream tasks' and 'For the HumanEval task, which lacks its own training dataset, we use CodeSearchNet (Husain et al., 2020)' without specifying split details. |
| Hardware Specification | Yes | Pretraining uses PyTorch, Fairseq, and FlashAttention (Dao et al., 2022) and is conducted on 8 nodes, each with 8x NVIDIA A100 40GB GPUs. |
| Software Dependencies | No | The paper mentions 'PyTorch, Fairseq, and FlashAttention (Dao et al., 2022)' and the 'cl100k_base byte-level BPE vocabulary from tiktoken', but does not provide specific version numbers for these software dependencies. URLs are provided for Fairseq and tiktoken but not specific versions. |
| Experiment Setup | Yes | Table 1: Pretraining hyperparameters for AST-T5. Encoder Layers: 12; Decoder Layers: 12; Hidden Dimension: 768; Peak Learning Rate: 2e-4; Batch Size: 1,024; Warm-Up Steps: 10,000; Total Steps: 500,000; Sequence Length: 1,024; Mask Ratio: 25%; Min Subtree Corruption Threshold θ: 5; Max Subtree Corruption Threshold θ: 100; Min Text Corruption Span Length: 1; Max Text Corruption Span Length: 10; Relative Position Encoding Buckets: 32; Relative Position Encoding Max Distance: 128; Adam ϵ: 1e-6; Adam (β1, β2): (0.9, 0.98); Clip Norm: 2.0; Dropout: 0.1; Weight Decay: 0.01. |
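The Pseudocode row refers to Algorithm 1, the dynamic-programming pass in AST-Aware Segmentation. The paper's pseudocode is not reproduced here; the following is a minimal sketch of the general DP idea, assuming a hypothetical per-position `split_cost` array (in the paper, the penalty for a split is derived from the AST spans that the split would break) and a hypothetical `max_len` segment-length budget.

```python
def segment_min_cost(split_cost, max_len):
    """Sketch of the DP behind AST-aware segmentation.

    split_cost[i] is an assumed penalty for placing a segment boundary at
    position i (derived from the AST in the actual paper). Returns boundary
    positions of a segmentation in which every segment covers at most
    max_len positions and the total penalty is minimal.
    """
    n = len(split_cost)
    INF = float("inf")
    best = [INF] * (n + 1)   # best[i]: minimal cost to segment positions [0, i)
    prev = [-1] * (n + 1)    # back-pointers to recover the chosen boundaries
    best[0] = 0.0
    for i in range(1, n + 1):
        # the last segment covers positions [j, i); it may span at most max_len
        for j in range(max(0, i - max_len), i):
            boundary_penalty = split_cost[j] if j > 0 else 0.0
            cost = best[j] + boundary_penalty
            if cost < best[i]:
                best[i], prev[i] = cost, j
    # walk the back-pointers from the end to recover the boundaries
    boundaries, i = [], n
    while i > 0:
        boundaries.append(prev[i])
        i = prev[i]
    return sorted(b for b in boundaries if b > 0)
```

For instance, `segment_min_cost([0, 3, 1, 0, 2, 5, 0, 1], max_len=3)` returns `[3, 6]`, splitting only at zero-penalty positions. The exact cost definition and granularity of the paper's Algorithm 1 are not reproduced here.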
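Algorithm 2 selects AST subtrees as corruption spans. The sketch below is an assumption-laden illustration, not the paper's procedure: it walks a parsed tree and treats any subtree whose token span falls between the θ thresholds from Table 1 (5 and 100) as a candidate mask span, stopping once roughly the 25% mask ratio is covered. The `Node` structure and the random-acceptance policy are hypothetical.

```python
import random


class Node:
    """Hypothetical AST node: token-span size plus child nodes."""
    def __init__(self, size, children=()):
        self.size = size              # number of tokens covered by this subtree
        self.children = list(children)


def select_subtrees(root, theta_min=5, theta_max=100, mask_ratio=0.25, rng=random):
    """Illustrative subtree selection for AST-aware span corruption.

    Collects subtrees whose token span lies in [theta_min, theta_max] until
    about mask_ratio of the root's tokens are covered. The real Algorithm 2
    in the paper may traverse and sample differently.
    """
    budget = mask_ratio * root.size
    selected, covered = [], 0
    stack = [root]
    while stack and covered < budget:
        node = stack.pop()
        if theta_min <= node.size <= theta_max and rng.random() < 0.5:
            selected.append(node)        # mask this whole subtree
            covered += node.size
        else:
            stack.extend(node.children)  # otherwise descend into children
    return selected
```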
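For convenience, the Table 1 values quoted in the Experiment Setup row can be collected into a single configuration dictionary. The sketch below only restates those numbers; the key names are illustrative and are not the paper's or the repository's API.

```python
# Pretraining hyperparameters for AST-T5, transcribed from Table 1 of the paper.
# Key names are illustrative; only the values come from the paper.
AST_T5_PRETRAINING_CONFIG = {
    "encoder_layers": 12,
    "decoder_layers": 12,
    "hidden_dimension": 768,
    "peak_learning_rate": 2e-4,
    "batch_size": 1024,
    "warmup_steps": 10_000,
    "total_steps": 500_000,
    "sequence_length": 1024,
    "mask_ratio": 0.25,
    "min_subtree_corruption_threshold_theta": 5,
    "max_subtree_corruption_threshold_theta": 100,
    "min_text_corruption_span_length": 1,
    "max_text_corruption_span_length": 10,
    "relative_position_encoding_buckets": 32,
    "relative_position_encoding_max_distance": 128,
    "adam_epsilon": 1e-6,
    "adam_betas": (0.9, 0.98),
    "clip_norm": 2.0,
    "dropout": 0.1,
    "weight_decay": 0.01,
}
```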