Multiple sequence alignment as a sequence-to-sequence learning problem

Authors: Edo Dotan, Yonatan Belinkov, Oren Avram, Elya Wygoda, Noa Ecker, Michael Alburquerque, Omri Keren, Gil Loewenthal, Tal Pupko

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our approach leads to alignment accuracy that is similar to and often better than that of commonly used methods, such as MAFFT, DIALIGN, Clustal W, T-Coffee, PRANK, and MUSCLE. Our analyses demonstrate that BetaAlign has comparable, and in some cases superior, accuracy compared to the most popular MSA algorithms: T-Coffee (Notredame et al., 2000), Clustal W (Larkin et al., 2007), DIALIGN (Morgenstern, 2004), MUSCLE (Edgar, 2004), MAFFT (Katoh & Standley, 2013), and PRANK (Löytynoja & Goldman, 2008). We compared the performance of BetaAlign to the state-of-the-art alignment algorithms: Clustal W (Larkin et al., 2007), DIALIGN (Morgenstern, 2004), MAFFT (Katoh & Standley, 2013), T-Coffee (Notredame et al., 2000), PRANK (Löytynoja & Goldman, 2008), and MUSCLE (Edgar, 2004). For each number of sequences from two to ten, performance was compared on a simulated test dataset comprising 3,000 nucleotide MSAs (Fig. 4a).
Researcher Affiliation | Academia | Edo Dotan (1), Yonatan Belinkov (2), Oren Avram (3), Elya Wygoda (1), Noa Ecker (1), Michael Alburquerque (1), Omri Keren (1), Gil Loewenthal (1), and Tal Pupko (1). (1) Tel Aviv University; (2) Technion – Israel Institute of Technology; (3) University of California, Los Angeles.
Pseudocode | No | The paper describes the approach and methods in textual form and through figures, but it does not include any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper mentions using and modifying the Fairseq library, but it does not provide a link or an explicit statement about releasing the source code for BetaAlign itself.
Open Datasets | No | The paper states, "We simulated the training alignments using SpartaABC (Loewenthal et al., 2021)," and describes the simulation parameters, but it does not provide concrete access information (link, DOI, repository, or formal citation of a pre-existing public dataset) for the generated datasets themselves. The data are generated by the authors rather than drawn from a publicly available external dataset.
Dataset Splits | Yes | Six text files are used as input for the pre-processing step before actually running the transformer: unaligned sequences and true MSAs for the training, validation, and testing data. We generated 395,000 and 3,000 protein MSAs with ten sequences that were used as training and testing data, respectively. For the dataset used in Fig. 5, we generated 395,000 and 3,000 protein MSAs with five sequences, which were used as training and testing data, respectively.
Hardware Specification | Yes | All model evaluations were executed on GPU machines (Tesla V100-SXM2-32GB).
Software Dependencies | No | The paper mentions using the "Fairseq library (Ott et al., 2019)," but it does not provide an explicit version number for Fairseq itself or for Python. It lists versions for the comparison tools (e.g., MUSCLE v3.8.1551, MAFFT v7.475), but not for the software components of the authors' own implementation.
Experiment Setup | Yes | We first assessed various transformer configurations, which differ in their training parameters: max tokens, learning rate, and warmup updates. The learning rate and warmup values for both transformers are 5E-5 and 3,000, respectively. The max-token values are 4,096 and 2,048 for the original and alternative transformers, respectively. We used label-smoothed cross entropy (Szegedy et al., 2015) to compute the loss of the model, with a dropout (Srivastava et al., 2014) rate of 0.3. We used Adam (Kingma & Ba, 2014) as the optimizer, with forgetting factors of 0.9 for the first moments and 0.98 for the second moments of the gradients.
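The label-smoothed cross-entropy objective quoted in the Experiment Setup row can be sketched in plain Python. This is a minimal illustration of the Szegedy et al. (2015) formulation, not the authors' Fairseq code; the smoothing value `eps` is an assumption, since the excerpt does not state it:

```python
import math

def label_smoothed_nll(log_probs, target, eps=0.1):
    """Cross entropy against a smoothed target distribution
    q = (1 - eps) * one_hot(target) + (eps / V) * uniform,
    where V is the size of the output vocabulary."""
    V = len(log_probs)
    uniform_term = -sum(log_probs) / V   # expected NLL under the uniform part
    nll = -log_probs[target]             # ordinary negative log-likelihood
    return (1.0 - eps) * nll + eps * uniform_term

# Example: a 3-symbol output distribution favouring the correct symbol.
log_probs = [math.log(p) for p in (0.7, 0.2, 0.1)]
loss = label_smoothed_nll(log_probs, target=0, eps=0.1)
```

With `eps=0` this reduces to standard cross entropy; the smoothing term penalizes over-confident predictions, which is why it is commonly paired with dropout in transformer training.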
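The Dataset Splits row quotes fixed split sizes (395,000 training and 3,000 test MSAs, plus a validation set of unstated size) feeding three pairs of unaligned-sequence / true-MSA files. A deterministic partition of the simulated alignments along those lines might look like the following hypothetical helper (`split_msas` and its parameter names are illustrative, not from the paper):

```python
def split_msas(msas, n_train, n_valid, n_test):
    """Partition a list of simulated MSAs into train/validation/test
    subsets, mirroring the three unaligned-sequence / true-MSA file
    pairs the pre-processing step expects."""
    if n_train + n_valid + n_test > len(msas):
        raise ValueError("not enough simulated MSAs for the requested split")
    train = msas[:n_train]
    valid = msas[n_train:n_train + n_valid]
    test = msas[n_train + n_valid:n_train + n_valid + n_test]
    return train, valid, test

# Toy example with small counts (the paper used 395,000 / 3,000
# for training / testing).
train, valid, test = split_msas(list(range(100)), n_train=80, n_valid=10, n_test=10)
```

Because the MSAs are simulated independently, a simple contiguous slice like this yields disjoint splits with no leakage between training and test data.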