An alternative to structural conservation is homology meant in a phylogenetic sense. In that case, the alignment of two residues is a statement that these two residues share a similar relation to their closest common ancestor. Aside from evolutionary inertia, there is no well-defined reason why a structure-based and a homology-based alignment of the same sequences should be identical. Likewise, in a functionally correct alignment, residues having the same function need to be aligned, even if their similarity results from convergent evolution.
Overall, a multiple sequence alignment is merely a way of confronting and organizing the specific data one is interested in. Until recently, this notion was mostly theoretical, sequence information being almost the only available data. Alignments were optimized for their sequence similarity, in the hope that this one-size-fits-all approach would fare well for most applications.
The situation has now changed dramatically and the amount of data that could be integrated when building an MSA is rising by the day. It includes new sequences coming from large-scale genome sequencing, with a density of information that will make it more and more possible to reconstruct evolutionarily correct alignments (Frazer et al.).
Other high-throughput projects are delivering functional data in the form of transcript structure (Birney et al.). Another ongoing trend is the delivery of large-scale functional data, resulting from the use of robotic techniques.
These make it possible to gather large amounts of functional information associated with homologous sequences (Fabian et al.). This data is usually applied to Quantitative Structure-Activity Relationship (QSAR) analysis, but it could just as well be used when comparing protein sequences. These trends have not gone unnoticed, and over the last years regular efforts have been made to develop and improve multiple sequence alignment methods so that they can take advantage of newly available data.
Three areas have been actively explored: (i) accuracy improvement, achieved through the use of consistency-based methods (Do et al.).
Most of the MSA methods currently available have been described and compared at length in several comprehensive reviews (Edgar and Batzoglou; Notredame; Pei; Wallace et al.). In this specific review, we will mostly focus on the latest developments, in an attempt to identify the main trends of this very active research field.
We will also discuss an important challenge: the development of adequate benchmarking techniques, informative enough with respect to all potential applications of MSA methods, especially the reconstruction of accurate phylogenies. The urgency of this issue recently received a striking illustration with two high-impact papers dealing with the complex relationship that exists between MSA reconstruction and accurate phylogenetic estimation (Loytynoja and Goldman; Wong et al.).
Multiple sequence alignment computation stands at a crossroads between computation and biology. The computational issue is as complex to solve as it is straightforward to describe: given any sensible biological criterion, the computation of an exact MSA is NP-complete and therefore impossible for all but unrealistically small datasets (Wang and Jiang). MSA computation therefore depends on approximate algorithms, or heuristics, and it is worth mentioning that almost every conceivable optimization technique has been adapted into a heuristic multiple sequence aligner.
The biological issue surrounding MSAs is even more complex: given a set of sequences, we do not know how to estimate similarity in a way that will guarantee the biological correctness of an alignment, whether this correctness is defined in evolutionary, structural or functional terms. In fact, one could argue that being able to compare the biological features coded by a DNA sequence implies having solved most of the ab initio problems associated with genetic information interpretation, including protein structure prediction.
But these problems are not solved, and in practice multiple alignments are estimated by maximizing identity, in the hope that this simplistic criterion will be sufficiently informative to yield models usable for most types of biological inference. The objective function thus maximized is usually defined with a substitution matrix and a gap penalty scheme. The substitution matrix is relatively sophisticated when it comes to proteins, but barely more than an identity matrix for DNA and RNA.
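As a concrete illustration, the objective function described above can be sketched in a few lines. Everything here is illustrative: the function name, the matrix and the gap value are hypothetical, and real aligners optimize this quantity over all possible alignments rather than scoring a fixed one.

```python
# Score a fixed pairwise alignment with a substitution matrix and a
# linear gap penalty (all names and values are illustrative).

def alignment_score(row_a, row_b, subst, gap=-4):
    assert len(row_a) == len(row_b)
    score = 0
    for x, y in zip(row_a, row_b):
        if x == '-' or y == '-':
            score += gap                       # linear gap penalty
        else:
            # look the pair up in the (symmetric) substitution matrix
            score += subst.get((x, y), subst.get((y, x), 0))
    return score

# As noted above, a DNA "matrix" is often barely more than identity:
dna = {(a, a): 1 for a in "ACGT"}              # match 1, mismatch 0
```

With `dna`, scoring `"GAT-ACA"` against `"GATTACA"` rewards the six identities and charges a single gap, illustrating how the matrix and the gap scheme jointly define the objective.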
For a long time, the maximization was carried out using a combination of dynamic programming (DP) and a log-odds scoring scheme, but over recent years, Bayesian techniques have been implemented that rely on pair-hidden Markov models (pair-HMMs) and take advantage of a better-defined statistical framework (Durbin et al.).
While DP- and HMM-based approaches are mostly interchangeable, the latter make it easier to explore the parameter space using off-the-shelf statistical tools such as Baum-Welch and Viterbi training. HMM modeling also offers easy access to a wider range of scoring possibilities, thanks to posterior decoding, thus making it possible to assess complex alignment scoring schemes. For instance, a significant share of the improvements measured in the ProbCons package (Do et al.) has been attributed to its posterior-decoding-based scoring scheme.
Sequence identity is only a crude substitute for biological homology, and it has often been argued that structurally correct alignments are those most likely to be useful for further biological modeling. As a consequence, tuning and validation have relied on the systematic usage of structure-based reference multiple sequence alignments. This procedure has now been in use for more than a decade and has been a major shaping force on this entire field of research.
We will now review the most common validation procedures along with their associated databases.

Table 1 caption: Benchmarking of a selection of methods on the RV11 BaliBase dataset. All packages were run using the default parameters; servers were run in August.

The first systematic validation of multiple sequence alignments using reference alignments was carried out by McClure, who evaluated her alignments by assessing the correct alignment of pre-defined functional motifs.
Shortly after, Notredame and Higgins made the first attempt to systematically use structure-based alignments while evaluating the biological accuracy of the SAGA package. The validation was carried out on a relatively small dataset named 3D-ali (Pascarella et al.).
The main specificity of BaliBase was to address a wide range of different issues related to multiple sequence alignments. Its main weakness was the questionable accuracy of some alignments and the relatively small size (82 alignments) of the dataset.
Most of these issues have been addressed in the latest version of BaliBase (BaliBase 3; Thompson et al.). Nonetheless, BaliBase remains a handmade dataset, with potentially arbitrary and uneven biases resulting from human intervention. Prefab, by contrast, is not a multiple sequence alignment collection, since each dataset only contains a pair of structures, making it less stringent than BaliBase, where accuracy can be tested on entire multiple alignment columns rather than pairs of residues.
Other commonly used databases for protein multiple sequence alignments include OXBench (Raghava et al.). One may wonder why so many resources exist for addressing an apparently simple question. The answer probably lies in the complexity of structural alignments.
While reasonably accurate structure-based alignments are easy enough to generate, owing to the strength of the structural signal, it is nonetheless very hard to objectively assess the relative merits of alternative structure-based alignments (Kolodny et al.). Several alternative references are therefore available and no simple way exists to objectively evaluate their relative merits. In practice, authors have made a habit of running their methods on two or three datasets, verifying that the trends agree.
Recently, Blackshields and Higgins (Blackshields et al.) carried out a systematic comparison of the main reference datasets. The main trend uncovered by this analysis is that all the empirical reference datasets tend to yield similar results, quite significantly distinct from those measured on artificial datasets such as IRMbase (Subramanian et al.). We checked this by re-analyzing some of the Blackshields and Higgins benchmark data (Table 2) in the context of this review.
The methodology is very straightforward: each reference dataset is divided into sub-categories, and altogether the six datasets make a total of 43 sub-categories (34 for the empirical datasets, 9 for the artificial ones). Given two MSA methods A and B, we counted how many times the ranking suggested by one sub-category is in agreement with the ranking suggested by another sub-category ('agreement' in Table 2).
We then compared all the sub-categories of a dataset against all the sub-categories of the other datasets and reported the average figure in Table 2.
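The agreement statistic just described can be sketched in a few lines. This is a minimal sketch: the method names and accuracy values are made up, and two sub-categories are counted as agreeing on a pair of methods when they rank that pair the same way.

```python
# Fraction of method pairs ranked identically by two sub-categories.

from itertools import combinations

def agreement(scores_1, scores_2):
    """scores_*: dict mapping method name -> accuracy in one sub-category."""
    methods = sorted(set(scores_1) & set(scores_2))
    agree = total = 0
    for a, b in combinations(methods, 2):
        total += 1
        # sign of the ranking of the pair in each sub-category
        r1 = (scores_1[a] > scores_1[b]) - (scores_1[a] < scores_1[b])
        r2 = (scores_2[a] > scores_2[b]) - (scores_2[a] < scores_2[b])
        if r1 == r2:
            agree += 1
    return agree / total

# Hypothetical accuracies of three methods in two sub-categories:
cat1 = {"T-Coffee": 68.0, "ClustalW": 50.1, "Muscle": 57.2}
cat2 = {"T-Coffee": 71.5, "ClustalW": 55.0, "Muscle": 53.0}
```

Here the two sub-categories agree on two of the three method pairs (they disagree on ClustalW versus Muscle), giving an agreement of 2/3.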
We also computed the average agreement within every dataset by measuring the agreement across different categories within a dataset. The results in Table 2 suggest that the five main empirical datasets are, on average, in substantial agreement: any prediction of accuracy made on the basis of a single reference dataset is likely to be supported by the others. A striking observation is the lower agreement between the artificial dataset IRMbase and the empirical ones.
Observations made on IRMbase are, on average, only weakly supported by the empirical datasets. Two factors could explain this discrepancy: the local nature of IRMbase, mostly designed for assessing local alignment capacities, or its artificial nature. The behavior of empirical datasets biased toward local similarity (BaliBase RV50, long indels) argues against the first explanation. Furthermore, at least three other studies reported similar findings, with results established on artificial datasets conflicting with empirical ones (Lassmann and Sonnhammer; Loytynoja and Goldman). Table 2 shows a new analysis of the original data.
The agreement is defined as the number of times the sub-categories of two given databases agree on the relative accuracy of two methods. The last two rows show the average agreement between all datasets and between all empirical datasets, respectively. While there is no clear consensus on this matter, we would argue here that the discrepancy between artificial and empirical datasets pleads in favor of not using the artificial ones.
The use of artificial datasets should probably be restricted to situations where the process responsible for sequence generation is well known and properly modeled, as happens in sequence assembly for instance.
It is interesting to note that some sub-categories of BaliBase are extremely informative albeit relatively small. RV11, for instance, is especially challenging. So far, this dataset has proven fairly resistant to heavy tuning and over-fitting, and it is striking that ProbCons, the only package explicitly trained on BaliBase, is not the most accurate, as shown in Table 1.
Table 1 shows a systematic benchmarking of most methods discussed here on the RV11 dataset. Results are in broad agreement with those reported in most benchmarking studies published over the last 10 years, but the challenging nature of the dataset makes it easier to reveal significant differences in accuracy that are otherwise blurred by less challenging datasets.
BaliBase has had a strong influence on the field, prompting the design of novel reference datasets for sequences other than proteins. BraliBase works along the same lines as BaliBase and relies on a comparison between an RNA alignment and its structure-based counterpart.
There is, nonetheless, a clear difference between these two reference datasets: in BraliBase, the reference structures are only predicted, and the final evaluation combines a comparison with the reference and an estimation of the predictive capacity of the new alignment.
As such, BraliBase is at the same time more sophisticated than BaliBase, because it evaluates the predictive capacity of the alignment, and less powerful, because it is not based on a sequence-independent method, unlike BaliBase, which uses structural comparison. This limitation results from the relative scarcity of RNA 3D structures in databases.
We will see in the last section of this review that current benchmarking strategies have many shortcomings and cannot address all the situations relevant to MSA evaluation. These methods have nonetheless been used to validate all the currently available multiple sequence alignment packages and can certainly be credited (or blamed) for having re-focused the entire methodological development toward the production of structurally correct alignments.
Well-standardized reference datasets have also gradually pushed the MSA field toward becoming a fairly codified discipline, where all contenders try to improve over each other's methods by developing increasingly sophisticated algorithms, all tested in the same arena. Given the increased accuracies reported in recent years, one may either consider the case closed or suspect that the time has come to change arena.
With the exception of POA (Lee et al.), most current packages rely on the progressive alignment strategy. This popular MSA assembly algorithm is a straightforward agglomerative procedure. Sequences are first compared two by two in order to fill up a distance matrix containing the percent identity between each pair; this matrix is then used to compute a guide tree.
The agglomerative algorithm follows the tree topology thus defined and works its way from the leaves to the root, aligning at each encountered node the two sequences or profiles associated with its children. The procedure can be applied using any algorithm able to align two sequences or two alignments.
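The whole strategy can be condensed into a toy implementation. This is a sketch only, assuming identity scoring, a linear gap penalty and average-linkage merge order; none of the helper names come from any published package.

```python
# Toy progressive aligner: distance matrix -> agglomerative merging,
# aligning profiles (lists of gapped rows) with column-level NW.

from itertools import combinations

def col_score(col_a, col_b):
    # average identity between the characters of two profile columns
    hits = sum(1 for x in col_a for y in col_b if x == y != '-')
    return hits / (len(col_a) * len(col_b))

def align_profiles(A, B, gap=-1.0):
    # Needleman-Wunsch where each "symbol" is a whole alignment column
    la, lb = len(A[0]), len(B[0])
    F = [[0.0] * (lb + 1) for _ in range(la + 1)]
    for i in range(1, la + 1):
        F[i][0] = F[i - 1][0] + gap
    for j in range(1, lb + 1):
        F[0][j] = F[0][j - 1] + gap
    for i in range(1, la + 1):
        for j in range(1, lb + 1):
            s = col_score([r[i - 1] for r in A], [r[j - 1] for r in B])
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] + gap, F[i][j - 1] + gap)
    outA, outB, i, j = [''] * len(A), [''] * len(B), la, lb
    while i > 0 or j > 0:           # traceback, rebuilding gapped rows
        if i > 0 and j > 0 and abs(F[i][j] - (F[i - 1][j - 1] + col_score(
                [r[i - 1] for r in A], [r[j - 1] for r in B]))) < 1e-9:
            outA = [r[i - 1] + o for r, o in zip(A, outA)]
            outB = [r[j - 1] + o for r, o in zip(B, outB)]
            i, j = i - 1, j - 1
        elif i > 0 and abs(F[i][j] - (F[i - 1][j] + gap)) < 1e-9:
            outA = [r[i - 1] + o for r, o in zip(A, outA)]
            outB = ['-' + o for o in outB]
            i -= 1
        else:
            outA = ['-' + o for o in outA]
            outB = [r[j - 1] + o for r, o in zip(B, outB)]
            j -= 1
    return outA + outB

def progressive_msa(seqs):
    # 1. pairwise distances: 1 - identity over the pairwise alignment
    def dist(x, y):
        a, b = align_profiles([x], [y])
        return 1.0 - sum(p == q != '-' for p, q in zip(a, b)) / len(a)
    D = {(i, j): dist(seqs[i], seqs[j])
         for i, j in combinations(range(len(seqs)), 2)}
    # 2. agglomerate: repeatedly merge the two closest clusters
    clusters = [([i], [s]) for i, s in enumerate(seqs)]
    while len(clusters) > 1:
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda p: sum(D[tuple(sorted((i, j)))]
                                     for i in clusters[p[0]][0]
                                     for j in clusters[p[1]][0])
                   / (len(clusters[p[0]][0]) * len(clusters[p[1]][0])))
        merged = (clusters[a][0] + clusters[b][0],
                  align_profiles(clusters[a][1], clusters[b][1]))
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return clusters[0][1]
```

On a handful of short DNA strings this returns a set of equal-length gapped rows whose degapped contents are exactly the input sequences; the greediness discussed below is plainly visible, since each merge is frozen once made.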
In most packages, this algorithm is the Needleman-Wunsch algorithm or, more recently, the Viterbi algorithm (Durbin et al.). As simple as it may seem, the progressive alignment strategy affords many possible adjustments, the most notable ones being the tree computation algorithm, the sequence weighting method and the gap weighting scheme. In recent work (Wheeler and Kececioglu), the authors have shown that proper tuning of these various components can take a standard method up to the level of the most accurate ones.
ClustalW (Thompson et al.) remains the best-known progressive aligner. This is a bit paradoxical since its implementation of the progressive alignment significantly differs from the canonical one, in that it delays the incorporation of the most distantly related sequences until the second and unique iteration. This delaying procedure was incorporated in ClustalW in order to address the main drawback of the progressive alignment strategy: its greediness.
When progressing from the leaves toward the root, a progressive aligner ignores most of the information contained in the dataset, especially at the early stage.
Whenever mistakes are made in these initial alignments, they cannot be corrected and tend to propagate into the entire alignment, thus affecting the entire process. With a large number of sequences, the propagation and the resulting degradation can have extreme effects. This is a well-known problem, usually addressed via an iterative strategy. In an iterative scheme, groups of sequences are realigned a certain number of times, using either random splits or splits suggested by the guide tree.
The most sophisticated iterative strategies (incorporated in Muscle and PRRP; Gotoh) involve two nested iterative loops: an inner one in which the alignment is optimized with respect to a guide tree, and an outer one in which the current MSA is used to re-estimate the guide tree. The procedure keeps going until both the alignment and the guide tree converge. It was recently shown that these iterations almost always improve MSA accuracy (Wallace et al.).
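A random-split refinement loop of the kind described above can be sketched as follows. This is illustrative only: identity scoring, a linear gap penalty and a fixed number of rounds; real implementations such as Muscle use tree-guided splits and far more elaborate scoring.

```python
# Iterative refinement: split the rows randomly into two groups,
# realign the groups, and keep the result only if it improves the
# sum-of-pairs score.

import random
from itertools import combinations

def sp_score(msa, match=1, mismatch=0, gap=-1):
    # sum-of-pairs score of an alignment given as a list of gapped rows
    score = 0
    for a, b in combinations(msa, 2):
        for x, y in zip(a, b):
            if x == '-' and y == '-':
                continue
            score += gap if '-' in (x, y) else (match if x == y else mismatch)
    return score

def degap_columns(rows):
    # drop columns that are gaps in every row of this group
    keep = [c for c in range(len(rows[0])) if any(r[c] != '-' for r in rows)]
    return [''.join(r[c] for c in keep) for r in rows]

def col_score(col_a, col_b):
    hits = sum(1 for x in col_a for y in col_b if x == y != '-')
    return hits / (len(col_a) * len(col_b))

def align_profiles(A, B, gap=-1.0):
    # Needleman-Wunsch over whole columns (as in the progressive stage)
    la, lb = len(A[0]), len(B[0])
    F = [[0.0] * (lb + 1) for _ in range(la + 1)]
    for i in range(1, la + 1):
        F[i][0] = F[i - 1][0] + gap
    for j in range(1, lb + 1):
        F[0][j] = F[0][j - 1] + gap
    for i in range(1, la + 1):
        for j in range(1, lb + 1):
            s = col_score([r[i - 1] for r in A], [r[j - 1] for r in B])
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] + gap, F[i][j - 1] + gap)
    outA, outB, i, j = [''] * len(A), [''] * len(B), la, lb
    while i > 0 or j > 0:
        if i > 0 and j > 0 and abs(F[i][j] - (F[i - 1][j - 1] + col_score(
                [r[i - 1] for r in A], [r[j - 1] for r in B]))) < 1e-9:
            outA = [r[i - 1] + o for r, o in zip(A, outA)]
            outB = [r[j - 1] + o for r, o in zip(B, outB)]
            i, j = i - 1, j - 1
        elif i > 0 and abs(F[i][j] - (F[i - 1][j] + gap)) < 1e-9:
            outA = [r[i - 1] + o for r, o in zip(A, outA)]
            outB = ['-' + o for o in outB]
            i -= 1
        else:
            outA = ['-' + o for o in outA]
            outB = [r[j - 1] + o for r, o in zip(B, outB)]
            j -= 1
    return outA + outB

def refine(msa, rounds=30, seed=1):
    rng = random.Random(seed)
    best, best_s = msa, sp_score(msa)
    for _ in range(rounds):
        idx = list(range(len(best)))
        rng.shuffle(idx)
        cut = rng.randrange(1, len(best))
        g1 = degap_columns([best[i] for i in idx[:cut]])
        g2 = degap_columns([best[i] for i in idx[cut:]])
        cand = align_profiles(g1, g2)
        if sp_score(cand) > best_s:
            best, best_s = cand, sp_score(cand)
    return best
```

Starting from a deliberately shifted alignment of two identical sequences, the loop recovers the gap-free alignment, illustrating how realignment of random splits can undo an early mistake.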
The greediness of progressive aligners limits their accuracy, and even when using sophisticated iteration schemes, it can be very hard to correct mistakes committed early in the alignment process.
In theory, these mistakes could easily be avoided if all the information contained in the sequences was used simultaneously. Unfortunately, this goal is computationally unrealistic, a limitation that has prompted the development of consistency-based methods. In their vast majority, algorithms based on consistency are also greedy heuristics (with the exception of the maximum weight trace (MWT) problem formulation of Kececioglu), but even so, they have been designed to incorporate a larger fraction of the available information at a reasonable computational cost.
The use of consistency for improved alignment accuracy was originally described by Gotoh and later refined by Vingron and Argos. Kececioglu provided an exact solution to this problem, reformulated as an MWT problem. This exact approach is limited to small datasets but was further expanded by Morgenstern, who proposed the first heuristic to solve this problem for large instances, thanks to the concept of overlapping weights (Morgenstern et al.).
While the notions developed in these four approaches are not totally identical, they have in common the idea of evaluating pair-wise alignments through the comparison with a third sequence (i.e. checking whether the alignment of two sequences is supported by their respective alignments with a third one). In practice, Gotoh did not use consistency to construct alignments, but rather to evaluate them, and only considering three sequences. The consistency described by Vingron is very strict because it results from dot-matrix multiplications, therefore requiring strict triplet consistency in order to deliver an alignment.
The overlapping weights described by Morgenstern also involve considering the support given by an intermediate sequence to a pair-wise alignment, but in this context the goal is to help guide the incorporation of pair-wise segments into the final MSA. While the overlapping weights bear a strong resemblance to the most commonly used definition of consistency, it is important to point out that Morgenstern also uses the term consistency with a different meaning, to describe the compatibility of a pair of matched segments with the rest of a partly defined multiple sequence alignment.
The first combination of a consistency-based scoring scheme with the progressive alignment algorithm was later developed in the T-Coffee package (Notredame et al.). The main feature of a consistency-based algorithm is its scoring scheme, largely inspired by the Dialign overlapping weights. Regular scoring schemes are based on a substitution matrix, used to reward identities and penalize mismatches. In a consistency-based algorithm, the reward for aligning two residues is estimated from a collection of pair-wise residue alignments named the library.
Given the library, any pair of residues receives an alignment score equal to the number of times these two residues have been found aligned, either directly or indirectly through a third residue (Fig.). The indirect alignments are estimated by combining every possible pair of pair-wise alignments (i.e. if a residue of sequence A aligns with a residue of sequence C, which in turn aligns with a residue of sequence B, this triplet supports the alignment of the A and B residues).
Each observation can be weighted with a score reflecting the expected accuracy of the alignment on which the observation was made. In the original T-Coffee, the residue pairs contained in the library were generated using a global ClustalW and a local Lalign method applied to each pair of sequences.
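A toy version of the library and of the triplet extension can make the scoring scheme concrete. The data layout and the weights below are hypothetical; in T-Coffee the library is built from actual pairwise alignments and the weights from their estimated accuracy.

```python
# Toy consistency library: lib[x][i] maps residue (sequence x, position i)
# to, for every other sequence z, the positions it is aligned to and the
# weight of that observation.

lib = {
    "A": {0: {"B": {0: 1.0}, "C": {1: 0.8}}},
    "B": {0: {"A": {0: 1.0}, "C": {1: 0.9}}},
    "C": {1: {"A": {0: 0.8}, "B": {0: 0.9}}},
}

def extended_score(lib, x, i, y, j):
    # direct support for aligning residue (x, i) with (y, j) ...
    score = lib[x][i].get(y, {}).get(j, 0.0)
    # ... plus indirect support through every third sequence z: the pair
    # is supported when both residues align to the same position of z
    for z, hits in lib[x][i].items():
        if z == y:
            continue
        back = lib[y][j].get(z, {})
        for k, w1 in hits.items():
            if k in back:
                score += min(w1, back[k])
    return score
```

Here the pair (A, 0)/(B, 0) collects its direct weight of 1.0 plus 0.8 of indirect support through residue (C, 1), so consistent triplets raise the reward for aligning the two residues.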
At the time, the T-Coffee protocol resulted in a significant improvement over all alternative methods. This protocol was later brought into a probabilistic framework with the ProbCons package: each pair of sequences is compared using a pair-HMM, and posterior decoding is then used to identify the high-scoring residue pairs that are incorporated in the library, using their posterior probability as a weight.
The library is then used to score the alignment with the T-Coffee triplet extension. Interestingly, the improvement is usually considered to be a consequence of the probabilistic framework, when in fact it seems to result mostly from the use of a more appropriate gap penalty scheme at the pair-wise level.
For instance, Table 1 shows the effect of applying a regular (monophasic) gap penalty scheme compared with the bi-phasic gap penalty scheme that ProbCons uses by default.
This improvement has also been observed when incorporating the bi-phasic scheme into T-Coffee. The SPS score is the percentage of pairs of aligned positions from the reference alignment that were correctly recovered; the TC score is the percentage of columns from the reference alignment that were completely recovered. The performance of Opal is shown both using an advisor method for choosing input-dependent values for gap penalties for scoring alignments, and using input-independent default values.
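The SPS and TC scores, as defined above, can be computed directly from two alignments of the same sequences. The helper names below are made up; benchmark suites such as BaliBase ship their own scoring programs, and this sketch counts single-residue reference columns as trivially recovered.

```python
# SPS: fraction of reference residue pairs recovered in the test MSA.
# TC: fraction of reference columns whose residues share one test column.

from itertools import combinations

def residue_cols(msa):
    # per sequence: residue index -> column index
    out = []
    for row in msa:
        m, r = {}, 0
        for c, ch in enumerate(row):
            if ch != '-':
                m[r] = c
                r += 1
        out.append(m)
    return out

def column_members(msa):
    by_col = {}
    for s, m in enumerate(residue_cols(msa)):
        for r, c in m.items():
            by_col.setdefault(c, set()).add((s, r))
    return list(by_col.values())

def sps(ref, test):
    def pairs(msa):
        return {p for col in column_members(msa)
                for p in combinations(sorted(col), 2)}
    rp = pairs(ref)
    return len(rp & pairs(test)) / len(rp)

def tc(ref, test):
    tcols = residue_cols(test)
    cols = column_members(ref)
    hit = sum(1 for col in cols
              if len({tcols[s][r] for s, r in col}) == 1)
    return hit / len(cols)
```

An identical test alignment scores 1.0 on both measures; shifting one residue out of a column lowers SPS pair by pair, while TC drops as soon as any column is not reproduced in full, which is why TC is the stricter of the two.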
The highest accuracies for a suite are in bold. As these results show, Opal is in the class of state-of-the-art tools. Note that Opal achieves this without using alignment consistency (Notredame et al.). Opal has not yet been optimized for speed. Its running time is about two orders of magnitude slower than ClustalW and Muscle, and about the same order of magnitude as the slowest tool, T-Coffee.
Over all benchmarks, the median run time for Opal was less than 10 seconds, measured on an input of 20 sequences.

We have presented a careful study of methods for each stage of the form-and-polish strategy for multiple alignment.
This includes new methods for estimating distances, merging alignments, weighting pairs of sequences, polishing the alignment, and choosing parameters. Our new weighting method is easy to implement, fast to evaluate, and avoids the anomalies of current approaches, but under standard measures of benchmark recovery it has little effect on accuracy.
A new merging method that optimally aligns alignments yields only a small improvement over an approximate merging heuristic. The best method for a stage is generally the same across all suites of benchmarks, suggesting that what we have identified as best has not been overfitted to the data. The outcome of this study is a new alignment tool, Opal.
By combining the best methods, Opal attains accuracy on par with the state of the art (namely ProbCons and MAFFT), without altering the alignment scoring function by increasing gap penalties near hydrophobic regions, or through position-specific substitution scores based on alignment consistency.
Adding hydrophobicity and consistency should give even greater accuracy: incorporating hydrophobic gap penalties and alignment consistency into the exact algorithm for aligning alignments might boost recovery substantially. Properly assessing how weighted sum-of-pairs scoring affects alignment quality, by devising an unbiased recovery measure that takes overrepresentation into account, appears challenging.
Finally, our experiments with an oracle for choosing parameter values suggest large gains in recovery may be possible through an input-dependent choice of parameters. We also thank Eagu Kim for assistance in finding initial parameter values by inverse alignment.
Multiple alignment by aligning alignments. Travis J. Wheeler and John D. Kececioglu.

Abstract. Motivation: Multiple sequence alignment is a fundamental task in bioinformatics.
Let G be the current set of groups during the merging process, and d(a, b) the current distance between a pair a, b of groups. NJ merges the groups a and b that minimize

(|G| - 2) d(a, b) - Σ_{c in G} d(a, c) - Σ_{c in G} d(b, c).

[Table fragment: tree method, MST; distance method, normalized cost; merge method, exact.]

Each scheme assigns a weight w_ij to a pair i, j of leaves in T. Suppose that in a multiple alignment of the sequences, the score of the induced pairwise alignment of the sequences associated with leaves i and j is s_ij. The weighted sum-of-pairs score of the alignment using these weights is

Σ_{i<j} w_ij s_ij.
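The NJ merge criterion can be sketched and checked in a few lines. This is the standard Q criterion; note that for a quartet with two cherries both sister pairs tie on the criterion, and `min()` simply returns the first minimal pair in enumeration order.

```python
# Pick the pair of groups that neighbor-joining would merge next.

from itertools import combinations

def nj_pick(D, groups):
    # D maps a sorted pair of group names to the current distance
    def d(a, b):
        return D[tuple(sorted((a, b)))]
    n = len(groups)
    def q(a, b):
        return ((n - 2) * d(a, b)
                - sum(d(a, c) for c in groups if c != a)
                - sum(d(b, c) for c in groups if c != b))
    # min() returns the first minimal pair, so ties resolve to the
    # earliest pair in the enumeration order
    return min(combinations(groups, 2), key=lambda p: q(*p))
```

On a quartet where A, B and C, D form the two cherries, both cherry pairs minimize Q, and the function returns the first of them.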
One way to assign a weight w_ij to a pair i, j of leaves is to quantify the influence of one leaf on another on the basis of the shape of T and its edge lengths.
Then we can define a symmetric weight on a pair of leaves by the geometric mean of their influences:

w_ij = sqrt(influence_i(j) * influence_j(i)).

Imagine taking leaf i, making i the new root of T and letting T hang from root i. We denote this rerooted tree with root i by T_i. Starting from root i, we process T_i top-down, splitting the total mass of weight 1 in the above equation among the descendants of i.
The new root i has exactly one child (which was originally the parent of i in T), and this child receives the entire mass of weight 1 passed down from its parent i. In general, if a descendant x receives mass w_x from its parent, this mass is split among its two children y and z (possibly unequally) so that w_y + w_z = w_x.
For nodes v and w of T, denote the path in T between v and w by P_vw. For a node v, let T_u(v) be the subtree of T_u rooted at node v, together with the path P_uv. The total size of T_u(v) is.

[Table fragment: weighting method, uniform.]

For edge e, let P_e be the set of pairs of sequences that are on opposite sides of the partition given by cutting e, let c_ij be the cost of the pairwise alignment induced on sequences i and j in the current multiple alignment, and let d_ij be the cost of their optimal pairwise alignment.
We use the potential

Σ_{(i, j) in P_e} (c_ij - d_ij).

[Table fragment: polishing method.]

We then enumerated a range of penalties around this seed, including variants with reduced terminal gap costs.
In total, we examined a broad range of parameter choices. From this set of parameters we selected as our default the choice that had the highest recovery over all three suites of benchmarks. This advisor selects its parameter choice based on core columns. Below, we compare the hypothetical accuracy of the oracle to that achieved using default parameters and the advisor.
A smaller set of four parameter choices was identified by selecting the subset of size four that gave the best recovery under the advisor. Parameter method. Oracle on Baseline Table 1. MAFFT Google Scholar Crossref. Search ADS. PALI: a database of alignments and phylogeny of homologous protein structures.
A novel randomized iterative strategy for aligning multiple protein sequences.
Optimal alignment between groups of sequences and its application to multiple sequence alignment.
Further improvement in methods of group-to-group sequence alignment with generalized profile operations.
A weighting system and algorithm for aligning many phylogenetically related sequences.
Comprehensive study on iterative algorithms of multiple sequence alignment.
SCOP: a structural classification of proteins database for the investigation of sequences and structures.
The neighbor-joining method: a new method for reconstructing phylogenetic trees.
CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice.
Van Walle: Align-m: a new algorithm for multiple alignment of highly divergent sequences.
Sequence alignment and penalty choice: review of concepts, case studies, and implications.
The solution of the above partial difference equation is given by. This relation seems to be new in this form. Here, the generalized hypergeometric series is defined as

pFq(a_1, ..., a_p; b_1, ..., b_q; z) = Σ_{k=0}^∞ [(a_1)_k ... (a_p)_k] / [(b_1)_k ... (b_q)_k] * z^k / k!,

where (a)_k denotes the Pochhammer symbol (rising factorial).
If one of the parameters a_j equals a negative integer, then the sum becomes a terminating series. In this case, the recurrence relation for the h_{n,m} coefficients is [11]. Therefore, the generating function [17, 18] is. As indicated before, the main aim of this paper is to give an explicit representation in this case.
The recurrence relation for the g_{n,m} coefficients is [11]. Thus, the generating function [17, 18] is. This last inequality is relevant since 10^80 is an estimate of the number of protons in our universe [13].
We conclude also that our approach gives an explicit formula, filling a gap in the theory of sequence alignment. It may also be used, in the future, to derive explicit formulas and compute the number of total, reduced, and effective alignments for multiple sequences. We would like to mention that our approach is remarkably fast.
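The recurrences themselves are elided in this extract, but a classical special case gives the flavor: the number of global alignments of two sequences of lengths n and m (allowing insertions, deletions and substitutions) satisfies the Delannoy recurrence, and memoization makes its computation fast, in line with the speed claim above.

```python
# Count global alignments of two sequences via the Delannoy recurrence:
# f(n, m) = f(n-1, m) + f(n, m-1) + f(n-1, m-1), f(n, 0) = f(0, m) = 1.

from functools import lru_cache

@lru_cache(maxsize=None)
def alignments(n, m):
    if n == 0 or m == 0:
        return 1
    return (alignments(n - 1, m) + alignments(n, m - 1)
            + alignments(n - 1, m - 1))
```

Already for two sequences of length 3 there are 63 alignments, and the central values grow roughly like (3 + 2*sqrt(2))^n, which is why explicit formulas and fast evaluation schemes matter.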
J Supercomput.
Zhong C, Zhang S: Efficient alignment of RNA secondary structures using sparse dynamic programming.
BMC Bioinformatics.
Chaisson M, Tesler G: Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): theory and application.
Evolutionary Genomics. Methods in Molecular Biology.
Edited by: Anisimova M.
Lesk AM: Introduction to Bioinformatics.
Appl Math Lett.
Eger S: Sequence alignment with arbitrary steps and further generalizations, with applications to alignments in linguistics. Inform Sci.
Morgenstern B: A simple and space-efficient fragment-chaining algorithm for alignment of DNA and protein sequences.
Ellis Horwood Series: Mathematics and its Applications.
Wilf HS: Generatingfunctionology.