Alfree is a web application for real-time phylogeny reconstruction using alignment-free sequence comparison methods. In contrast to alignment-producing programs, these methods calculate distances between sequences by discovering various patterns and properties in unaligned sequences. The service integrates 38 popular alignment-free approaches and offers functionalities such as on-the-fly tree construction, visualization (e.g. of distance matrices and phylogenies) and the creation of consensus trees. Alfree is also freely available for download as a stand-alone tool (alfpy) to use on your own computer, and it can work as a Python package that is easily integrated into existing applications.
Alfree is fast-growing software in terms of the alignment-free methods it handles. At present, it includes 27 word-based distance methods, 9 information theory-based methods (they compute the amount of information that is stored in sequences) and 2 hybrid methods (they combine different approaches, such as Kullback-Leibler divergence or W-metric). You can find the full list of methods implemented in Alfree by clicking 'METHODS' in the menu.
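To give a feel for what a word-based method does, here is a minimal sketch written for Python 3 using only the standard library. It is not the Alfree/alfpy implementation: the function names `kmer_counts` and `euclidean_word_distance`, the default word length `k=2`, and the choice of Euclidean distance on relative word frequencies are assumptions made for illustration only.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq, k):
    """Count overlapping words (k-mers) of length k in a sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def euclidean_word_distance(seq_a, seq_b, k=2):
    """Euclidean distance between relative k-mer frequencies (illustrative only)."""
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    # Guard against sequences shorter than k, which yield no words at all.
    total_a = sum(counts_a.values()) or 1
    total_b = sum(counts_b.values()) or 1
    words = set(counts_a) | set(counts_b)
    # A Counter returns 0 for words that are absent from a sequence.
    return sqrt(sum((counts_a[w] / total_a - counts_b[w] / total_b) ** 2
                    for w in words))
```

Note that no alignment is computed at any point: each sequence is reduced to a vector of word frequencies, and only those vectors are compared.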
You only need to provide a set of nucleotide or protein sequences of interest (plain or FASTA format). Optionally, you can specify the methods you want to run (by default, Alfree selects the most accurate methods). Once the sequences have been successfully submitted, the web server returns the results directly. These results contain:
consensus phylogenetic tree (agreement between trees generated by different methods)
trees generated from each method
distance matrices presented as heat maps
all-versus-all pairwise sequence distances in a searchable and sortable table.
Alignment methods have been widely used to establish evolutionary relationships among genomic sequences. However, these approaches have several drawbacks, which can be compensated for by the alignment-free methods implemented in Alfree.
Alignment scores are particularly useful when sequences are known to be closely homologous, since the more conserved regions are automatically detected. However, for remote homologues this approach tends to fail: in the case of protein sequences, for example, the alignment becomes totally unreliable as the protein similarity enters the "twilight zone", i.e. the subspace of sequences sharing not more than 20% sequence identity. Our benchmark showed that alignment-free methods are as good as or better than an alignment algorithm (i.e. Smith-Waterman) when sequence similarity is small, such as for recognition of fold or class relationships.
Alignment-producing programs assume that homologous sequences comprise a series of linearly arranged, more or less conserved sequence stretches. However, this assumption is often violated in the real world. Viral genomes are a good example: they exhibit great variation in the number and order of genetic elements. Consequently, the alignment approach overlooks well-documented long-range interactions and the general fluidity that results from recombination, in which conserved segments are shuffled without loss of function.
Alignment-based approaches are generally memory- and time-consuming and are thus of limited use with multi-genome-scale sequence data. The computational complexity increases exponentially with sequence size. Despite the wealth of tools and more than 15 years of research, the problem of long sequence alignment is not yet fully resolved. Because the available models of sequence evolution do not directly apply to complete genomes, the existing programs use only partial information (e.g. gene order, break points, and segment copying).
The computation of an accurate multiple sequence alignment (MSA) is an NP-hard problem, which means that it cannot be solved in realistic time; this explains why over 100 alternative, faster methods have been developed over the past three decades. However, the speed optimization does not come without ‘cost’, because these techniques rely on various shortcuts (heuristics) that do not guarantee identification of the optimal, highest-scoring alignment and often result in inaccuracies that limit the quality of many downstream analyses (e.g. phylogenetic).
Sequence alignment depends on multiple a priori assumptions about the evolution of the sequences being compared. The parameters involved (e.g. substitution matrices, gap penalties, threshold values for statistical parameters) are somewhat arbitrary. What is worse, the scoring system is not consistent between applications, and there are many reports showing that small changes in input parameters can greatly affect the alignment. Despite awareness of the problem, it is still not understood how to choose alignment parameters rationally. To make matters worse, the reference substitution matrices are often used without verifying whether they are representative of the sequences being aligned. Intriguingly, BLOSUM matrices, the most commonly used substitution matrix series for protein sequence alignments, were found to have been miscalculated years ago, yet they produce significantly better alignments than their corrected modern version, RBLOSUM; this paradox still remains a mystery.
Alfree accepts nucleotide or protein sequences in FASTA format. The online version of Alfree lets you compare no more than 50 sequences, and the sum of the sequence lengths cannot exceed 200,000 nucleotides/amino acids. However, there are no such limits in the stand-alone Alfpy program: you can even compare whole eukaryotic genomes.
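If you want to check a FASTA file against the online limits before submitting it, a small pre-flight check can be sketched as follows. This is an illustration only and not part of Alfree or alfpy: the names `parse_fasta`, `within_online_limits`, `MAX_SEQUENCES` and `MAX_TOTAL_LENGTH` are hypothetical, and the parser handles only simple, well-formed FASTA text.

```python
MAX_SEQUENCES = 50          # online limit on the number of sequences
MAX_TOTAL_LENGTH = 200000   # online limit on the summed sequence length


def parse_fasta(text):
    """Return a list of (identifier, sequence) pairs from FASTA-formatted text."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                records.append((name, "".join(chunks)))
            header = line[1:].strip()
            # Use the first whitespace-delimited token as the identifier.
            name = header.split()[0] if header else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def within_online_limits(records):
    """True if the records respect both online limits quoted above."""
    total_length = sum(len(seq) for _, seq in records)
    return len(records) <= MAX_SEQUENCES and total_length <= MAX_TOTAL_LENGTH
```

A data set that fails this check is a candidate for the stand-alone program instead.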
On the main results screen, you will see a consensus tree, computed as an agreement among the different methods included in the analysis. The numbers at the tree's nodes range from 0 to 1 and tell you how well a given node is supported by the other methods. You can browse the trees generated by individual methods using the right sidebar. All tree visualizations are interactive: for example, you can search nodes or change the layout of the tree from phylogram to cladogram or from vertical to radial.
By clicking the 'Distance' tab, you can browse the distance measures between sequences in graphical form (as an interactive heat map) or in textual form (as a searchable and sortable table).
Once your sequences have been submitted, Alfree calculates all-versus-all pairwise distances among them. The distances for all sequence pairs make up a distance matrix. Once the distance matrix is obtained, tree construction proceeds in the standard way: the sequences are hierarchically clustered by providing this distance matrix as input to the neighbor-joining program (neighbor/fneighbor) of the PHYLIP/EMBOSS software package (neighbor-joining algorithm with default parameters, branch lengths not specified).
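The first half of this pipeline, from sequences to a distance matrix file, can be sketched as below. This is an assumption-laden illustration for Python 3, not Alfree's code: `composition_distance` is a deliberately simple placeholder distance (it assumes non-empty sequences), and `distance_matrix` and `to_phylip` are hypothetical helper names. The output follows the square PHYLIP distance matrix layout, with the number of sequences on the first line and names padded to 10 characters.

```python
def composition_distance(seq_a, seq_b):
    """Placeholder distance: Manhattan distance between residue frequencies."""
    letters = set(seq_a) | set(seq_b)
    return sum(abs(seq_a.count(c) / len(seq_a) - seq_b.count(c) / len(seq_b))
               for c in letters)


def distance_matrix(records, distance):
    """Compute a symmetric all-versus-all matrix from a pairwise distance function."""
    n = len(records)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Each pair is computed once and mirrored across the diagonal.
            d = distance(records[i][1], records[j][1])
            matrix[i][j] = matrix[j][i] = d
    return matrix


def to_phylip(names, matrix):
    """Format a square distance matrix in PHYLIP layout."""
    lines = ["%5d" % len(names)]
    for name, row in zip(names, matrix):
        # Names are truncated or padded to exactly 10 characters.
        lines.append(name[:10].ljust(10) + " ".join("%.6f" % d for d in row))
    return "\n".join(lines) + "\n"
```

Any pairwise distance function with the same two-argument signature can be passed to `distance_matrix` in place of the placeholder.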
The consensus tree is calculated using the fconsense program (EMBOSS package) with the majority-rule consensus tree method.
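The idea behind majority-rule consensus can be illustrated with a short Python 3 sketch. It is a simplification and not how fconsense is implemented: each tree is assumed to be already reduced to a set of clades (each clade a `frozenset` of leaf names), and the function names `clade_support` and `majority_rule_clades` are hypothetical. A clade is kept if it occurs in more than half of the input trees, and its support is the fraction of trees that contain it.

```python
from collections import Counter


def clade_support(trees):
    """Fraction of input trees that contain each clade."""
    # set(tree) ensures a clade is counted at most once per tree.
    counts = Counter(clade for tree in trees for clade in set(tree))
    return {clade: counts[clade] / len(trees) for clade in counts}


def majority_rule_clades(trees, threshold=0.5):
    """Keep the clades found in more than `threshold` of the trees."""
    return {clade: support
            for clade, support in clade_support(trees).items()
            if support > threshold}


# Three toy trees from three hypothetical methods, written as clade sets.
tree_1 = {frozenset("AB"), frozenset("ABC")}
tree_2 = {frozenset("AB"), frozenset("ABD")}
tree_3 = {frozenset("AC"), frozenset("ABC")}
consensus = majority_rule_clades([tree_1, tree_2, tree_3])
```

In this toy input, the clades {A, B} and {A, B, C} each appear in two of the three trees and are kept, while {A, B, D} and {A, C} appear only once and are dropped.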
The online version of Alfree allows analyses of no more than 50 sequences. In addition, the sum of the user's sequence lengths cannot exceed 200,000 nucleotides/amino acids. If you work with larger sets of sequences, we recommend using the stand-alone version of Alfree, which has no such limits.
You need to have Python installed, preferably version 2.7 or higher (available from http://www.python.org/), and the popular NumPy package, a fundamental package for scientific computing with Python (available from http://numpy.scipy.org/). To see how to install Alfpy, see 'DOWNLOAD' > 'SOFTWARE'.