The performance of 33 alignment-free measures (methods list: ) was assessed by calculating areas under the ROC curves (AUC) with reference to SCOPe database (Structural Classification Of Proteins) (1). The implementations of the alignment-free methods used in this study are available as a Python package. The accuracy and speed of alignment-free measures were also tested against Smith-Waterman algorithm implemented in water program (from EMBOSS package).
SCOP Database Overview
The SCOP database consists of Protein Data Bank (PDB) entries and provides a detailed and reliable description of protein structure relationships. The three-dimensional (3D) structure analysis allows the detection of more remote homologous, since
structure is typically more conserved than sequence.
SCOP hierarchically classifies proteins at four structural levels:
Clear evolutionarily relationship
Most probable common evolutionary origin
Major structural similarity
Overall structural similarity
This hierarchical description of proteins allows the evalutaion of each method for different levels of similarity.
Materials & Methods
Reference protein dataset
The reference protein dataset used for evaluting the accuracy of each sequence comparison method was constructed based on the SCOP database 2.06, called ASTRAL 2.06. This data set includes all SCOP sequences that share less than 40% identity to each other. This dataset has become a one of the most reliable benchmark test sets in the evaluation of different methods to detect remote protein homologies.
Number of sequences: proteins (pr), families (fa), superfamilies (sf) and folds (cf)
You can download the Reference Protein Datasets from Download > Benchmark Data Sets.
The 33 alignment-free methods were used to calculate the dissimilarity/distance measures between all 21,750,310 possible combination pairs of 6,596 proteins from the reference dataset. Different combinations of input parameters were used to run word-based methods (i.e. word size from 1 to 4; protein alphabet consisted of either the original 20 amino acids or 11 reduced alphabet: different vector types such as count occurrences, frequencies, weighted counts, weighted frequencies, standardized frequencies with equal/equilibrium amino acid frequencies) and W-metric (different substitution matrices: PAM120-PAM250, BLOSUM30-70). In total, 529 independent runs of alignment-free calculations were performed using the reference dataset as query. Smith-Waterman algorithm (water) was run using score matrix BLOSUM50 and default parameters for gap scoring (i.e. gap open: 10, gap extend: 0.5).
For each alignment-free method run, the distances between all proteins pairs were subsequently sorted, from maximum to minimum similarity. The comparative test procedure was based on a binary classification of each protein pair, where 1 corresponds to the two proteins sharing the same group in SCOP database, 0 otherwise. Since the group can be defined at any one of the four different levels of the database (family, superfamily, fold, class), each protein pair was associated to four binary classifications (one for each level). The similarity measure for alignment method was based on the Smith–Waterman score, with no correction for statistical significance.
R package, ROCR was used to obtain ROC curves and AUC values for the Smith-Waterman run and all alignment-free methods (at each SCOP level).
ROC curves for the reference dataset.
These ROC curves were obtained for each SCOP level (i.e. class, fold, superfamily, family). Mouse-over ROC to see the Area Under the Curve (AUC).
As expected, AUC decreases from family to class level. The sequence similarity between proteins sharing the same family is still well recognized. Consequently, most distances achieve their best discrimination accuracy at this level.
Suprisingly, the alignment-based method (Smith-Waterman) is outperformed at all levels (AUCs: 0.62, 0.67, 0.78, 0.81) by two alignment-free measures, Normalized Google Distance (AUCs: 0.63, 0.78, 0.80, 0.84) and Bray-Curtis (AUCs: 0.63, 0.77, 0.80, 0.84).
Three other word-based methods, including two variants of Squared Euclidean Distance (d Eseq1 and d Eseq2) as well as the Canberra distance (dCanberra), though less accurate in recognition of relationships within class, obtained higher overall scores (AUC: 0.744, 0.733 and 0.725, respectively) than S-W (AUC: 0.72).
At family level, the Smith-Waterman algorithm was beaten by a modified Euclidean distance (d Eseq1).
At superfamily level, SW is third the most accurate sequence comparison method.
At fold level, 14 alignment-free methods perform better than alignment scores, including:
13 word-based metrics
all 4 variants of Euclidean distance, Manhattan distance, Minkowski distance,
Canberra distance, Chebyshev distance, 3 Angle metrics Normalized Google Distance and
1 (information teory)-based measure: d_star (based on Lempel-Ziv complexity)
At the class level, the Smith-Waterman is outperformed by 9 word-based alignment-free methods.
Word-based methods achieve, on average, slight better performance (AUC: 0.67 ±0.04) than IT-based solutions (AUC: 0.61 ±0.06). A half of IT-based measures (i.e. two variants Lempel-Ziv measure dLZ and dLZ1, Normalized Compression Distance and Base-base correlation) are located at the end of the ranking list. The best performing IT-based method is a variant of Lempel-Ziv distance (dLZ*), which is preceded by 5 word-based measures and Smith-Waterman algorithm.
Ranking of overall accuracy
In order to assess the overall acurracy of different metrics, we averaged AUC values from different SCOP levels (i.e. class, fold, superfamily, family).
Analysis by class
We further compared metrics accuracies at each of the four hierachical SCOP levels (i.e. class, fold, superfamily, family) separetely.
Full detailed results
Here we list the AUC values of all 33 in combination of different values of parameters. For example, for word-bases alignment-free methods these parameters include: k-mer length (from 1 to 4 amino acids) and vector type (e.g. counts, frequencies, weighted counts etc.).
In this section, we present time measurements of each alignment-free method as well as Smith-Waterman algorithm implemented in EMBOSS package as water program. We measure duration time in calcualtion of 21,750,310 pairwise SCOP protein comparisons. The measurements were conducted on a 64-bit 3.2-GHz x86-compatible Intel processor.
The Smith-Waterman (SW) algorithm is computationally intensive, it takes 3 days to complete the task. Its running times can be 1000-fold longer than that of the other metrics here compared.