The performance of 33 alignment-free measures (methods list: ) was assessed by calculating areas under the ROC curves (AUC) with reference to SCOPe database (Structural Classification Of Proteins) (1). The implementations of the alignment-free methods used in this study are available as a Python package. The accuracy and speed of alignment-free measures were also tested against Smith-Waterman algorithm implemented in water program (from EMBOSS package).
The SCOP database consists of Protein Data Bank (PDB) entries and provides a detailed and reliable description of protein structure relationships. The three-dimensional (3D) structure analysis allows the detection of more remote homologous, since structure is typically more conserved than sequence.
SCOP hierarchically classifies proteins at four structural levels:
1 | family (fa) | Clear evolutionarily relationship | |
2 | superfamily (sf) | Most probable common evolutionary origin | |
3 | fold (cf) | Major structural similarity | |
4 | class (cl) | Overall structural similarity |
This hierarchical description of proteins allows the evalutaion of each method for different levels of similarity.
The reference protein dataset used for evaluting the accuracy of each sequence comparison method was constructed based on the SCOP database 2.06, called ASTRAL 2.06. This data set includes all SCOP sequences that share less than 40% identity to each other. This dataset has become a one of the most reliable benchmark test sets in the evaluation of different methods to detect remote protein homologies.
Similarily to the study of Vinga et al. Bioinformatics. 2004, we trimmed original ASTRAL dataset to:
Db | α | β | α/β | α + β | total | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
pr | fa | sf | cf | pr | fa | sf | cf | pr | fa | sf | cf | pr | fa | sf | cf | ||
astral40 | 2,439 | 984 | 513 | 289 | 2,795 | 890 | 365 | 177 | 3,970 | 939 | 246 | 148 | 3,346 | 1,217 | 561 | 385 | 12,550 |
alfree40 | 1,072 | 98 | 57 | 46 | 1,469 | 112 | 62 | 44 | 2,478 | 159 | 75 | 59 | 1,577 | 144 | 88 | 70 | 6,596 |
The 33 alignment-free methods were used to calculate the dissimilarity/distance measures between all 21,750,310 possible combination pairs of 6,596 proteins from the reference dataset. Different combinations of input parameters were used to run word-based methods (i.e. word size from 1 to 4; protein alphabet consisted of either the original 20 amino acids or 11 reduced alphabet: different vector types such as count occurrences, frequencies, weighted counts, weighted frequencies, standardized frequencies with equal/equilibrium amino acid frequencies) and W-metric (different substitution matrices: PAM120-PAM250, BLOSUM30-70). In total, 529 independent runs of alignment-free calculations were performed using the reference dataset as query. Smith-Waterman algorithm (water) was run using score matrix BLOSUM50 and default parameters for gap scoring (i.e. gap open: 10, gap extend: 0.5).
Similarily to the study by Vinga et al. (2004):
For each alignment-free method run, the distances between all proteins pairs were subsequently sorted, from maximum to minimum similarity. The comparative test procedure was based on a binary classification of each protein pair, where 1 corresponds to the two proteins sharing the same group in SCOP database, 0 otherwise. Since the group can be defined at any one of the four different levels of the database (family, superfamily, fold, class), each protein pair was associated to four binary classifications (one for each level). The similarity measure for alignment method was based on the Smith–Waterman score, with no correction for statistical significance.
R package, ROCR was used to obtain ROC curves and AUC values for the Smith-Waterman run and all alignment-free methods (at each SCOP level).
These ROC curves were obtained for each SCOP level (i.e. class, fold, superfamily, family). Mouse-over ROC to see the Area Under the Curve (AUC).
In order to assess the overall acurracy of different metrics, we averaged AUC values from different SCOP levels (i.e. class, fold, superfamily, family).
We further compared metrics accuracies at each of the four hierachical SCOP levels (i.e. class, fold, superfamily, family) separetely.
Here we list the AUC values of all 33 in combination of different values of parameters. For example, for word-bases alignment-free methods these parameters include: k-mer length (from 1 to 4 amino acids) and vector type (e.g. counts, frequencies, weighted counts etc.).
In this section, we present time measurements of each alignment-free method as well as Smith-Waterman algorithm implemented in EMBOSS package as water program. We measure duration time in calcualtion of 21,750,310 pairwise SCOP protein comparisons. The measurements were conducted on a 64-bit 3.2-GHz x86-compatible Intel processor.
The Smith-Waterman (SW) algorithm is computationally intensive, it takes 3 days to complete the task. Its running times can be 1000-fold longer than that of the other metrics here compared.