Alfree | Benchmark

Benchmark Overview

The performance of 33 alignment-free measures (methods list: ) was assessed by calculating areas under the ROC curves (AUC) with reference to SCOPe database (Structural Classification Of Proteins) (1). The implementations of the alignment-free methods used in this study are available as a Python package. The accuracy and speed of alignment-free measures were also tested against Smith-Waterman algorithm implemented in water program (from EMBOSS package).

SCOP Database Overview

The SCOP database consists of Protein Data Bank (PDB) entries and provides a detailed and reliable description of protein structure relationships. The three-dimensional (3D) structure analysis allows the detection of more remote homologous, since structure is typically more conserved than sequence.

SCOP hierarchically classifies proteins at four structural levels:

1	family (fa)	Clear evolutionarily relationship
2	superfamily (sf)	Most probable common evolutionary origin
3	fold (cf)	Major structural similarity
4	class (cl)	Overall structural similarity

This hierarchical description of proteins allows the evalutaion of each method for different levels of similarity.

Materials & Methods

Reference protein dataset

The reference protein dataset used for evaluting the accuracy of each sequence comparison method was constructed based on the SCOP database 2.06, called ASTRAL 2.06. This data set includes all SCOP sequences that share less than 40% identity to each other. This dataset has become a one of the most reliable benchmark test sets in the evaluation of different methods to detect remote protein homologies.

Similarily to the study of Vinga et al. Bioinformatics. 2004, we trimmed original ASTRAL dataset to:

exclude sequences with unknown amino acids
exclude families with less than 5 proteins
include only the four major classes
- α class - constituted mainly by proteins with α helix
- β class - essentially formed by β-sheet structures
- α/β class - proteins with mixtures of α-helices and β-strands
- α + β class - those where α-helices and β-strands are largely segregated

Protein dataset used in this study:

Db	α				β				α/β				α + β				total
Db	pr	fa	sf	cf	pr	fa	sf	cf	pr	fa	sf	cf	pr	fa	sf	cf	total
astral40	2,439	984	513	289	2,795	890	365	177	3,970	939	246	148	3,346	1,217	561	385	12,550
alfree40	1,072	98	57	46	1,469	112	62	44	2,478	159	75	59	1,577	144	88	70	6,596

Number of sequences: proteins (pr), families (fa), superfamilies (sf) and folds (cf)

You can download the Reference Protein Datasets from Download > Benchmark Data Sets .

Benchmark methods

The 33 alignment-free methods were used to calculate the dissimilarity/distance measures between all 21,750,310 possible combination pairs of 6,596 proteins from the reference dataset. Different combinations of input parameters were used to run word-based methods (i.e. word size from 1 to 4; protein alphabet consisted of either the original 20 amino acids or 11 reduced alphabet: different vector types such as count occurrences, frequencies, weighted counts, weighted frequencies, standardized frequencies with equal/equilibrium amino acid frequencies) and W-metric (different substitution matrices: PAM120-PAM250, BLOSUM30-70). In total, 529 independent runs of alignment-free calculations were performed using the reference dataset as query. Smith-Waterman algorithm (water) was run using score matrix BLOSUM50 and default parameters for gap scoring (i.e. gap open: 10, gap extend: 0.5).

Similarily to the study by Vinga et al. (2004):

For each alignment-free method run, the distances between all proteins pairs were subsequently sorted, from maximum to minimum similarity. The comparative test procedure was based on a binary classification of each protein pair, where 1 corresponds to the two proteins sharing the same group in SCOP database, 0 otherwise. Since the group can be defined at any one of the four different levels of the database (family, superfamily, fold, class), each protein pair was associated to four binary classifications (one for each level). The similarity measure for alignment method was based on the Smith–Waterman score, with no correction for statistical significance.

R package, ROCR was used to obtain ROC curves and AUC values for the Smith-Waterman run and all alignment-free methods (at each SCOP level).

Results

ROC curves for the reference dataset.

These ROC curves were obtained for each SCOP level (i.e. class, fold, superfamily, family). Mouse-over ROC to see the Area Under the Curve (AUC).

Conclusions:

As expected, AUC decreases from family to class level. The sequence similarity between proteins sharing the same family is still well recognized. Consequently, most distances achieve their best discrimination accuracy at this level.
Suprisingly, the alignment-based method (Smith-Waterman) is outperformed at all levels (AUCs: 0.62, 0.67, 0.78, 0.81) by two alignment-free measures, Normalized Google Distance (AUCs: 0.63, 0.78, 0.80, 0.84) and Bray-Curtis (AUCs: 0.63, 0.77, 0.80, 0.84).
Three other word-based methods, including two variants of Squared Euclidean Distance (d ^Eseq1 and d ^Eseq2) as well as the Canberra distance (d^Canberra), though less accurate in recognition of relationships within class, obtained higher overall scores (AUC: 0.744, 0.733 and 0.725, respectively) than S-W (AUC: 0.72).
At family level, the Smith-Waterman algorithm was beaten by a modified Euclidean distance (d ^Eseq1).
At superfamily level, SW is third the most accurate sequence comparison method.
At fold level, 14 alignment-free methods perform better than alignment scores, including:
- 13 word-based metrics
  - all 4 variants of Euclidean distance, Manhattan distance, Minkowski distance, Canberra distance, Chebyshev distance, 3 Angle metrics Normalized Google Distance and Bray-Curtis distance.
- 1 (information teory)-based measure: d_star (based on Lempel-Ziv complexity)
At the class level, the Smith-Waterman is outperformed by 9 word-based alignment-free methods.
Word-based methods achieve, on average, slight better performance (AUC: 0.67 ±0.04) than IT-based solutions (AUC: 0.61 ±0.06). A half of IT-based measures (i.e. two variants Lempel-Ziv measure d^LZ and d^LZ₁, Normalized Compression Distance and Base-base correlation) are located at the end of the ranking list. The best performing IT-based method is a variant of Lempel-Ziv distance (d^LZ_*), which is preceded by 5 word-based measures and Smith-Waterman algorithm.

Ranking of overall accuracy

In order to assess the overall acurracy of different metrics, we averaged AUC values from different SCOP levels (i.e. class, fold, superfamily, family).

Analysis by class

We further compared metrics accuracies at each of the four hierachical SCOP levels (i.e. class, fold, superfamily, family) separetely.

Full detailed results

Here we list the AUC values of all 33 in combination of different values of parameters. For example, for word-bases alignment-free methods these parameters include: k-mer length (from 1 to 4 amino acids) and vector type (e.g. counts, frequencies, weighted counts etc.).

Time

In this section, we present time measurements of each alignment-free method as well as Smith-Waterman algorithm implemented in EMBOSS package as water program. We measure duration time in calcualtion of 21,750,310 pairwise SCOP protein comparisons. The measurements were conducted on a 64-bit 3.2-GHz x86-compatible Intel processor.

The Smith-Waterman (SW) algorithm is computationally intensive, it takes 3 days to complete the task. Its running times can be 1000-fold longer than that of the other metrics here compared.

INTRODUCTION

Materials & Methods

RESULTS

Benchmark Overview

SCOP Database Overview

Materials & Methods

Reference protein dataset

Protein dataset used in this study:

Benchmark methods

Results

ROC curves for the reference dataset.

Conclusions:

Ranking of overall accuracy

Analysis by class

Full detailed results

Time