ORCAN | FAQ

Frequently Asked Questions (FAQ)

General

What is ORCAN?

ORCAN is a meta web server for discovering orthologs of protein sequences. The server links together results from 14 different bioinformatics tools and presents a list of predicted orthologous assignments along with their overall confidence scores (from 1 to 10).

How does ORCAN work?

The server queries 5 high-quality orthology databases (eggNOG, OMA, OrthoDB, HOGENOM, InParanoid 8) and runs 4 popular orthology prediction tools (InParanoid 4.1, OrthoMCL, RBH, RSD) against the latest version of UniProt Reference Proteomes.

Then, the server exploits 5 additional comparisons between query and orthologous sequences:

pairwise sequence alignments (using needle and water provided by EMBOSS)
protein domain architecture annotation (using HMMER3 against Pfam database)
functional motif detection (using the PROSITE scanner)
Gene Ontology Terms association (retrieved from Gene Ontology Consortium)
literature comparison (information available in NCBI PubMed article headings)

In the end, ORCAN assesses the orthologous assignments using a plurality-based rating system with scores ranging from 1 to 10.

How does ORCAN combine results obtained from external tools?

The orthology prediction tools and databases integrated in ORCAN often provide different orthologous assignments for a given query sequence, and some resources may lack any prediction for the user's query. So how does ORCAN behave in such cases? First, ORCAN collects all unique orthologous assignments returned by the prediction tools and databases. Then, for each orthologous assignment, ORCAN counts the number of tools supporting this prediction and presents the overall rating score.

You can see how it all works by running the 'Example 2' demonstration on the ORCAN submission page. In this example, 4 orthology prediction tools and 5 orthology databases predict 3 potential orthologs. ORCAN presents a list of all predictions along with their overall confidence level.
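
The collect-and-count step can be sketched in a few lines of Python. This is only an illustration under assumptions, not ORCAN's actual code: the function name count_support and the identifiers P1 to P3 are hypothetical, and each resource is assumed to return a plain set of ortholog identifiers (empty when it has no prediction).

```python
from collections import Counter


def count_support(predictions):
    """Count how many resources support each orthologous assignment.

    predictions maps a resource name to the set of ortholog IDs it
    returned (an empty set if the resource has no prediction).
    """
    support = Counter()
    for orthologs in predictions.values():
        # set() guards against a resource listing the same ID twice.
        support.update(set(orthologs))
    return support


# Hypothetical results for one query; the IDs are placeholders.
predictions = {
    "InParanoid": {"P1"},
    "OrthoMCL": {"P1", "P2"},
    "RBH": {"P1"},
    "RSD": set(),
    "eggNOG": {"P2", "P3"},
}

support = count_support(predictions)
for ortholog, votes in support.most_common():
    print(ortholog, votes)
```

A resource with no prediction simply contributes no votes, so it neither supports nor blocks any assignment in this sketch.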

What does the rating system tell me?

The rating system assigns a score (from 1 to 10) to each orthology prediction. This score is based on the level of consistency of the given prediction across the various databases and tools. For each orthologous assignment ORCAN returns the overall score by considering seven different features:

orthology predictions (InParanoid, OrthoMCL, RBH, RSD)
pairwise comparison between the query and the hit
content and order of protein domains
PubMed citations linked to the protein in the current version of NCBI databases
Gene Ontology terms

The rating system is useful when the tools integrated in ORCAN predict different orthologs for a given query.

How does ORCAN judge predictions from different tools?

By default, all 13 analyses contribute equally to the rating system. However, you can assign weights (integers from 1 to 10) to different tools, thus adjusting their level of contribution to the overall score for a given orthologous assignment. A weight of 10 means that a given tool is 10 times more important than a tool with a weight of 1.

If you want to provide your own weights for certain tools, click on the icon on the submission or results page. Your configuration will be saved for all further ortholog searches.
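
To illustrate how weights can shift a score, here is a minimal Python sketch. It assumes that the score is the weighted share of supporting analyses, rescaled linearly to the 1 to 10 range. That rescaling, the function name weighted_score and the four-tool example are assumptions made for illustration; the exact formula used by ORCAN is not described here.

```python
def weighted_score(supporting, weights):
    """Return a 1-10 score from the weighted share of supporting analyses.

    supporting: set of analysis names that support the assignment.
    weights: dict mapping every analysis name to an integer weight (1-10).
    The linear rescaling to 1-10 is an assumption of this sketch.
    """
    for name, weight in weights.items():
        if not (isinstance(weight, int) and 1 <= weight <= 10):
            raise ValueError(f"weight for {name} must be an integer from 1 to 10")
    total = sum(weights.values())
    agreed = sum(w for name, w in weights.items() if name in supporting)
    return max(1, round(10 * agreed / total))


supporting = {"RBH", "RSD"}

# Default: every analysis has weight 1, so 2 of 4 votes give a score of 5.
equal = weighted_score(supporting, {"InParanoid": 1, "OrthoMCL": 1, "RBH": 1, "RSD": 1})

# Raising the weight of RBH to 10 makes its vote count ten times as much.
boosted = weighted_score(supporting, {"InParanoid": 1, "OrthoMCL": 1, "RBH": 10, "RSD": 1})

print(equal, boosted)
```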

What else does the output tell me?

The ORCAN output includes information about the evolutionary and functional annotation of a query sequence.
These results are divided into 8 sections:

Orthologs - tabular report of orthologs predicted for a query sequence by 4 different tools (InParanoid, OrthoMCL, RBH and RSD).
Databases - tabular report of orthologs retrieved from 4 orthology databases (eggNOG, HOGENOM, OrthoDB, OrthoMCL). This section includes links to the databases and information about orthologous groups.
Sequence similarity - pairwise sequence alignments between query and orthologous proteins. This section includes global and local alignments providing information about the level of sequence similarity between query and orthologs.
Domain - graphical and textual reports comparing the content and order of protein domains that are present in query and orthologous proteins.
Gene Ontology - graphical report showing similarities of GO terms assigned to query and orthologous proteins.
Papers - a report showing similarities of articles (from PubMed) concerning query and orthologous proteins.
Summary - a report showing the overall evolutionary and functional relatedness of query and orthologous proteins.

How can I search for orthologs for my protein?

You only need to provide a protein sequence of interest (plain or FASTA format) and specify the organism of the input sequence and the target organism in which to search for orthologs. The server will run the orthology predictions and functional annotation in fully automated mode, without the need for further user intervention.

Does ORCAN allow for any customization?

Yes. You can assemble an individualized annotation pipeline by selecting the computational components that best suit your project's needs. In addition, you can adjust the contribution of the selected tools to the overall orthology rating by assigning weights (integers from 1 to 10) to individual tools.

Orthology prediction

What tools does ORCAN use to predict orthologs?

ORCAN uses 4 popular orthology prediction tools. Three of them (InParanoid 4.1, OrthoMCL, RBH) are graph-based methods for predicting orthologs. Graph-based methods are suitable for orthology inference from two or more complete genomes (proteomes). Unlike tree-based methods, they do not construct multiple sequence alignments and phylogenetic trees, but rely on pairwise sequence similarities calculated between all sequences involved and on an operational definition of orthology. In addition, ORCAN uses the RSD program, which improves upon the common procedure of taking reciprocal best BLAST hits (RBH) to identify orthologs. This method, the reciprocal smallest distance (RSD) algorithm, relies on global sequence alignment and maximum likelihood estimation of evolutionary distances to detect orthologs between two genomes.

How confident are ORCAN's predictions?

ORCAN predicts orthologs by taking into account many factors, such as individual orthology predictions (4 tools & 4 databases) and functional annotation (5 tools). We tested ORCAN by performing 190 pairwise comparisons of 20 yeast proteomes, using Pfam and Yeast Gene Order Browser (YGOB) data as a quality reference for orthology annotation. ORCAN increases the sensitivity (proportion of actual orthologs that are correctly predicted) by 2 percentage points and the precision (proportion of ortholog predictions that are correct) by almost 1 percentage point.
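
For readers who want to reproduce this kind of evaluation on their own data, the two measures can be computed as in the sketch below. It uses the standard definitions: precision is the share of predicted ortholog pairs that appear in the reference set, and sensitivity is the share of reference pairs that were predicted. The function name and the ortholog pairs are hypothetical placeholders, not data from the ORCAN benchmark.

```python
from math import comb


def precision_and_sensitivity(predicted, reference):
    """Compare predicted ortholog pairs against a trusted reference set."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    sensitivity = true_positives / len(reference) if reference else 0.0
    return precision, sensitivity


# Comparing 20 proteomes pairwise gives C(20, 2) = 190 comparisons.
n_comparisons = comb(20, 2)

# Placeholder ortholog pairs (protein in species A, protein in species B).
predicted = {("a1", "b1"), ("a2", "b2"), ("a3", "b9")}
reference = {("a1", "b1"), ("a2", "b2"), ("a4", "b4"), ("a5", "b5")}

precision, sensitivity = precision_and_sensitivity(predicted, reference)
print(n_comparisons, round(precision, 3), sensitivity)
```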

Are ORCAN results up-to-date?

ORCAN operates on the latest protein sequence data: as soon as UniProt releases a new set of reference proteomes, the data are automatically integrated without stopping or disrupting web services. Likewise, ORCAN automatically synchronizes data from the Pfam and PROSITE databases.

I got a warning "Sequence not found in the Uniprot Reference Proteome database". What does it mean?

There are two possible reasons for this message. First, the protein sequence you provided may not belong to the species you selected. Second, the protein sequence may not be included in the reference set of proteins for the given organism (UniProt provides sets of proteins thought to be expressed by organisms whose genomes have been completely sequenced, termed “proteomes”).

How does ORCAN behave when I have to cope with taxa having one-to-many or many-to-many orthologs?

The orthology prediction tools (InParanoid, OrthoMCL and RSD) integrated in ORCAN can predict one-to-many and many-to-many orthologous relationships for a given query. Likewise, most databases used in ORCAN can also return clusters of orthologs. If this is the case, ORCAN will present a list of all potential orthologs with their corresponding confidence level.

How come InParanoid and OrthoMCL work so fast in ORCAN?

ORCAN uses the original implementations of OrthoMCL 2.0.9 and InParanoid 4.1 with developer-recommended default parameters. The speed-up in their running time in ORCAN comes from the fact that the orthology prediction procedure is performed on a pre-selected set of proteins that show recognizable sequence similarity to the sequence of interest, rather than on the full set of proteins, the majority of which are functionally and evolutionarily unrelated to the query sequence.

Technically, the search for orthologs in species B for a query protein A1 from species A can be described as follows. The protein sequence A1 is used as the query in two BLAST searches, against proteomes A and B. From both BLAST results, we collect all protein sequences that might be evolutionarily relevant to the query protein: the 20 highest-scoring protein hits, plus all other protein matches (if any) with an e-value below 1e-5. This e-value cut-off is the one most commonly used in searches for homologous proteins (e.g. it is the default in OrthoMCL, RSD and Pfam). In addition, selecting the 20 top-ranked proteins regardless of e-value guarantees that the true ortholog of the query protein (if it exists) is present among the selected proteins, even in searches that involve phylogenetically distant organisms (e.g. human and bacteria). In the next step, the FASTA sequences of the selected proteins are retrieved from proteomes A and B and used to create the corresponding datasets, A' and B'. These datasets are then used as input to OrthoMCL 2.0.9 and InParanoid 4.1. Once the orthology prediction is finished, ORCAN parses the output files and reports the potential orthologs and co-orthologs for the user's query sequence.
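
The pre-selection rule (top 20 hits regardless of e-value, plus any further hit with an e-value below 1e-5) can be sketched as follows. This is an illustration only: the function name select_candidates and the hit tuples are hypothetical, and the sketch assumes the BLAST output has already been parsed into (protein ID, bit score, e-value) tuples.

```python
def select_candidates(hits, top_n=20, evalue_cutoff=1e-5):
    """Pick the proteins kept for the reduced dataset (A' or B').

    hits: list of (protein_id, bit_score, evalue) from one BLAST search.
    Keeps the top_n highest-scoring proteins regardless of e-value, plus
    every further protein whose e-value is below evalue_cutoff.
    """
    ranked = sorted(hits, key=lambda hit: hit[1], reverse=True)
    selected, seen = [], set()
    for protein_id, _score, evalue in ranked:
        if protein_id in seen:
            continue  # keep each protein once, at its best-scoring hit
        if len(selected) < top_n or evalue < evalue_cutoff:
            selected.append(protein_id)
            seen.add(protein_id)
    return selected


# Placeholder hits; top_n is lowered to 2 to keep the example short.
hits = [("p1", 300, 1e-80), ("p2", 60, 1e-9), ("p3", 45, 1e-6), ("p4", 30, 0.2)]
close = select_candidates(hits, top_n=2)

# Distant species: no hit passes the cut-off, but the top hits are still kept.
weak_hits = [("q1", 28, 0.8), ("q2", 25, 2.0), ("q3", 22, 5.0)]
distant = select_candidates(weak_hits, top_n=2)

print(close, distant)
```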

Of note, the four popular orthology prediction algorithms integrated in ORCAN (i.e. InParanoid, OrthoMCL, RBH and RSD) are so-called graph-based methods, which means that they use different variants (depending on the underlying algorithm) of reciprocal BLAST searches to find the ‘nearest neighbor’ (Kuzniar et al., Trends in Genetics, 2008). To be methodologically correct, all graph-based programs require complete proteomes for orthology inference. However, the graph-based approach does not require predicting all orthologs between two proteomes when the user is interested in only one protein. For example, in the simplest RBH case, one can BLAST a protein of interest against the subject's complete proteome, then take the best hit and BLAST it back against the query's complete proteome. Likewise, the RSD program lets users find orthologs for specific sequences in the query genome without the time-consuming step of calculating all other orthologs.
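
The single-protein RBH case can be sketched as below. To keep the example self-contained, the two BLAST searches are replaced by lookup tables of best hits; the function name, the tables and the protein IDs are hypothetical stand-ins for real BLAST runs against complete proteomes.

```python
def reciprocal_best_hit(query_id, best_hit_in_b, best_hit_in_a):
    """Return the RBH ortholog of query_id in proteome B, or None.

    best_hit_in_b: best hit in proteome B for each protein of proteome A.
    best_hit_in_a: best hit in proteome A for each protein of proteome B.
    Both tables stand in for BLAST searches in this sketch.
    """
    candidate = best_hit_in_b.get(query_id)
    if candidate is None:
        return None  # the forward search found nothing
    # The candidate is accepted only if its own best hit is the query.
    if best_hit_in_a.get(candidate) == query_id:
        return candidate
    return None


forward = {"A1": "B7", "A2": "B3"}
backward = {"B7": "A1", "B3": "A9"}

print(reciprocal_best_hit("A1", forward, backward))
print(reciprocal_best_hit("A2", forward, backward))
```

Only two searches are needed for one query: one forward search for the query itself and one backward search for its best hit.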

Orthology databases

What databases does ORCAN scan?

ORCAN queries 5 high-quality orthology databases: OMA Browser, OrthoDB, eggNOG 4.5, InParanoid 8 and HOGENOM.

How up-to-date are orthologs retrieved from databases?

The databases contain pre-computed orthologous gene assignments. These calculations, however, scale quadratically with the number of genomes, which makes including all available genomes in the databases infeasible. As a result, information in the orthology databases lags behind the current sequence resources, and this gap is expected to increase in the future. The databases are updated, on average, once a year. Since UniProt releases a new set of proteomes every month, we recommend paying more attention to the results of the live predictions (InParanoid, OrthoMCL, RBH and RSD).

Functional annotation

What does ORCAN use for functional annotation of orthologs?

ORCAN exploits 5 comparisons between query protein sequence and identified orthologs: pairwise sequence alignments, annotation of protein domains, detection of functional motifs, association of Gene Ontology Terms and retrieval of relevant articles.

How does ORCAN retrieve articles about a given protein?

ORCAN uses two approaches to find PubMed articles relevant to the query:

The server connects with UniProt records of query and orthologous proteins to link PubMed literature (those publications associated with more than 100 protein records are excluded).
ORCAN directly searches the PubMed database using gene names (if available) and UniProt accessions.
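
The exclusion of publications linked to more than 100 protein records can be sketched as a simple filter. The function names, the article identifiers and the record counts below are hypothetical, and comparing the two proteins by the articles they share is an assumption of this sketch rather than a description of ORCAN's implementation.

```python
def informative_articles(pmids, records_per_article, max_records=100):
    """Drop articles associated with more than max_records protein records."""
    return {p for p in pmids if records_per_article.get(p, 0) <= max_records}


def shared_articles(query_pmids, ortholog_pmids, records_per_article):
    """Articles linked to both proteins after filtering (assumed comparison)."""
    query = informative_articles(query_pmids, records_per_article)
    ortholog = informative_articles(ortholog_pmids, records_per_article)
    return query & ortholog


# Placeholder identifiers; pmid_2 mimics a large-scale study that is
# linked to thousands of proteins and is therefore filtered out.
records_per_article = {"pmid_1": 3, "pmid_2": 2500, "pmid_3": 100, "pmid_4": 1}
query_pmids = {"pmid_1", "pmid_2", "pmid_3"}
ortholog_pmids = {"pmid_2", "pmid_3", "pmid_4"}

shared = shared_articles(query_pmids, ortholog_pmids, records_per_article)
print(sorted(shared))
```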

How does ORCAN compare proteins at the level of protein domains?

ORCAN treats proteins as strings of domains (domain arrangements). The similarity between two proteins is determined by aligning the proteins' domain arrangements to each other using a dynamic programming algorithm. Domain arrangements are compared using a version of the Needleman-Wunsch algorithm (Needleman and Wunsch, 1970) with values for match (same domain ID), mismatch (different domain IDs) and affine gap costs.
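
A global alignment of two domain arrangements with affine gap costs can be sketched with the usual three-matrix formulation of Needleman-Wunsch. The scoring values (match 1, mismatch -1, gap open 2, gap extend 1), the function name and the domain labels are assumptions chosen for illustration; ORCAN's actual parameters are not stated here.

```python
NEG_INF = float("-inf")


def align_domains(a, b, match=1, mismatch=-1, gap_open=2, gap_extend=1):
    """Global alignment score of two domain lists with affine gaps.

    A gap of length k costs gap_open + (k - 1) * gap_extend.
    M: a[i-1] aligned to b[j-1]; X: a[i-1] aligned to a gap;
    Y: b[j-1] aligned to a gap.
    """
    n, m = len(a), len(b)
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = -(gap_open + (i - 1) * gap_extend)
    for j in range(1, m + 1):
        Y[0][j] = -(gap_open + (j - 1) * gap_extend)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = score + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            X[i][j] = max(M[i - 1][j] - gap_open,
                          X[i - 1][j] - gap_extend,
                          Y[i - 1][j] - gap_open)
            Y[i][j] = max(M[i][j - 1] - gap_open,
                          Y[i][j - 1] - gap_extend,
                          X[i][j - 1] - gap_open)
    return max(M[n][m], X[n][m], Y[n][m])


# Placeholder domain IDs: the query has one extra domain in the middle.
query = ["domA", "domB", "domC"]
ortholog = ["domA", "domC"]

print(align_domains(query, ortholog))     # two matches and one opened gap
print(align_domains(ortholog, ortholog))  # identical arrangements
```

Because domains are compared by ID only, two proteins with the same domains in a different order score lower than two proteins with identical arrangements, which is what makes the order of domains matter.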

How does ORCAN retrieve Gene Ontology Terms for identified proteins?

For query and orthologous proteins ORCAN uses cross-links between the UniProt and Gene Ontology databases.

Other

Can I use ORCAN with multiple query sequences at once?

At the moment ORCAN can only annotate query proteins one at a time since these calculations are highly complex.

Can I download ORCAN and use it as a standalone program?

At the moment, ORCAN works only as a web service. Most tools used by ORCAN require specific settings and many dependencies and external libraries.