Help Documentation

Whub is a freely accessible web portal to study functional W-containing motifs in animal, plant and viral proteins.

Whub assists newcomers to the field by providing a one-stop shop solution to access information on RNAi-related W-containing motifs, while offering expert users an integral open-source package.

Main features

GW proteins

Catalog of known AGO-binding proteins that aims at giving immediate insight into current knowledge about the respective gene, including a focus on mutagenesis information associated with RNAi phenotypes.

Every protein entry page (e.g. TNRC6A) is subdivided into distinct collapsible portlets.

The Domain configuration panel provides a visual representation of the location of all Trp occurrences in the context of protein domains and the entire protein.

More precise information about the biochemically mapped minimal W-rich regions, such as regions sufficient for interaction with different AGO family members, is obtained through manual searches of the relevant literature and shown as a table in the Regions panel. The Mutagenesis section contains a searchable and sortable list of experimental mutations. It describes the effect of mutating one or more amino acids, as single point mutations or in combination, on the biological properties of the protein.

The Sequence panel displays the full-length protein sequence to which all positional annotations in the other sections refer. In addition, the record page provides an interactive domain match viewer and feature highlighting; moving the cursor over any annotation feature highlights its position in the full-length protein sequence.

Compositional analysis of single W-containing motifs

Computational framework for analyzing the impact of flanking residues in several thousand single W-containing motifs of AGO-binding proteins from plants, animals and viruses.

Whub visualizes the PWM profile of W-containing motifs as an interactive heatmap. The visualization gives the user immediate information about which amino acids are preferentially present (red) or absent (blue) at particular motif positions. The actual amino acid score values at each position can be inspected by moving the mouse over the respective row in the heatmap.

In addition, by analyzing the motifs that build up the PWM profile, the web page provides a frequency distribution of motif position occupancy, showing how often each position is present in the analyzed motifs.

Whub also provides information about the overall amino acid composition of W-containing motifs, presented as a donut chart and a table that contains more detailed numerical data such as log-odds and frequency ratios.

Additionally, the portal provides access to a searchable list of the analyzed motifs, with their sequence, source organism, protein name and position.

Game

An interactive game that lets users design synthetic GW/WG domains in silico, or modify existing ones, through a series of drag-and-drop moves.

Whub provides an intuitive gamification framework that converts laborious and time-consuming mutagenesis assays into a puzzle game that web users can easily understand and play through a user-friendly interface. When a game starts, players can choose either to design domain sequences de novo or to remodel existing sequences of W-containing domains. The amino acid sequence is represented by a string of blocks whose red or blue color intensity reflects the PWM score of a given amino acid at a given position. Each amino acid block can be replaced by any other amino acid through clicks or taps, or moved by drag-and-drop, if necessary pushing its neighbors, which emulates transposition events. Players can also insert and delete amino acid blocks.

The goal is to modify the blocks to find a configuration that maximizes the score and thereby the red color intensity of all blocks. We score the puzzles using the scheme for assembled motifs used in Agos2. As the player modifies the sequence, the score and block colors are automatically recomputed and displayed.

Relevant papers

Handy way of browsing the bibliographic citations on GW proteins.

Literature citations are shown to users as cards, each containing the first author, journal and year of publication. Whub allows users to filter the literature collection on the fly; for example, with a few mouse clicks or finger taps, users can narrow the corpus to research articles concerning AGO-binding proteins in plants, sorted by year of publication.

Users can click on any card to fetch the PubMed citation, when available, and display the summary in Whub's library. In addition, paper-to-display options provides ....

Wsearch (Agos2) on-line scanner

Rapid detection of highest-scoring single functional W-containing motifs.

Because the matrices put more weight on residues close to Trp, the chance of falsely detecting long, low-complexity sequences that are merely compositionally compatible (e.g. glycine-rich domains, WW domains) is minimized. The service allows the prediction of potentially functional single motifs, the determination of their boundaries, and the statistical quantification of predicted sequences, alone or in combination.

Batch mode

Software for large-scale high-throughput protein sequence analysis.

The batch mode extends Agos2 to support large-scale, high-throughput sequence analysis. In addition to screening large protein sets using built-in matrices, users can create their own PWM profiles based on custom protein sequences, a feature not available from the main Whub web interface.

i-Wsearch

Machine-learning-based AGO-binding classifier

i-Wsearch formulates AGO-binding activity as a binary classification problem, where a W-motif is labeled 1 if it is an AGO-binding site and 0 otherwise. We employ the sliding window technique to encode the amino acid residues flanking the Trp residues of proteins. Each amino acid residue in the Trp neighbor profile is characterized by 3 descriptors: hydrophilicity, flexibility and volume. In addition, two amino acid distances, from the motif to the nearest Trp residues on the N- and C-terminal sides, are also considered.

i-Wsearch achieves high prediction performance (sensitivity, SN = 95.17%; specificity, SP = 99.94%; positive predictive value, PPV = 99.68%).

Position-specific Scoring Matrix (PSSM) for single W-containing motifs

Wsearch uses a PSSM to score Trp (W)-containing sequences.

The rationale behind this position-specific approach comes from the experimental observation that tryptophan is crucial for the interaction, and its context modulates the strength of the interaction.

Not all GW repeats contribute equally to the interaction [Takimoto et al. 2009, Eulalio et al. 2009, Yao et al. 2011, Chekulaeva et al. 2010].
Mutational analysis showed that substituting residues adjacent to the repeats affected AGO binding [Till et al. 2007, Pfaff et al. 2013].

The implementation of such a model follows the basic idea behind a position weight matrix (PWM). Tryptophan is taken as the midpoint of the matrix, and the likelihoods of all non-tryptophan amino acids are spread across more distant positions in both directions from this central residue. Because single W-containing motifs most often span up to several residues, longer sequences are less frequent, so the likelihoods of amino acids at more distant positions decrease. This reflects the experimental observation that amino acids close to tryptophan in the primary sequence are more crucial for the function of W motifs than residues farther away.

PSSM Calculation

The source sequence dataset consists of a manually selected collection of GW protein domains already known to interact with AGO proteins, for example the NRPE1 subunit of polymerase V in Arabidopsis:

>gi|79571777|ref|NP_181532.2| NRPE1|1285:1714|Arabidopsis thaliana
DSALGEPKFEDSADFQNLHDEGKPSGANWEKSSSWDNGCSGGSEWGVSKSTGGEANPESNWEKTTNVEKE
DAWSSWNTRKDAQESSKSDSGGAWGIKTKDADADTTPNWETSPAPKDSIVPENNEPTSDVWGHKSVSDKS
WDKKNWGTESAPAAWGSTDAAVWGSSDKKNSETESDAAAWGSRDKNNSDVGSGAGVLGPWNKKSSETESN
GATWGSSDKTKSGAAAWNSWDKKNIETDSEPAAWGSQGKKNSETESGPAAWGAWDKKKSETEPGPAGWGM
GDKKNSETELGPAAMGNWDKKKSDTKSGPAAWGSTDAAAWGSSDKNNSETESDAAAWGSRNKKTSEIESG
AGAWGSWGQPSPTAEDKDTNEDDRNPWVSLKETKSREKDDKERSQWGNPAKKFPSSGGWSNGGGADWKGN
RNHTPRPPRS

From this initial set, we extracted all non-overlapped subsequences that contain a single Trp residue flanked by non-tryptophan amino acids.

>gi|79571777|ref|NP_181532.2| NRPE1|1285:1316|Arabidopsis thaliana
DSALGEPKFEDSADFQNLHDEGKPSGANWEKS
>gi|79571777|ref|NP_181532.2| NRPE1|1316:1324|Arabidopsis thaliana
SSSWDNGCS
>gi|79571777|ref|NP_181532.2| NRPE1|1324:1337|Arabidopsis thaliana
SGGSEWGVSKSTGG
>gi|79571777|ref|NP_181532.2| NRPE1|1337:1351|Arabidopsis thaliana
GEANPESNWEKTTNV
>gi|79571777|ref|NP_181532.2| NRPE1|1351:1358|Arabidopsis thaliana
VEKEDAWS
>gi|79571777|ref|NP_181532.2| NRPE1|1359:1369|Arabidopsis thaliana
SWNTRKDAQES
>gi|79571777|ref|NP_181532.2| NRPE1|1369:1385|Arabidopsis thaliana
SSKSDSGGAWGIKTKDA
>gi|79571777|ref|NP_181532.2| NRPE1|1386:1404|Arabidopsis thaliana
DADTTPNWETSPAPKDSIV
>gi|79571777|ref|NP_181532.2| NRPE1|1404:1420|Arabidopsis thaliana
VPENNEPTSDVWGHKSV
>gi|79571777|ref|NP_181532.2| NRPE1|1420:1427|Arabidopsis thaliana
VSDKSWDK
>gi|79571777|ref|NP_181532.2| NRPE1|1428:1434|Arabidopsis thaliana
KNWGTES
>gi|79571777|ref|NP_181532.2| NRPE1|1435:1443|Arabidopsis thaliana
APAAWGSTD
>gi|79571777|ref|NP_181532.2| NRPE1|1443:1455|Arabidopsis thaliana
DAAVWGSSDKKNS
>gi|79571777|ref|NP_181532.2| NRPE1|1456:1474|Arabidopsis thaliana
ETESDAAAWGSRDKNNSDV
>gi|79571777|ref|NP_181532.2| NRPE1|1474:1491|Arabidopsis thaliana
VGSGAGVLGPWNKKSSET
>gi|79571777|ref|NP_181532.2| NRPE1|1491:1504|Arabidopsis thaliana
TESNGATWGSSDKT
>gi|79571777|ref|NP_181532.2| NRPE1|1505:1512|Arabidopsis thaliana
KSGAAAWN
>gi|79571777|ref|NP_181532.2| NRPE1|1513:1521|Arabidopsis thaliana
SWDKKNIET
>gi|79571777|ref|NP_181532.2| NRPE1|1521:1536|Arabidopsis thaliana
TDSEPAAWGSQGKKNS
>gi|79571777|ref|NP_181532.2| NRPE1|1537:1546|Arabidopsis thaliana
ETESGPAAWG
>gi|79571777|ref|NP_181532.2| NRPE1|1547:1555|Arabidopsis thaliana
AWDKKKSET
>gi|79571777|ref|NP_181532.2| NRPE1|1555:1572|Arabidopsis thaliana
TEPGPAGWGMGDKKNSET
>gi|79571777|ref|NP_181532.2| NRPE1|1572:1589|Arabidopsis thaliana
TELGPAAMGNWDKKKSDT
>gi|79571777|ref|NP_181532.2| NRPE1|1589:1600|Arabidopsis thaliana
TKSGPAAWGSTD
>gi|79571777|ref|NP_181532.2| NRPE1|1600:1612|Arabidopsis thaliana
DAAAWGSSDKNNS
>gi|79571777|ref|NP_181532.2| NRPE1|1613:1629|Arabidopsis thaliana
ETESDAAAWGSRNKKTS
>gi|79571777|ref|NP_181532.2| NRPE1|1630:1639|Arabidopsis thaliana
EIESGAGAWG
>gi|79571777|ref|NP_181532.2| NRPE1|1640:1651|Arabidopsis thaliana
SWGQPSPTAEDK
>gi|79571777|ref|NP_181532.2| NRPE1|1651:1670|Arabidopsis thaliana
KDTNEDDRNPWVSLKETKSR
>gi|79571777|ref|NP_181532.2| NRPE1|1671:1686|Arabidopsis thaliana
EKDDKERSQWGNPAKK
>gi|79571777|ref|NP_181532.2| NRPE1|1687:1697|Arabidopsis thaliana
FPSSGGWSNGG
>gi|79571777|ref|NP_181532.2| NRPE1|1697:1714|Arabidopsis thaliana
GGADWKGNRNHTPRPPRS
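The extraction step can be sketched in Python as follows. This is a minimal sketch, not the Whub implementation: the function name `split_single_trp` is hypothetical, and the boundary rule (each stretch between two neighboring Trp residues is split at its midpoint, with the middle residue of an odd-length stretch shared by both neighbors) is an assumption inferred from the NRPE1 subsequences listed above.

```python
def split_single_trp(seq):
    """Split a sequence into subsequences that each contain exactly one Trp.

    Returns (start, end, subsequence) tuples with 0-based, half-open
    coordinates relative to `seq`.
    Assumption: the stretch between two neighboring Trp residues is cut at its
    midpoint; the middle residue of an odd-length stretch goes to both sides.
    """
    w_positions = [i for i, aa in enumerate(seq) if aa == "W"]
    segments = []
    for k, w in enumerate(w_positions):
        if k == 0:
            start = 0  # first Trp: extend to the start of the sequence
        else:
            gap = w - w_positions[k - 1] - 1
            start = w_positions[k - 1] + 1 + gap // 2
        if k == len(w_positions) - 1:
            end = len(seq)  # last Trp: extend to the end of the sequence
        else:
            gap = w_positions[k + 1] - w - 1
            end = w + 1 + (gap + 1) // 2
        segments.append((start, end, seq[start:end]))
    return segments


# First 53 residues of the NRPE1 domain shown above
fragment = "DSALGEPKFEDSADFQNLHDEGKPSGANWEKSSSWDNGCSGGSEWGVSKSTGG"
for start, end, sub in split_single_trp(fragment):
    print(start, end, sub)
```

Applied to this fragment, the sketch yields the first three NRPE1 subsequences listed above. The coordinates it returns are relative to the input string, not to the full-length protein.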

Sequence motifs are aligned to center the Trp. The Trp residue is the midpoint at position 0, and the flanking non-Trp residues extend in both directions, toward the N- and C-termini.

DSALGEPKFEDSADFQNLHDEGKPSGANWEKS----------
-------------------------SSSWDNGCS--------
----------------------FPSSGGWSNGG---------
-----------------------SGGSEWGVSKSTGG-----
-----------------------VSDKSWDK-----------
------------------VGSGAGVLGPWNKKSSET------
---------------------------SWDKKNIET------
---------------------------AWDKKKSET------
------------------TELGPAAMGNWDKKKSDT------
------------------------APAAWGSTD---------
---------------------TKSGPAAWGSTD---------
--------------------ETESGPAAWG------------
---------------------TEPGPAGWGMGDKKNSET---
------------------------DAAVWGSSDKKNS-----
--------------------ETESDAAAWGSRDKNNSDV---
------------------------DAAAWGSSDKNNS-----
--------------------ETESDAAAWGSRNKKTS-----
----------------------KSGAAAWN------------
--------------------EIESGAGAWG------------
---------------------TDSEPAAWGSQGKKNS-----
---------------------TESNGATWGSSDKT-------
--------------------------KNWGTES---------
---------------------------SWGQPSPTAEDK---
-------------------SSKSDSGGAWGIKTKDA------
-----------------VPENNEPTSDVWGHKSV--------
---------------------DADTTPNWETSPAPKDSIV--
-------------------EKDDKERSQWGNPAKK-------
----------------------VEKEDAWS------------
--------------------GEANPESNWEKTTNV-------
---------------------------SWNTRKDAQES----
------------------------GGADWKGNRNHTPRPPRS
------------------KDTNEDDRNPWVSLKETKSR----
                            *
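The Trp-centered layout above can be reproduced with a few lines of Python. This is a minimal sketch; the function name `align_on_trp` is hypothetical, and it assumes every input motif contains exactly one Trp.

```python
def align_on_trp(motifs, gap="-"):
    """Pad single-Trp motifs with gap characters so that all Trp residues
    fall in the same column (position 0 of the profile)."""
    # str.index raises ValueError if a motif has no Trp at all
    left = max(m.index("W") for m in motifs)
    right = max(len(m) - m.index("W") - 1 for m in motifs)
    rows = []
    for m in motifs:
        w = m.index("W")
        n_side = gap * (left - w)                   # padding on the N-terminal side
        c_side = gap * (right - (len(m) - w - 1))   # padding on the C-terminal side
        rows.append(n_side + m + c_side)
    return rows


for row in align_on_trp(["SSSWDNGCS", "KNWGTES", "VEKEDAWS"]):
    print(row)
```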

Because subsequences differ in length, the weights of Trp-flanking residues decrease with distance from the Trp.
Observed frequencies ($P^{\mathrm{obs}}$) of all non-tryptophan residues flanking Trp were obtained from counts of amino acids within each column of the profile, as follows: $P^{\mathrm{obs}}_{ia} = N_{ia}/N$, where $i$ is a position in the motif sequence; $a$ is an amino acid present at position $i$; $N_{ia}$ is the number of occurrences of amino acid $a$ at position $i$; and $N$ is the number of motif sequences. The observed frequencies were compared to the corresponding expected frequencies ($P^{\mathrm{exp}}$) obtained from background subsequences in UniProt and used to calculate a log-odds score according to the following formula: $D_{ia} = 2 \times \log_2(P^{\mathrm{obs}}_{ia}/P^{\mathrm{exp}}_{ia})$.
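The calculation described above can be sketched as follows. This is an illustration rather than the Agos2 code: `build_pssm` is a hypothetical name, positions are expressed as offsets from the Trp (negative toward the N-terminus), and the `background` argument is assumed to be a position-independent table of expected amino acid frequencies supplied by the caller.

```python
import math
from collections import Counter, defaultdict


def build_pssm(motifs, background):
    """Return {(offset, amino_acid): log-odds score} for single-Trp motifs.

    Observed frequency = count at that offset / total number of motifs, so
    offsets reached by only a few long motifs get low observed frequencies.
    Assumption: `background` maps each amino acid to one expected frequency.
    """
    n_motifs = len(motifs)
    counts = defaultdict(Counter)  # offset from Trp -> amino acid counts
    for motif in motifs:
        w = motif.index("W")
        for j, aa in enumerate(motif):
            if j != w:
                counts[j - w][aa] += 1
    pssm = {}
    for offset, column in counts.items():
        for aa, n_ia in column.items():
            p_obs = n_ia / n_motifs
            pssm[(offset, aa)] = 2 * math.log2(p_obs / background[aa])
    return pssm
```

Only amino acids actually observed at a position receive a score, which mirrors the definition of a given above; how unobserved residues are treated is left to the caller.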

Scoring W-containing motifs

As a starting point, the locations of Trp residues are found in the protein sequence. By analogy with the extension of alignments in the BLAST algorithm, scoring progresses to the left and right of each seed Trp one residue at a time. For each direction, cumulative scores are calculated using the PSSM values for every position. The motif extension does not stop until the running score reaches the maximum accumulated score or meets another W residue. The scores from the left and right directions are summed and the highest-scoring single motif is returned. Overlapping motifs are joined and the score for the assembled domain is calculated.
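One possible reading of the extension procedure is sketched below. It is an interpretation under stated assumptions, not the Agos2 implementation: the extension walks outward from each Trp until it meets another Trp or the end of the sequence, and keeps the prefix with the highest accumulated score. The names `best_extension` and `score_motifs`, the `(offset, amino_acid)` key layout of `pssm`, and the `missing` score for pairs absent from the matrix are all hypothetical. Joining of overlapping motifs is not shown.

```python
def best_extension(seq, w, step, pssm, missing=0.0):
    """Extend from the Trp at index w in one direction (step = -1 or +1).

    Returns (best cumulative score, number of residues kept).
    """
    running = best = 0.0
    best_len = length = 0
    j = w + step
    while 0 <= j < len(seq) and seq[j] != "W":
        length += 1
        running += pssm.get((step * length, seq[j]), missing)
        if running > best:  # remember the highest-scoring prefix
            best, best_len = running, length
        j += step
    return best, best_len


def score_motifs(seq, pssm, missing=0.0):
    """Return (start, end, score) for every Trp, 0-based half-open coordinates."""
    hits = []
    for w, aa in enumerate(seq):
        if aa != "W":
            continue
        left_score, left_len = best_extension(seq, w, -1, pssm, missing)
        right_score, right_len = best_extension(seq, w, +1, pssm, missing)
        hits.append((w - left_len, w + right_len + 1, left_score + right_score))
    return hits
```

With a matrix such as `{(-1, "A"): 1.0, (-2, "G"): -3.0, (1, "G"): 2.0, (2, "S"): 0.5}`, calling `score_motifs("GAWGS", pssm)` keeps the motif `AWGS` and drops the leading `G`, because including it would lower the accumulated score.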

We formulate AGO-binding activity as a binary classification problem, where a W-motif is labeled 1 if it is an AGO-binding site and 0 otherwise. We employ the sliding window technique to encode the amino acid residues flanking the Trp residues of proteins. Each amino acid residue is characterized by a number of descriptors, including properties of amino acid residues, together with two distances from the motif to the nearest Trp residues on the N- and C-terminal sides. In other words, whether a Trp residue belongs to the AGO-binding class is determined by the context of its neighboring residues and its distances to the nearest Trp residues.
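The sliding-window encoding can be illustrated with the sketch below. It is not the i-Wsearch code: the function name `encode_trp_sites` is hypothetical, the per-residue descriptor table must be supplied by the caller (no property scales are included here), and two conventions are assumptions made for the example: window positions outside the sequence or absent from the table are padded with `pad`, and when no Trp exists on one side the distance to the sequence terminus is used instead.

```python
def encode_trp_sites(seq, descriptors, window=11, pad=0.0):
    """Build one feature vector per Trp residue in `seq`.

    descriptors: {amino_acid: (value1, value2, ...)}, e.g. hydrophilicity,
                 flexibility and volume taken from published scales.
    Each vector holds the descriptors of the flanking residues in the window
    (the central Trp itself is skipped), followed by the distances to the
    nearest Trp on the N-terminal and C-terminal sides.
    """
    half = window // 2
    n_desc = len(next(iter(descriptors.values())))
    w_positions = [i for i, aa in enumerate(seq) if aa == "W"]
    rows = []
    for k, w in enumerate(w_positions):
        features = []
        for j in range(w - half, w + half + 1):
            if j == w:
                continue
            if 0 <= j < len(seq) and seq[j] in descriptors:
                features.extend(descriptors[seq[j]])
            else:
                features.extend([pad] * n_desc)  # assumed padding convention
        # Assumed fallback: distance to the terminus when no Trp is on that side
        dist_n = w - w_positions[k - 1] if k > 0 else w
        dist_c = (w_positions[k + 1] - w if k < len(w_positions) - 1
                  else len(seq) - 1 - w)
        features.extend([dist_n, dist_c])
        rows.append(features)
    return rows
```

The resulting rows, paired with 0/1 labels, can be passed to any binary classifier.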

Prediction quality was evaluated by 10-fold cross-validation. For example, when a window of 11 amino acids is selected, the RF-based classifier achieves a sensitivity of 94.29%, a specificity of 99.68% and a precision of 99.01%. Prediction performance improves slightly (SN = 95.17%; SP = 99.94%; PPV = 99.68%) when three residue features, flexibility, hydrophilicity and volume, are combined as the descriptor of AGO-binding information.
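For reference, the reported measures follow the standard definitions SN = TP/(TP+FN), SP = TN/(TN+FP) and PPV = TP/(TP+FP). The sketch below computes them from true and predicted labels and shows one simple way to form cross-validation folds. The function names are hypothetical, and the interleaved, unshuffled fold assignment is an assumption for illustration, not a description of how the reported figures were obtained.

```python
def binary_metrics(y_true, y_pred):
    """Sensitivity (SN), specificity (SP) and precision (PPV) for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "SN": ratio(tp, tp + fn),
        "SP": ratio(tn, tn + fp),
        "PPV": ratio(tp, tp + fp),
    }


def k_fold_indices(n_samples, k=10):
    """Assign sample indices to k folds in an interleaved fashion (no shuffling)."""
    return [list(range(i, n_samples, k)) for i in range(k)]
```

In each round of cross-validation, one fold serves as the test set and the remaining folds are used for training; the metrics are then averaged or pooled over the rounds.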