Help Documentation
Whub is a freely accessible web portal to study functional W-containing motifs in animal, plant and viral proteins.
Whub assists newcomers to the field by providing a one-stop solution for accessing information on RNAi-related W-containing motifs, while offering expert users an integrated open-source package.
Catalog of known AGO-binding proteins that aims to give immediate insight into current knowledge about the respective gene, with a focus on mutagenesis information associated with RNAi phenotypes.
Every protein entry page (e.g. TNRC6A) is subdivided into distinct collapsible portlets.
The Domain configuration panel provides a visual representation of the location of all Trp occurrences in the context of protein domains and the entire protein.
More precise information about the biochemically mapped minimal W-rich regions, such as regions sufficient for interaction with different AGO family members, is obtained through manual searches of the relevant literature and shown as a table in the Regions panel. The Mutagenesis section contains a searchable and sortable list of experimental mutations. It describes the effect that mutating one or more amino acids, singly or as a combination of several point mutations, has on the biological properties of the protein.
The Sequence panel displays the full-length protein sequence to which all positional annotations in the other sections refer. In addition, the record page provides an interactive domain match viewer and feature highlighting; moving the cursor over any annotation feature highlights its position in the full-length protein sequence.
Computational framework for analyzing the impact of flanking residues in several thousand single W-containing motifs of AGO-binding proteins from plants, animals and viruses.
Whub visualizes the PWM profile of W-containing motifs in the form of an interactive heatmap. The visualization gives the user immediate information about the preference of amino acids to be specifically present (red) or absent (blue) at certain motif positions. The actual amino acid score values at different positions can be inspected by moving the mouse over the respective row in the heatmap.
In addition, by analyzing the motifs that build up the PWM profile, the web page provides a frequency distribution of motif position occupancy showing how often each position is present in the analyzed motifs.
Whub also provides information about the overall amino acid composition of W-containing motifs, which is presented to the user as a donut chart and a table containing more detailed numerical data such as log-odds scores and frequency ratios.
Additionally, the portal provides access to a searchable list of the analyzed motifs, in terms of their sequence, source organism, protein name and position.
An interactive game enabling users to design synthetic GW/WG domains in silico, or modify existing ones, through a series of drag-and-drops.
Whub provides an intuitive gamification framework that converts laborious and time-consuming mutagenesis assays into a puzzle game that web users can easily understand and play through a user-friendly interface. When a game starts, players can choose either to design domain sequences de novo or to remodel existing sequences of W-containing domains. The amino acid sequence is represented by a string of blocks whose red and blue color intensities reflect the PWM score for a given amino acid at a given position. Each amino-acid block can be replaced by any other amino acid through clicks/taps, or moved by drag-and-drop, pushing its neighbors if necessary, which emulates transposition events. Players can also insert and delete amino-acid blocks.
The goal of the framework is to modify the blocks in order to find a configuration that maximizes score and thereby red color intensity of all blocks. We score the puzzles using a scheme for assembled motifs as in Agos2. As the player modifies the sequence, the score and block colors are automatically recomputed and displayed.
Handy way of browsing the bibliographic citations on GW proteins.
Literature citations are shown to users as cards, each containing the first author, journal and year of publication. Whub allows users to filter the literature collection on the fly; for example, with a few mouse clicks or finger taps users can narrow the corpus to research articles concerning AGO-binding proteins in plants, sorted by year of publication.
Users can click on any card to fetch the PubMed citation, when available, and display its summary in Whub's library. In addition, paper-to-display options provides ....
Rapid detection of highest-scoring single functional W-containing motifs.
As the matrices put more weight on residues close to Trp, the chance of falsely detecting long, low-complexity sequences that are merely compositionally compatible (e.g. glycine-rich domains, WW domains) is minimized. The service allows the prediction of potentially functional single motifs, the determination of their boundaries, and the statistical quantification of predicted sequences, alone or in combination.
Software for large-scale high-throughput protein sequence analysis.
Agos2 to allow it to support large-scale, high-throughput sequence analysis. In addition to screening large protein sets using built-in matrices, users can create their own PWM profiles based on custom protein sequences, which is a feature not available from the main Whub web interface.
(Machine-learning)-based AGO-binding classifier
i-Wsearch formulates AGO-binding activity as a binary classification problem, where a W-motif is labeled 1 if it is an AGO-binding site and 0 otherwise. We employ the sliding window technique to encode the amino acid residues flanking the Trp residues of proteins. Each amino acid residue in the Trp neighbor profile is characterized by 3 descriptors: hydrophilicity, flexibility and volume. In addition, the two amino-acid distances from the motif to the nearest Trp residues on the N- and C-terminal sides are also considered.
i-Wsearch achieves high prediction performance (SN = 95.17%; SP = 99.94%; PPV = 99.68%).
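The sliding-window encoding described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the i-Wsearch implementation: the function name `encode_trp`, the zero padding for positions outside the sequence, the `-1` sentinel used when no neighboring Trp exists, and the descriptor table passed in by the caller are all hypothetical.

```python
def encode_trp(seq, w, descriptors, half_window=5):
    """Encode the residues flanking the Trp at index `w` as a flat feature list.

    `descriptors` maps an amino acid to a (hydrophilicity, flexibility, volume)
    tuple. The values must be supplied by the caller; none are built in here.
    """
    features = []
    for offset in range(-half_window, half_window + 1):
        if offset == 0:
            continue  # the Trp itself is not encoded, only its neighbors
        i = w + offset
        if 0 <= i < len(seq):
            # Assumption: residues missing from the table are encoded as zeros.
            features.extend(descriptors.get(seq[i], (0.0, 0.0, 0.0)))
        else:
            # Assumption: positions beyond the sequence ends are zero-padded.
            features.extend((0.0, 0.0, 0.0))
    # Distances to the nearest Trp on the N- and C-terminal sides.
    left = seq.rfind("W", 0, w)
    right = seq.find("W", w + 1)
    features.append(w - left if left != -1 else -1)    # -1: no Trp on that side
    features.append(right - w if right != -1 else -1)
    return features
```

With `half_window=5` the window spans 11 amino acids, giving 10 x 3 descriptor values plus the 2 distances, i.e. 32 features per Trp. The resulting vectors could then be passed to any binary classifier.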
Position-specific Scoring Matrix (PSSM) for single W-containing motifs
Wsearch uses PSSM to score Trp(W)-containing sequences.
The rationale behind this position-specific approach comes from the experimental observation that tryptophan is crucial for the interaction, and its context modulates the strength of the interaction.
- Not all GW repeats contribute equally to the interaction [Takimoto et al. 2009, Eulalio et al. 2009, Yao et al. 2011, Chekulaeva et al. 2010].
- Mutational analysis showed that substituting residues adjacent to the repeats affected AGO binding [Till et al. 2007, Pfaff et al. 2013].
PSSM Calculation
The source sequence dataset consists of a manually selected collection of GW protein domains already known to interact with AGO proteins, for example the NRPE1 subunit of polymerase V in Arabidopsis:
>gi|79571777|ref|NP_181532.2| NRPE1|1285:1714|Arabidopsis thaliana
DSALGEPKFEDSADFQNLHDEGKPSGANWEKSSSWDNGCSGGSEWGVSKSTGGEANPESNWEKTTNVEKE
DAWSSWNTRKDAQESSKSDSGGAWGIKTKDADADTTPNWETSPAPKDSIVPENNEPTSDVWGHKSVSDKS
WDKKNWGTESAPAAWGSTDAAVWGSSDKKNSETESDAAAWGSRDKNNSDVGSGAGVLGPWNKKSSETESN
GATWGSSDKTKSGAAAWNSWDKKNIETDSEPAAWGSQGKKNSETESGPAAWGAWDKKKSETEPGPAGWGM
GDKKNSETELGPAAMGNWDKKKSDTKSGPAAWGSTDAAAWGSSDKNNSETESDAAAWGSRNKKTSEIESG
AGAWGSWGQPSPTAEDKDTNEDDRNPWVSLKETKSREKDDKERSQWGNPAKKFPSSGGWSNGGGADWKGN
RNHTPRPPRS
From this initial set, we extracted all non-overlapping subsequences that contain a single Trp residue flanked by non-tryptophan amino acids.
>gi|79571777|ref|NP_181532.2| NRPE1|1285:1316|Arabidopsis thaliana
DSALGEPKFEDSADFQNLHDEGKPSGANWEKS
>gi|79571777|ref|NP_181532.2| NRPE1|1316:1324|Arabidopsis thaliana
SSSWDNGCS
>gi|79571777|ref|NP_181532.2| NRPE1|1324:1337|Arabidopsis thaliana
SGGSEWGVSKSTGG
>gi|79571777|ref|NP_181532.2| NRPE1|1337:1351|Arabidopsis thaliana
GEANPESNWEKTTNV
>gi|79571777|ref|NP_181532.2| NRPE1|1351:1358|Arabidopsis thaliana
VEKEDAWS
>gi|79571777|ref|NP_181532.2| NRPE1|1359:1369|Arabidopsis thaliana
SWNTRKDAQES
>gi|79571777|ref|NP_181532.2| NRPE1|1369:1385|Arabidopsis thaliana
SSKSDSGGAWGIKTKDA
>gi|79571777|ref|NP_181532.2| NRPE1|1386:1404|Arabidopsis thaliana
DADTTPNWETSPAPKDSIV
>gi|79571777|ref|NP_181532.2| NRPE1|1404:1420|Arabidopsis thaliana
VPENNEPTSDVWGHKSV
>gi|79571777|ref|NP_181532.2| NRPE1|1420:1427|Arabidopsis thaliana
VSDKSWDK
>gi|79571777|ref|NP_181532.2| NRPE1|1428:1434|Arabidopsis thaliana
KNWGTES
>gi|79571777|ref|NP_181532.2| NRPE1|1435:1443|Arabidopsis thaliana
APAAWGSTD
>gi|79571777|ref|NP_181532.2| NRPE1|1443:1455|Arabidopsis thaliana
DAAVWGSSDKKNS
>gi|79571777|ref|NP_181532.2| NRPE1|1456:1474|Arabidopsis thaliana
ETESDAAAWGSRDKNNSDV
>gi|79571777|ref|NP_181532.2| NRPE1|1474:1491|Arabidopsis thaliana
VGSGAGVLGPWNKKSSET
>gi|79571777|ref|NP_181532.2| NRPE1|1491:1504|Arabidopsis thaliana
TESNGATWGSSDKT
>gi|79571777|ref|NP_181532.2| NRPE1|1505:1512|Arabidopsis thaliana
KSGAAAWN
>gi|79571777|ref|NP_181532.2| NRPE1|1513:1521|Arabidopsis thaliana
SWDKKNIET
>gi|79571777|ref|NP_181532.2| NRPE1|1521:1536|Arabidopsis thaliana
TDSEPAAWGSQGKKNS
>gi|79571777|ref|NP_181532.2| NRPE1|1537:1546|Arabidopsis thaliana
ETESGPAAWG
>gi|79571777|ref|NP_181532.2| NRPE1|1547:1555|Arabidopsis thaliana
AWDKKKSET
>gi|79571777|ref|NP_181532.2| NRPE1|1555:1572|Arabidopsis thaliana
TEPGPAGWGMGDKKNSET
>gi|79571777|ref|NP_181532.2| NRPE1|1572:1589|Arabidopsis thaliana
TELGPAAMGNWDKKKSDT
>gi|79571777|ref|NP_181532.2| NRPE1|1589:1600|Arabidopsis thaliana
TKSGPAAWGSTD
>gi|79571777|ref|NP_181532.2| NRPE1|1600:1612|Arabidopsis thaliana
DAAAWGSSDKNNS
>gi|79571777|ref|NP_181532.2| NRPE1|1613:1629|Arabidopsis thaliana
ETESDAAAWGSRNKKTS
>gi|79571777|ref|NP_181532.2| NRPE1|1630:1639|Arabidopsis thaliana
EIESGAGAWG
>gi|79571777|ref|NP_181532.2| NRPE1|1640:1651|Arabidopsis thaliana
SWGQPSPTAEDK
>gi|79571777|ref|NP_181532.2| NRPE1|1651:1670|Arabidopsis thaliana
KDTNEDDRNPWVSLKETKSR
>gi|79571777|ref|NP_181532.2| NRPE1|1671:1686|Arabidopsis thaliana
EKDDKERSQWGNPAKK
>gi|79571777|ref|NP_181532.2| NRPE1|1687:1697|Arabidopsis thaliana
FPSSGGWSNGG
>gi|79571777|ref|NP_181532.2| NRPE1|1697:1714|Arabidopsis thaliana
GGADWKGNRNHTPRPPRS
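The extraction step can be sketched in Python as follows. The boundary rule used here is inferred from the NRPE1 listing above and is an assumption rather than a documented Whub rule: each stretch between two consecutive Trp residues is divided in half, and the middle residue of an odd-length stretch is kept on both sides. The name `split_single_w` is hypothetical.

```python
def split_single_w(seq):
    """Split `seq` into subsequences that each contain exactly one Trp (W).

    Returns a list of (start_index, subsequence) tuples, with 0-based starts.
    """
    w_positions = [i for i, aa in enumerate(seq) if aa == "W"]
    motifs = []
    for k, w in enumerate(w_positions):
        if k == 0:
            start = 0  # the first motif extends to the sequence start
        else:
            gap = w - w_positions[k - 1] - 1   # residues between the two Trp
            start = w - (gap + 1) // 2         # take the C-terminal half of the gap
        if k == len(w_positions) - 1:
            end = len(seq)  # the last motif extends to the sequence end
        else:
            gap = w_positions[k + 1] - w - 1
            end = w + 1 + (gap + 1) // 2       # take the N-terminal half of the gap
        motifs.append((start, seq[start:end]))
    return motifs
```

For example, `split_single_w("ANWEKSSSWDNGCSGGSEWGV")` returns `[(0, "ANWEKS"), (5, "SSSWDNGCS"), (13, "SGGSEWGV")]`; the middle subsequence matches the second NRPE1 entry in the listing.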
Sequence motifs are aligned to center the Trp. The Trp residue is the midpoint at position 0, and the flanking non-Trp residues extend in both directions, toward the N- and C-termini.
DSALGEPKFEDSADFQNLHDEGKPSGANWEKS----------
-------------------------SSSWDNGCS--------
----------------------FPSSGGWSNGG---------
-----------------------SGGSEWGVSKSTGG-----
-----------------------VSDKSWDK-----------
------------------VGSGAGVLGPWNKKSSET------
---------------------------SWDKKNIET------
---------------------------AWDKKKSET------
------------------TELGPAAMGNWDKKKSDT------
------------------------APAAWGSTD---------
---------------------TKSGPAAWGSTD---------
--------------------ETESGPAAWG------------
---------------------TEPGPAGWGMGDKKNSET---
------------------------DAAVWGSSDKKNS-----
--------------------ETESDAAAWGSRDKNNSDV---
------------------------DAAAWGSSDKNNS-----
--------------------ETESDAAAWGSRNKKTS-----
----------------------KSGAAAWN------------
--------------------EIESGAGAWG------------
---------------------TDSEPAAWGSQGKKNS-----
---------------------TESNGATWGSSDKT-------
--------------------------KNWGTES---------
---------------------------SWGQPSPTAEDK---
-------------------SSKSDSGGAWGIKTKDA------
-----------------VPENNEPTSDVWGHKSV--------
---------------------DADTTPNWETSPAPKDSIV--
-------------------EKDDKERSQWGNPAKK-------
----------------------VEKEDAWS------------
--------------------GEANPESNWEKTTNV-------
---------------------------SWNTRKDAQES----
------------------------GGADWKGNRNHTPRPPRS
------------------KDTNEDDRNPWVSLKETKSR----
                            *
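Centering the motifs on Trp amounts to padding each subsequence with gap characters so that every W falls in the same column. A minimal Python sketch, assuming each input string contains exactly one W (the name `center_on_w` is hypothetical):

```python
def center_on_w(motifs):
    """Pad single-W motifs with '-' so that all W residues share one column."""
    # Longest N-terminal and C-terminal flanks set the width of the alignment.
    left = max(m.index("W") for m in motifs)
    right = max(len(m) - m.index("W") - 1 for m in motifs)
    aligned = []
    for m in motifs:
        w = m.index("W")
        pad_left = "-" * (left - w)
        pad_right = "-" * (right - (len(m) - w - 1))
        aligned.append(pad_left + m + pad_right)
    return aligned
```

Calling `center_on_w(["SSSWDNGCS", "KNWGTES"])` gives `["SSSWDNGCS", "-KNWGTES-"]`, with W at the same index in both rows.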
Because the subsequences differ in length, the weight of Trp-flanking residues decreases with distance from the Trp.
Observed frequencies ($P^{\mathrm{obs}}$) of all non-tryptophan residues flanking Trp were obtained from counts of amino acids within each column of the profile, as follows:

$$P^{\mathrm{obs}}_{ia} = \frac{N_{ia}}{N},$$

where $i$ is each position in the motif sequence; $a$ is each amino acid present at the given position $i$; $N_{ia}$ is the number of occurrences of amino acid $a$ at the given position $i$; and $N$ is the number of motif sequences. The observed frequencies were compared to the corresponding expected frequencies ($P^{\mathrm{exp}}$) obtained from background subsequences in UniProt and used to calculate a log-odds score according to the following formula:

$$D_{ia} = 2 \times \log_2\left(\frac{P^{\mathrm{obs}}_{ia}}{P^{\mathrm{exp}}_{ia}}\right).$$
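The calculation above can be illustrated with a short Python sketch. It assumes the motifs are already aligned on Trp with `-` as the gap character. For brevity it uses a single position-independent background table `p_exp`, which is a simplification; the function name and the background values used in any example are hypothetical and not taken from Whub or UniProt.

```python
import math
from collections import Counter

def log_odds_matrix(aligned, p_exp):
    """Return {(offset_from_W, amino_acid): log-odds score} for aligned motifs.

    `p_exp` maps each amino acid to its expected (background) frequency.
    """
    n = len(aligned)                   # N: number of motif sequences
    center = aligned[0].index("W")     # column holding the aligned Trp
    matrix = {}
    for col in range(len(aligned[0])):
        offset = col - center
        if offset == 0:
            continue                   # the Trp column itself is not scored
        # N_ia: occurrences of each amino acid in this column (gaps ignored)
        counts = Counter(row[col] for row in aligned if row[col] != "-")
        for aa, n_ia in counts.items():
            p_obs = n_ia / n           # P_obs = N_ia / N
            matrix[(offset, aa)] = 2 * math.log2(p_obs / p_exp[aa])
    return matrix
```

Because $N$ counts all motif sequences rather than only those covering a column, sparsely occupied columns far from Trp receive lower observed frequencies and therefore lower scores.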
Scoring W-containing motifs
As a starting point, the locations of Trp residues are found in the protein sequence. By analogy with the extension of alignments in the BLAST algorithm, scoring progresses to the left and right of each seed Trp one residue at a time. For each direction, cumulative scores are calculated using the PSSM values for every position. The extension continues until another W residue is met, and the motif boundary is placed where the running score reaches its maximum accumulated value. The scores from the left and right directions are summed and the highest-scoring single motif is returned. Overlapping motifs are joined and a score is calculated for the assembled domain.
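The seed-and-extend scoring of a single motif can be sketched in Python as follows. This is a hedged illustration, not the Wsearch code: the PSSM is represented as a dictionary keyed by `(offset_from_W, amino_acid)`, residues absent from the matrix are assumed to score 0, and the names `extend` and `score_motif` are hypothetical. Joining overlapping motifs into domains is not covered.

```python
def extend(seq, w, step, pssm):
    """Extend from the Trp at index `w` in one direction (step = -1 or +1).

    Returns (best cumulative score, number of residues included).
    """
    best, best_len, running = 0.0, 0, 0.0
    i, offset = w + step, step
    # Stop at the sequence end or when another Trp is met.
    while 0 <= i < len(seq) and seq[i] != "W":
        running += pssm.get((offset, seq[i]), 0.0)  # assumption: unknown -> 0
        if running > best:
            best, best_len = running, abs(offset)   # remember the maximum
        i += step
        offset += step
    return best, best_len

def score_motif(seq, w, pssm):
    """Return (score, motif) for the highest-scoring motif around seq[w]."""
    left_score, left_len = extend(seq, w, -1, pssm)
    right_score, right_len = extend(seq, w, +1, pssm)
    motif = seq[w - left_len : w + right_len + 1]
    return left_score + right_score, motif
```

With a toy matrix `{(-1, "G"): 2, (-2, "A"): -1, (-3, "A"): -1, (1, "G"): 3, (2, "S"): 1, (3, "K"): -5}`, `score_motif("AAGWGSKW", 3, pssm)` returns `(6.0, "GWGS")`: the left extension peaks after one residue, the right after two.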
We formulate AGO-binding activity as a binary classification problem, where a W-motif is labeled 1 if it is an AGO-binding site and 0 otherwise. We employ the sliding window technique to encode the amino acid residues flanking the Trp residues of proteins. Each amino acid residue is characterized by a number of descriptors, including properties of amino acid residues, as well as two distances from the motif to the nearest Trp residues on the N- and C-terminal sides. In other words, whether a Trp residue belongs to the AGO-binding class is determined by the context of its neighboring residues and its distances to the nearest Trp residues.
Prediction performance was evaluated by 10-fold cross-validation experiments. For example, when a window of 11 amino acids is selected, the RF-based classifier achieves a sensitivity of 94.29%, a specificity of 99.68% and a precision of 99.01%. The prediction performance is slightly improved (SN = 95.17%; SP = 99.94%; PPV = 99.68%) by using a combination of three residue features (flexibility, hydrophilicity and volume) as the descriptor of AGO-binding information.
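For reference, the three reported measures follow the standard definitions based on the confusion matrix: sensitivity SN = TP / (TP + FN), specificity SP = TN / (TN + FP), and precision PPV = TP / (TP + FP). A minimal Python sketch (the function name is hypothetical, and the counts in the usage note are made up for illustration and are not Whub results):

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute sensitivity, specificity and precision from confusion counts."""
    return {
        "SN": tp / (tp + fn),    # fraction of true binding sites recovered
        "SP": tn / (tn + fp),    # fraction of non-binding sites rejected
        "PPV": tp / (tp + fp),   # fraction of predicted sites that are true
    }
```

For instance, `binary_metrics(tp=90, fp=5, tn=95, fn=10)` yields SN = 0.90 and SP = 0.95. In a 10-fold cross-validation these values would typically be computed per fold and then averaged.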