※ User Guide:


Frequently Asked Questions:

1. Q: How do I use the GPS 3.0 software?

A: The latest version of GPS 3.0 is available at http://gps.biocuckoo.org/download.php. Download the software and install it on your computer. GPS 3.0 is implemented in Java and can be installed on a computer running Windows, Linux, Unix or Mac OS. A user manual is included in the installation package.

 

2. Q: What's the difference between simple prediction and comprehensive prediction?

A: The only difference is that the simple prediction does not provide annotations of surface accessibility and secondary structure. These annotations are produced by NetSurfP ver. 1.1 [PMID: 19646261], which requires a long computation time. Therefore, in the simple prediction, surface accessibility and secondary structure are not visualized.

 

3. Q: How do I read the GPS 3.0 results?

A: Here we use the human protein BimEL as an example. After clicking "Submit", the prediction results will be shown as follows:


<1>. The table of the GPS 3.0 results (Page 1)

ID: The name/ID of the protein sequence that you submitted for prediction.

Position: The position of the site which is predicted to be phosphorylated.

Code: The residue which is predicted to be phosphorylated.

Kinase: The regulatory kinase which is predicted to phosphorylate the site.

Peptide: The predicted phosphopeptide with 7 amino acids upstream and 7 amino acids downstream around the modified residue.

Score: The value calculated by the GPS algorithm to evaluate the potential of phosphorylation. The higher the value, the more likely the residue is to be phosphorylated.

Cutoff: The cutoff value under the selected threshold. Different thresholds correspond to different precision, sensitivity and specificity.
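
To illustrate the Peptide column, the following minimal Python sketch extracts a residue together with 7 amino acids upstream and 7 downstream. It is not taken from the GPS 3.0 source code. The function name psp_window, the "*" padding character used when the window runs past either end of the sequence, and the toy sequence are all assumptions made for this example.

```python
def psp_window(sequence: str, position: int, flank: int = 7, pad: str = "*") -> str:
    """Return the residue at a 1-based position with `flank` residues on each side.

    Window positions that fall outside the sequence are filled with `pad`
    (the padding character is an assumption of this sketch).
    """
    if not 1 <= position <= len(sequence):
        raise ValueError("position is outside the sequence")
    centre = position - 1  # convert the 1-based position to a 0-based index
    window = []
    for i in range(centre - flank, centre + flank + 1):
        window.append(sequence[i] if 0 <= i < len(sequence) else pad)
    return "".join(window)


# Made-up 20-residue sequence used only for illustration.
toy_sequence = "MSTAYKLGSPTEDRYQWSGH"

peptide = psp_window(toy_sequence, 9)
print(peptide)     # 15 characters centred on the residue at position 9
print(peptide[7])  # the central (candidate) residue
```

With the default flank of 7, the central residue is always at index 7 of the returned 15-character string, which is the residue reported in the Code column.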


<2>. The visualization of simple prediction

Part 1: The visualization of the protein disordered regions predicted by IUPred [PMID: 15955779]. The cutoff is 0.5: if the prediction score of a residue is greater than the cutoff, the residue is considered to be in a disordered region.

Part 2:
Left: The visualization of the positional distribution of the predicted sites in the protein sequence.
Middle left: The distribution of S/T/Y sites in kinase groups.
Middle right: The distribution of S/T/Y sites.
Right: The distribution of S/T/Y sites in disordered region.


<3>. The visualization of comprehensive prediction

Part 1:
Top: The surface accessibility of amino acids and the protein disordered regions were predicted by NetSurfP ver. 1.1 (PMID: 19646261) and IUPred (PMID: 15955779), respectively. The cutoff for disordered region prediction is 0.5: if the prediction score is greater than the cutoff, the residue is considered to be in a disordered region. The cutoff for surface accessibility prediction is 0.25: if the prediction score is greater than the cutoff, the residue is considered to be surface exposed.
Bottom: The positions of the predicted phosphorylation sites are visualized in the protein sequence together with the secondary structure predicted by NetSurfP ver. 1.1 (PMID: 19646261).
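
The two cutoffs described in Part 1 can be applied per residue as in the following minimal Python sketch. The function name annotate_residues and the example scores are made up for illustration. Only the cutoff values (0.5 and 0.25) and the strict "greater than" comparison come from the text above.

```python
DISORDER_CUTOFF = 0.5        # cutoff for the disorder score, as stated above
ACCESSIBILITY_CUTOFF = 0.25  # cutoff for the surface accessibility score, as stated above


def annotate_residues(disorder_scores, accessibility_scores):
    """Label each residue as disordered and/or surface exposed.

    Both arguments are per-residue score lists of equal length.
    A residue is labelled only if its score is strictly greater than the cutoff.
    """
    if len(disorder_scores) != len(accessibility_scores):
        raise ValueError("score lists must have the same length")
    labels = []
    for disorder, accessibility in zip(disorder_scores, accessibility_scores):
        labels.append({
            "disordered": disorder > DISORDER_CUTOFF,
            "exposed": accessibility > ACCESSIBILITY_CUTOFF,
        })
    return labels


# Made-up scores for three residues, not output from IUPred or NetSurfP.
example = annotate_residues([0.81, 0.42, 0.50], [0.10, 0.31, 0.25])
for index, label in enumerate(example, start=1):
    print(index, label)
```

Note that a score exactly equal to the cutoff (the third residue in the example) is not labelled, because the comparison is strict.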

Part 2:
Left: The distribution of S/T/Y sites in kinase groups.
Middle left: The distribution of S/T/Y sites.
Middle right: The distribution of S/T/Y sites in secondary structure.
Right: The distribution of S/T/Y sites in disordered region.

 

4. Q: Is GPS 3.0 accurate?

A: Yes, but not for all kinases. Prediction of kinase-specific phosphorylation sites is a very difficult problem. If enough training data is available, the prediction is satisfying and accurate. For many protein kinases, however, the training data sets are very limited, which lowers the performance. For kinase-specific prediction, no algorithm or approach can reach the best performance for all protein kinases. By comparison, the prediction performance of GPS is better than, or at least comparable to, that of previous tools. We will also update GPS routinely to make it more accurate and powerful.

 

5. Q: How were the cut-off values and the thresholds chosen?

A: First, we calculated the theoretically maximal false positive rate (FPR) for each PK cluster. The three thresholds of GPS 3.0 were then set based on the calculated FPRs. For serine/threonine kinases, the high, medium and low thresholds correspond to FPRs of 2%, 6% and 10%. For tyrosine kinases, the high, medium and low thresholds correspond to FPRs of 4%, 9% and 15%. The high threshold was validated by a large-scale prediction of mammalian phosphorylation sites, with satisfying performance. The medium threshold relaxes the stringency and is useful in small-scale experiments. The low threshold reduces the specificity (Sp) to improve the sensitivity (Sn) considerably, and is useful for experiments that aim to exhaustively identify all potential phosphorylation sites in substrates.
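
As an illustration of how a score cutoff can be derived from a target FPR, the following minimal Python sketch picks the cutoff from a list of scores obtained on negative sites. It is not the GPS implementation. The function names and the example scores are assumptions, and the rule "a site is a positive hit if its score is strictly greater than the cutoff" is also an assumption of this sketch. Only the FPR values in THRESHOLD_FPR come from the answer above.

```python
import math

# Target FPRs per threshold, as listed in the answer above.
THRESHOLD_FPR = {
    "serine/threonine": {"high": 0.02, "medium": 0.06, "low": 0.10},
    "tyrosine": {"high": 0.04, "medium": 0.09, "low": 0.15},
}


def cutoff_for_fpr(negative_scores, target_fpr):
    """Return a cutoff such that at most `target_fpr` of the negative
    scores are strictly greater than it."""
    if not negative_scores:
        raise ValueError("negative_scores must not be empty")
    if not 0.0 <= target_fpr <= 1.0:
        raise ValueError("target_fpr must be between 0 and 1")
    ranked = sorted(negative_scores, reverse=True)
    # Largest number of false positives allowed (small epsilon guards against
    # floating-point results such as 5.999999 instead of 6).
    allowed = math.floor(target_fpr * len(ranked) + 1e-9)
    if allowed >= len(ranked):
        return float("-inf")  # every negative site may be called positive
    return ranked[allowed]


def observed_fpr(negative_scores, cutoff):
    """Fraction of negative scores that are strictly greater than the cutoff."""
    return sum(1 for s in negative_scores if s > cutoff) / len(negative_scores)


# Made-up negative scores 0..99, used only to show the behaviour.
scores = list(range(100))
for level, fpr in THRESHOLD_FPR["serine/threonine"].items():
    cutoff = cutoff_for_fpr(scores, fpr)
    print(level, cutoff, observed_fpr(scores, cutoff))
```

A lower target FPR yields a higher cutoff, which is why the high threshold is the most stringent of the three.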

 

6. Q: What's the meaning of False Positive Rate (FPR)?

A: The false positive rate (FPR) is the proportion of negative sites that are erroneously predicted as positive hits. Given a data set containing only true non-phosphorylation sites, the real FPR could be easily computed. However, a precise calculation of the FPR is not possible because there is no "gold-standard" negative data set. We therefore developed a simple and fast method to construct a near-negative data set and estimate the theoretically maximal FPRs. First, we calculated the distributions of amino acid composition in six organisms: S. cerevisiae, S. pombe, C. elegans, D. melanogaster, M. musculus and H. sapiens. Then we randomly generated 10,000 PSP(7,7) peptides to construct a near-negative data set based on the real frequencies of the twenty amino acids in eukaryotic proteomes. Although a few of these peptides may be real hits, their proportion would be very small. The process was repeated twenty times, and the average FPR calculated by GPS 2.0 was taken as the theoretically maximal FPR. Alternatively, the negative sites can be randomly retrieved from eukaryotic proteomes; the results from both methods are very similar.
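
The procedure described above can be sketched in Python as follows. This is a minimal illustration, not the authors' code. The uniform amino acid frequencies are a placeholder for the real proteome frequencies, toy_score is a stand-in for the GPS scoring function, and fixing the central residue of each random peptide to S is an assumption made here so that every peptide has the PSP(7,7) layout.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Placeholder frequencies (uniform); real proteome frequencies would go here.
UNIFORM_FREQUENCIES = {aa: 1 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}


def random_peptides(frequencies, count=10_000, flank=7, centre="S", rng=None):
    """Generate `count` random peptides with `flank` residues on each side
    of a fixed central residue, sampled by amino acid frequency."""
    rng = rng or random.Random()
    residues = list(frequencies)
    weights = [frequencies[r] for r in residues]
    peptides = []
    for _ in range(count):
        left = rng.choices(residues, weights=weights, k=flank)
        right = rng.choices(residues, weights=weights, k=flank)
        peptides.append("".join(left) + centre + "".join(right))
    return peptides


def estimate_max_fpr(score, cutoff, frequencies, repeats=20, count=10_000, seed=0):
    """Average, over `repeats` random data sets, the fraction of random
    peptides whose score is strictly greater than `cutoff`."""
    rng = random.Random(seed)
    rates = []
    for _ in range(repeats):
        peptides = random_peptides(frequencies, count=count, rng=rng)
        hits = sum(1 for peptide in peptides if score(peptide) > cutoff)
        rates.append(hits / count)
    return sum(rates) / len(rates)


def toy_score(peptide):
    """Stand-in scoring function (fraction of prolines); NOT the GPS algorithm."""
    return peptide.count("P") / len(peptide)


print(estimate_max_fpr(toy_score, cutoff=0.2, frequencies=UNIFORM_FREQUENCIES))
```

The defaults of 10,000 peptides and 20 repeats mirror the numbers given in the answer; everything else is illustrative.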

 

7. Q: I was trying to install the software on Mac OS, but the installer says the file is damaged. How can I properly install the software on Mac OS?

A: By default, Mac OS 10.8 or later only allows users to install applications from 'verified sources'. In effect, this means that users are unable to install most applications downloaded from the internet. You can follow the directions below to prevent this error message from appearing.
(1) Open System Preferences. This can be done either by clicking the System Preferences icon in the Dock or by going to Apple Menu > System Preferences.
(2) Open the Security & Privacy pane by clicking Security & Privacy.
(3) Make sure that the General section of the Security & Privacy pane is selected. Click the lock icon labeled Click the lock to make changes.
(4) Enter your username and password into the prompt that appears and click Unlock.
(5) Under the section labeled Allow applications downloaded from, select Anywhere. On the prompt that appears, click Allow From Anywhere.
(6) Exit System Preferences by clicking the red button in the upper left of the window. You should now be able to install applications downloaded from the internet.

 

8. Q: I have questions that are not listed above. How can I contact the authors of GPS 3.0?

A: Please contact the two major authors, Dr. Yu Xue and Dr. Jian Ren, for details.

 

 


Supplementary

An application of GPS 2.0:

A large-scale prediction of mammalian kinase-specific phosphorylation sites
As an application of GPS 2.0, we performed a large-scale prediction of kinase-specific phosphorylation sites in mammals. The high threshold of GPS 2.0 was chosen, with an FPR of 2% for serine/threonine kinases and 4% for tyrosine kinases. The predictor for budding yeast IPL1 was not used. Phospho.ELM 6.0 contained 13,254 mammalian sites, including 9,717 S, 1,818 T and 1,719 Y sites (see Table 1).

Table 1 - Data statistics of a large-scale prediction for mammalian kinase-specific sites. Phospho.ELM 6.0 contained 13,254 sites identified in mammals, and GPS 2.0 could predict 12,219 of them with at least one cognate kinase.

Phospho.ELM 6.0      Mammalian    Predicted    Coverage
Total     Sites      13254        12219        92.19%
          Pro.       4291         4071         94.87%
S         Sites      9717         9195         94.63%
          Pro.       3444         3325         96.54%
T         Sites      1818         1551         85.31%
          Pro.       1200         1048         87.33%
Y         Sites      1719         1473         85.69%
          Pro.       885          768          86.78%

These sites were experimentally identified, but the kinase information for more than 10,000 of them remains to be annotated. We divided the data set into three groups: the known substrates of the PK used for prediction (Known sub.), the known substrates of other kinases (Other's sub.), and the sites without PK information (Unknown sub.). For example, 306 sites were experimentally identified as PKA sites in mammals, 1,993 sites were verified as substrates of other PKs, and 9,236 sites were un-annotated. For the first group (Known sub.), the sensitivity (Sn) was calculated to describe the proportion of existing sites that are correctly predicted. For the latter two groups, the precision (Pr) was calculated to estimate the minimal accuracy of large-scale predictions.

For 143 serine/threonine and 69 tyrosine PK groups, the Sn values for known substrates and the Pr values for unknown data were calculated. Most of the predictors achieved satisfying performance. For example, GPS 2.0 predicted 200 of the 306 known PKA sites as positive hits, with an Sn of 65.36%. Of the 1,993 sites phosphorylated by other PKs, GPS 2.0 predicted 220 as positive hits, with a Pr of 81.88%, which means that at least 81.88% of the 220 predicted sites may be true positive sites. Likewise, of the 9,236 un-annotated sites, GPS 2.0 predicted 959 as positive sites, with a Pr of 80.74%. However, if the real positive sites are very few in the full data set, or if real positive sites occur even less often than in randomly generated data, the Pr value can be very small and even lower than zero. In our analysis, 53 PK groups (25% of the 212 PK groups) showed low performance, and the prediction results of these predictors were removed from this work. In total, 12,219 sites were predicted with at least one PK, giving a total coverage of 92.19%. The total prediction results are available: Large-scale_Prediction.txt.gz
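
The Sn and Pr values quoted above can be reproduced with the following minimal Python sketch. The Sn formula is the usual ratio of correctly predicted sites to known sites. The Pr formula is not stated in this guide: the expression used here, Pr = 1 - FPR / (predicted hits / total sites), is an inference, adopted only because it reproduces the reported 81.88% and 80.74% when the 2% high-threshold FPR for serine/threonine kinases is used. The function names are made up for this example.

```python
def sensitivity(predicted_known_sites, known_sites):
    """Sn: proportion of known sites that are predicted as positive hits."""
    return predicted_known_sites / known_sites


def precision_from_fpr(predicted_hits, total_sites, fpr):
    """Pr inferred as 1 - FPR / (observed hit rate).

    This formula is an assumption that matches the reported numbers;
    it becomes negative when the observed hit rate is below the FPR.
    """
    if predicted_hits == 0:
        raise ValueError("Pr is undefined when no site is predicted")
    observed_rate = predicted_hits / total_sites
    return 1 - fpr / observed_rate


HIGH_THRESHOLD_FPR = 0.02  # serine/threonine kinases, high threshold

sn_pka = sensitivity(200, 306)
pr_other = precision_from_fpr(220, 1993, HIGH_THRESHOLD_FPR)
pr_unknown = precision_from_fpr(959, 9236, HIGH_THRESHOLD_FPR)

print(f"Sn (known PKA sites):      {sn_pka:.2%}")
print(f"Pr (other PKs' substrates): {pr_other:.2%}")
print(f"Pr (un-annotated sites):    {pr_unknown:.2%}")
```

Under this reading, a predictor whose hit rate on a data set is lower than its FPR yields a negative Pr, which is consistent with the remark above that Pr can be lower than zero.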