SPD Home Page

SPD frequently asked questions

"*" marks the questions that are probably asked most frequently.

What does SPD do?

  Secreted proteins, such as cytokines, chemokines, hormones, digestive enzymes, antibodies and components of the extracellular matrix, are released from the cell into the extracellular medium. They play pivotal regulatory roles and are the most important sources of protein therapeutics. This class of proteins is referred to as the secretome. In recent years, several groups have begun to study secreted proteins from a genomic perspective and have made some candidate secreted proteins public. However, no integrated, web-accessible resource has been available until now.

  Using a revised secreted protein prediction method and comprehensive data sources, including SwissProt, TrEMBL, RefSeq, ENSEMBL and CBI Gene, we set up the secretomes of human, mouse and rat, amounting to 18152 secreted proteins. All entries are ranked according to prediction confidence, annotated, and classified into functional categories. To make the dataset more comprehensive, nine related datasets were also collected.

How do I cite this database?

   Please cite: Chen Y, Zhang Y, Yin Y, Gao G, Li S, Jiang Y, Gu X, Luo J (2005) SPD--a web-based secreted protein database. Nucleic Acids Res 33 Database Issue:D169-173.

   In addition, an important component used in the development of SPD, 'CJ-SPHMM', was reported in:
   Chen, Y., Yu, P., Luo, J. and Jiang, Y. (2003) Secreted protein prediction system combining CJ-SPHMM, TMHMM, and PSORT. Mamm Genome, 14, 859-865.

What data does SPD contain, and how much?

  As mentioned above, 18152 secreted proteins were collected via a bioinformatic pipeline; they make up the core data of SPD (see also the Data statistics page for more information).

  To make the dataset more comprehensive, entries from nine related datasets were also included:

  1. http://npd.hgu.mrc.ac.uk/ (2252 sequences)
  2. http://bioinfo.si.hirosaki-u.ac.jp/~TMPDB/ (302 sequences)
  3. http://www.cbs.dtu.dk/databases/NESbase/ (70 sequences)
  4. http://www.bioinfo.tsinghua.edu.cn/dbDBSubLoc.html (64094 sequences)
  5. SPDI: The secreted protein discovery initiative (1047 sequences)
  6. Riken mouse secretome (1610 sequences)
  7. Human and pufferfish secreted proteins (28078 sequences)
  8. Known SwissProt vertebrate secreted proteins (human, mouse and rat excluded) (2813 sequences)
  9. Secreted proteins extracted from GOA assignment (2938 sequences)
  The first eight datasets were downloaded in April 2004, and the GOA dataset was downloaded in July 2004. Because these databases have since been modified or updated, the number of proteins from SPDI, Riken or the human/pufferfish secretome may differ from the number reported when the corresponding paper was published.

  TMPDB, NPD, NESbase and DBSubLoc were downloaded from their own websites. They collect transmembrane proteins, nuclear proteins, proteins with a nuclear export signal, and proteins with subcellular location annotation, respectively.

  SPDI, the Riken mouse secretome and the human/pufferfish secretome were provided as supplementary materials. Some SPDI and Riken sequences are sorted to the membrane. We downloaded these files and retrieved the corresponding sequences from GenBank, the Riken FANTOM collection, and IPI/JGI/GenBank, respectively.

  The vertebrate secretome was pulled out of SwissProt with keyword queries such as "secreted", and every hit was then checked manually. Finally, human, mouse and rat sequences were excluded, because they are already collected in the SPD core dataset.

  The last dataset, secreted proteins from GOA, consists of proteins assigned the GO terms 0005578 (extracellular matrix) or 0005615 (extracellular space), or any of their child terms.
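
  The GOA selection rule above (two GO terms plus all of their child terms) can be illustrated with a minimal sketch. It is not the script used to build SPD. Only the two root IDs come from the text; the child term IDs, the protein IDs and the function names are hypothetical, and a real run would read the relationships from a GO ontology file.

```python
from collections import deque

# Child -> parents edges (is_a / part_of). Only the two root IDs are taken
# from the text; all "GO:X..." IDs are made up for illustration.
PARENTS = {
    "GO:0005578": [],              # extracellular matrix
    "GO:0005615": [],              # extracellular space
    "GO:X000001": ["GO:0005578"],  # hypothetical child term
    "GO:X000002": ["GO:X000001"],  # hypothetical grandchild term
    "GO:X000003": ["GO:0005615"],  # hypothetical child term
    "GO:X000004": [],              # hypothetical unrelated term
}


def terms_with_descendants(roots, parents):
    """Return the root terms together with all of their descendants."""
    # Invert the child -> parents map so the graph can be walked downwards.
    children = {}
    for term, term_parents in parents.items():
        for parent in term_parents:
            children.setdefault(parent, []).append(term)

    selected = set(roots)
    queue = deque(roots)
    while queue:
        term = queue.popleft()
        for child in children.get(term, []):
            if child not in selected:
                selected.add(child)
                queue.append(child)
    return selected


def secreted_by_go(annotations, selected_terms):
    """Return the IDs of proteins annotated with at least one selected term."""
    return sorted(
        protein
        for protein, terms in annotations.items()
        if selected_terms & set(terms)
    )


if __name__ == "__main__":
    selected = terms_with_descendants(["GO:0005578", "GO:0005615"], PARENTS)
    # Hypothetical protein annotations.
    annotations = {"P1": ["GO:X000002"], "P2": ["GO:X000004"]}
    print(secreted_by_go(annotations, selected))
```

  A breadth-first walk is used here so that a term reachable through several parents is visited only once.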

  Generally speaking, the first three datasets serve as negative controls, and our sequences should not occur in them. By contrast, the last five serve as positive controls, and most of their entries are included in our dataset. DBSubLoc is useful in both respects.

  Taking SPDI as an example, 307 of its secreted proteins are matched by the human division of SPD with an overall identity of at least 90% over at least 80% of the length (4). Most SPDI entries are membrane receptors or other non-secreted proteins, and only about 400 entries are secreted proteins, so the core dataset covers about 75% of the SPDI secreted proteins. Taking the Riken mouse secretome as another example (3), the coverage declines to 65% with the same cutoff, because Riken includes some sequences that SWISS-PROT, TrEMBL, ENSEMBL, RefSeq and CBI Gene do not contain (8-10). In contrast, among the 2252 NPD sequences, 1967 are from human, mouse and rat, and only 95 of these (5%) are matched by the SPD core dataset (11). Of 49 entries from TMPDB (12), 6 are matched by the core dataset, 2 of which are polytopic. In total, the SPD core dataset includes 65-75 percent of the positive control datasets and 5-8 percent of the negative control datasets.
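
  The matching criterion and the coverage figures above come down to simple arithmetic, sketched here. The function names are hypothetical, and defining length coverage as aligned length divided by sequence length is an assumption, since the text does not say which sequence the 80% refers to.

```python
def passes_cutoff(identity_pct, aligned_len, seq_len,
                  min_identity=90.0, min_coverage=80.0):
    """True if a match reaches the identity and length coverage cutoffs.

    Coverage is assumed to be aligned length / sequence length.
    """
    coverage_pct = 100.0 * aligned_len / seq_len
    return identity_pct >= min_identity and coverage_pct >= min_coverage


def coverage_percent(matched, total):
    """Percentage of a control dataset that is matched by the core dataset."""
    return 100.0 * matched / total


if __name__ == "__main__":
    # Counts quoted in the text; 400 is the approximate number of SPDI
    # entries that are secreted proteins.
    print(f"SPDI (positive control): {coverage_percent(307, 400):.1f}%")
    print(f"NPD (negative control):  {coverage_percent(95, 1967):.1f}%")
```

  With the approximate total of 400 secreted SPDI entries, 307 matches correspond to roughly 77%, in line with the "about 75%" quoted above; 95 of 1967 NPD sequences is just under 5%.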

  Together, these nine datasets make up a comprehensive reference dataset, which is also searchable in SPD via BLAST; from the hits, users can jump to similar SPD core sequences. Additionally, a tribe-mcl based clustering mechanism is implemented, which integrates the core dataset and the nine reference datasets.

Is the SPD core dataset comprehensive enough?

  Taking the SPDI collection as an example, over 320 secreted proteins are covered by the human division of SPD. Since most SPDI entries are membrane receptors or other non-secreted proteins, and only about 400 entries are actually secreted, SPD covers about 80% of the SPDI secreted proteins.

  Taking the Riken mouse secretome as another example, over 1200 entries are covered by the mouse division of SPD (Riken total: about 1600).

How do I use SPD?

  If you know a protein's SwissProt, TrEMBL, Ensembl, RefSeq or CBI Gene accession number, just search for it in the text search interface. Protein function descriptions are also searchable. If you have a protein or DNA sequence, search with it to check whether an identical or similar entry exists in the SPD core dataset or the reference dataset.

  If you are interested in the overall distribution of secreted proteins on chromosomes or in the Gene Ontology catalog, the Chromosomal browser and the GO browser will be helpful. The former shows the proteins that match a chosen combination of species, confidence rank and data source. The latter can be browsed at different levels of the GO hierarchy.

  From the search page or a browser page, you can navigate to the entry page of a specific protein, which shows general information, such as function and sequence, as well as comparative information (see also the entry layout and the tips for judging whether a protein is a true positive).

  The head bar on each page returns you to the home page and also provides a search interface. Via "quick search", you can jump directly to a protein of interest if you know its protein ID. The foot bar shows copyright and contact information as well as the update time of each page.

  In the Download section, most information is free to download.

How was SPD constructed?

  A schematic figure is shown below:

  After this pipeline, functional classification is performed.

  1. First, we extracted all vertebrate secreted proteins from the SwissProt database and classified them into 12 classes: antibiotic protein, apolipoprotein, casein, cytokine, hormone, immune system protein, neuropeptide and defence peptide, protease, protease inhibitor, toxin, wnt protein, and all other secreted proteins. In the following steps, the first 11 classes of known secreted proteins are used to classify the novel ones.
  2. From the cross-link information in SwissProt, we obtained the PROSITE, PFAM, SMART and PRINTS entries that are characteristic of secreted proteins.
  3. From these, we selected the PROSITE, PFAM, SMART and PRINTS entries that are characteristic of only one of the 11 classes. An example is the wnt entry in PFAM, which stands only for wnt proteins: a protein that has a wnt domain must belong to the wnt class.
  4. We used BLAST to compare our predicted novel secreted proteins with the 11 classes of known proteins: if a novel secreted protein is similar to a known one (identity >= 50% and coverage length >= 80%) and the known protein belongs to class A (one of the 11 classes), the novel protein is put into class A.
  5. Novel secreted proteins that were not classified in Step 4 were compared with the PROSITE, PFAM, SMART and PRINTS entries obtained in Step 3: if a novel protein contains a motif or domain that stands only for class A (one of the 11 classes), the protein is put into class A.
  6. (Some domains are specific to a cellular compartment, as already described in "Predicting protein cellular localization using a domain projection method".)
  7. After the above 5 steps, some novel secreted proteins have been classified into the 11 classes. However, many still cannot be classified accurately. We divided these proteins into 3 parts: proteins in the first part (called ONLY12) contain motifs or domains (Step 2) that are characteristic of known secreted proteins; proteins in the second part (called ONLY13) can be annotated by other entries of PROSITE, PFAM, SMART, PRINTS and COG; proteins in the third part (called ONLY14) cannot be annotated by any entry in PROSITE, PFAM, SMART, PRINTS or COG.
    Most proteins are classified into ONLY12, ONLY13 and ONLY14. This is reasonable, because there are not many representative domains.
    The representative domain list and the protein classification information can be downloaded.
  8. On the browse page, short pink columns mark chromosomal fragments that have not been assembled but whose chromosome of origin is known (see also UCSC).
    Some adjacent loci may overlap each other. In this case, just click the location and browse the region in the "Loci Cluster" field, which shows the region at a small scale.
    Another issue is that 7% of the sequences do not occur in this map, because they could not be mapped to chromosomes owing to the low quality of the protein sequences or chromosomal assemblies. The search module can compensate for this to some extent.
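
  Steps 4 and 5 of the functional classification above can be sketched as follows. This is an illustration under stated assumptions, not the SPD code: the function name, the input layout and all IDs are hypothetical, and choosing the hit with the highest identity when several known proteins qualify is an assumption, since the text does not describe how such ties are resolved.

```python
def classify(blast_hits, domains, known_class, domain_class,
             min_identity=50.0, min_coverage=80.0):
    """Assign a novel secreted protein to one of the known classes.

    blast_hits   : list of (known_id, identity_pct, coverage_pct) tuples
    domains      : domain/motif names found in the novel protein
    known_class  : known protein ID -> class name
    domain_class : class-specific domain name -> class name (Step 3 entries)
    Returns the class name, or None if the protein stays unclassified.
    """
    # Step 4: similarity to a known secreted protein of a given class.
    qualifying = [
        hit for hit in blast_hits
        if hit[0] in known_class
        and hit[1] >= min_identity
        and hit[2] >= min_coverage
    ]
    if qualifying:
        # Assumption: prefer the hit with the highest identity, then coverage.
        best = max(qualifying, key=lambda hit: (hit[1], hit[2]))
        return known_class[best[0]]

    # Step 5: fall back to domains that stand for exactly one class.
    for domain in domains:
        if domain in domain_class:
            return domain_class[domain]

    return None


if __name__ == "__main__":
    known_class = {"K1": "cytokine"}          # hypothetical known protein
    domain_class = {"wnt": "wnt protein"}     # class-specific domain
    print(classify([("K1", 62.0, 91.0)], [], known_class, domain_class))
    print(classify([("K1", 45.0, 95.0)], ["wnt"], known_class, domain_class))
```

  Proteins for which the function returns None correspond to those that Step 7 divides into ONLY12, ONLY13 and ONLY14.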

What fields does an SPD protein web page contain?

The entry page is organized into one top bar and four groups:

  • Data source: SwissProt, TrEMBL, RefSeq, Ensembl or CBI Gene.
  • Name (official name): For SwissProt, TrEMBL and RefSeq, the name is parsed from the ucsc_kgXref table. For Ensembl, the name is retrieved by an Ensmart batch query.
  • Cross reference: For an Ensembl protein, this means the Ensembl transcripts, which are also retrieved by an Ensmart batch query. For a SwissProt or TrEMBL protein, it is parsed from the "DR" lines of the SwissProt format data file. For a RefSeq protein, it is parsed from the "DBSOURCE" field of the GenBank file. Finally, an HPI protein points directly to the local HPI website.
  • Description: Parsed from the original data source, if present. "Reference Number" is the number of papers that describe the protein, and it reflects how much is known about the protein.
  • GO assignment: For SwissProt, TrEMBL and RefSeq, this information is parsed from the original data source. Ensembl GO assignments are retrieved via an Ensmart query. If multiple GO entries are assigned to the protein, a schematic figure shows the relationship between them: whenever one entry is a sub-node (part_of/is_a) of another, a directed line connects the two. Note that the evidence behind a GO assignment is very important. In particular, cellular component information labeled IEA (inferred from electronic annotation) or NR (not recorded) tends to be less reliable.

  • Rank/category: see the pipeline and category descriptions above.
  • Domain architecture: A schematic figure is drawn from the PCAS annotation. First, the region with the most significant match (the lowest E-value) is drawn. Then the second most significant match is drawn if it does not overlap the regions already drawn. The dashed line reflects the length of the protein. InterPro entries are listed only if the alignment E-value is not greater than 0.05, and overlaps between InterPro families are not taken into consideration for this list, so more InterPro IDs may be listed than regions are drawn. Also note that PCAS displays only the five top hits by default, which is why this field sometimes displays more InterPro entries than PCAS does. This can be changed by clicking "adjust hit number limit for more hits".
  • Gene structure: Chromosomal coordinates are obtained by running Blat against UCSC HG16, mm4 and rn3, respectively. For human, an identity cutoff of 0.99 and a length coverage cutoff of 0.95 are used. For mouse and rat, the identity cutoff is lowered to 0.95 because the quality of the assemblies is lower. The red line represents the corresponding chromosomal fragment, which is labeled with its coordinates, length and genetic band. For human and mouse, if the fragment lies in a synteny region between human and mouse, this is also shown. The scattered black or grey bars represent exons; moving the mouse over a bar shows detailed information about that exon.
    A loci cluster is a group of adjacent loci (distance < 0.5 Mb) on the chromosome. Its legend has the form "ClusterID, coordinates, length".
  • Homolog Cluster: As shown in the pipeline, completely identical sequences are excluded after the initial collection. The SPD sequences are then grouped into thousands of clusters of similar sequences via pairwise blastp (overall identity cutoff 90%, overall length coverage 90%). This field may be divided into two sub-fields. One shows entries from the same species, which may be variants from different data sources or paralogs. The other shows entries from different species, which may be orthologs. Moving the mouse over an entry link shows the identity, coverage and E-value. In contrast to the final section, this field includes only highly similar (possibly homologous) sequences from the SPD core dataset.
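
  The drawing rule described under "Domain architecture" above amounts to a greedy selection of non-overlapping regions, which can be sketched as follows. The function name, the tuple layout and the example matches are hypothetical. Continuing the same rule beyond the second match, and applying the 0.05 E-value cutoff to the drawn regions as well as to the listed InterPro entries, are assumptions.

```python
def select_regions(matches, max_evalue=0.05):
    """Pick the regions to draw from a list of domain matches.

    matches: list of (name, start, end, evalue) tuples with inclusive
    coordinates. Matches are considered from most to least significant,
    and a match is kept only if it overlaps none of the regions already
    kept. The result is sorted by start position for drawing.
    """
    significant = [m for m in matches if m[3] <= max_evalue]
    kept = []
    for name, start, end, evalue in sorted(significant, key=lambda m: m[3]):
        no_overlap = all(end < k_start or start > k_end
                         for _, k_start, k_end, _ in kept)
        if no_overlap:
            kept.append((name, start, end, evalue))
    return sorted(kept, key=lambda m: m[1])


if __name__ == "__main__":
    # Hypothetical matches on one protein.
    matches = [
        ("A", 10, 100, 1e-30),
        ("B", 80, 150, 1e-10),   # overlaps A, so it is skipped
        ("C", 120, 200, 1e-5),   # does not overlap A, so it is kept
        ("D", 300, 350, 0.2),    # above the E-value cutoff
    ]
    for region in select_regions(matches):
        print(region)
```

  In this example the skipped match B would still count as a listed entry, which illustrates why more InterPro IDs can be listed than regions are drawn.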

  • SPD Cross-Reference: The third group displays similar sequences from the reference dataset. By default, only sequences with an overall identity of at least 90% are shown. On the page of a reference sequence, this field instead shows the core sequences similar to that entry, again with a default overall identity cutoff of 90%. In contrast to the following "Protein family" section, this field displays similarity between the core dataset and the reference dataset only, whereas the protein families come from pairwise comparison between all sequences, so similarity within the core dataset or within the reference datasets is shown there as well.

  • Protein family: The final group displays families based on tribe-mcl, which has been reported to handle problem sequences such as multidomain proteins and partial sequences. Pairwise blastp is run over the SPD core dataset and the nine reference datasets together (about 100000 sequences in total), the results are processed with tribe-mcl, and 12709 clusters are obtained. Of the 18152 SPD core entries, 17270 take part in this clustering. The blast E-value cutoff is set to 1e-10 and the tribe-mcl inflation value is set to 5, according to existing reports.
    In this field, a unique family_id is listed together with its members, ordered by data source. The general format is: serial number, protein_ID, description and data source. For SPD core entries, the protein name, rank value and species are shown. For Subloc entries, the cellular location and the organism are displayed.
    The cluster information can be downloaded.

Why do semi-automatic classification and fully automatic clustering exist at the same time?

  SPD has two classification mechanisms: a semi-automatic, domain-based classification method and a fully automatic, sequence-based clustering method (see also the construction pipeline and protein families).

  With the former, users can directly deduce the general function of a protein from its class. With the latter, the speed of tribe-mcl makes it possible to cluster the core dataset and the reference datasets together, which helps to predict function from the combination of all sequences. Another use is to check the overlap or coverage between the SPD core dataset and the nine reference datasets. In short, the two approaches are complementary and together provide as much information as possible.

How do I judge whether an SPD protein is a true positive?

  Several kinds of evidence are provided.

  1. Rank system: Proteins of Rank0 or Rank1 tend to be more convincing than those of Rank2 or Rank3 (see also pipeline).
  2. Category assignment: Proteins classified into one of the functional categories related to secreted proteins, that is, any category other than only 11, only 12, only 13 and only 14, are usually more convincing (see also category).
  3. Protein family: Users can base a judgment on the tribe-mcl clusters that combine the core dataset with the reference datasets. An SPD secreted protein is more reliable if it is grouped into a cluster that contains many entries from GO secreted proteins, the Riken mouse secretome, etc. Conversely, an SPD entry is likely to be a false positive if it is clustered with entries from negative control datasets such as TMPDB or NPD (see also Protein family).
  4. GO assignment: Proteins with a GO assignment such as "extracellular space" or "extracellular matrix" are likely to be true positives. By contrast, proteins with an assignment such as "integral to membrane" are likely to be false positives (see also go).
  5. Protein cluster/chromosomal loci cluster: If multiple similar proteins co-occur in SPD, they tend to be reliable. The same holds for entries from synteny regions (see also cluster and gene).

What are the main features or highlights of SPD?

  • SPD aims to be inclusive rather than exclusive. In other words, SPD tries to collect as many secreted proteins as possible, which is achieved in several ways. First, all distinct sequences are retained when redundant sequences are excluded, because it is difficult to judge which sequence is the correct variant; the clustering mechanism then helps to reveal such possible redundancy. Second, the rank system allows all proteins to be deposited in the database while still distinguishing their confidence. Finally, the nine reference datasets increase the coverage of SPD.
  • SPD has been tuned for biologists who are interested in identifying novel secreted proteins. First, mRNA or cDNA sequences are given in the "cross-reference" field. Second, the reference number shown in the "description" field reflects whether a protein is novel or not. All of this is available for batch download (see also cross-reference and description).

What are the limits or shortcomings of SPD?

  As described in the construction section, the SPD prediction pipeline is designed for the classical secretory pathway. It does not work for proteins that lack an N-terminal signal peptide, such as fibroblast growth factors, interleukins and galectins. However, only a limited number of proteins have so far been found to use such a non-classical pathway. In addition, the existence of rank0 (known SwissProt secreted proteins collected via keyword search and manual check) and the import of reference datasets (especially DBSubLoc, the known vertebrate secreted proteins, and the GOA dataset) may compensate to some extent. If you are interested in non-classical secreted proteins, the article "Feature based prediction of non-classical and leaderless protein secretion" may be helpful.

Can SPD data be downloaded?

  Yes, of course. Users can get most of the information in SPD from the Download section. Please read the "readme.txt" first.

   Additionally, users can get the sequences of interest from the retrieval page of the SPD text search.

Where can I find related resources?

  See "Protein localization and targeting" in the NAR database categories.

What shall I do if I run into a question not contained in this list?

  If you have questions or advice, please feel free to contact us via the email address at the bottom of each page. We will try to reply within one working day.

Copyright © Center for Bioinformatics, Peking University
If you have any questions, please do not hesitate to contact us via this mail address.