  ------------------------------------------------------------------------
                      The PROSITE database of protein families and domains
                                                               User Manual
                                                   Release 18.0, July 2003
  ------------------------------------------------------------------------

Amos Bairoch
Swiss Institute of Bioinformatics (SIB)
Centre Medical Universitaire (CMU)
1, rue Michel Servet
1211 Geneva 4
Switzerland

Telephone: +41-22-702 50 50
Fax: +41-22-702 58 58
Electronic mail address: prosite@isb-sib.ch
WWW server: http://www.expasy.org/

  ------------------------------------------------------------------------

Copyright notice

PROSITE is copyright. It is produced through a collaboration between the
Swiss Institute of Bioinformatics and the EMBL Outstation - the European
Bioinformatics Institute. There are no restrictions on its use by
non-profit institutions as long as its content is in no way modified. Usage
by and for commercial entities requires a license agreement. For
information about the licensing scheme see: http://www.isb-sib.ch/announce/
or send an email to license@isb-sib.ch.

The above copyright notice also applies to this user manual as well as to
any other PROSITE documents.

  ------------------------------------------------------------------------

Introduction

PROSITE is a method of determining the function of uncharacterized proteins
translated from genomic or cDNA sequences. It consists of a database of
biologically significant sites and patterns, formulated in such a way that,
with appropriate computational tools, it can rapidly and reliably identify
the known family of proteins (if any) to which a new sequence belongs.

In some cases the sequence of an unknown protein is too distantly related
to any protein of known structure to detect its resemblance by overall
sequence alignment, but it can be identified by the occurrence in its
sequence of a particular cluster of residue types which is variously known
as a pattern, motif, signature, or fingerprint. These motifs arise because
of particular requirements on the structure of specific region(s) of a
protein which may be important, for example, for their binding properties
or for their enzymatic activity. These requirements impose very tight
constraints on the evolution of those limited (in size) but important
portion(s) of a protein sequence. To paraphrase Orwell, in Animal Farm, we
can say that "some regions of a protein sequence are more equal than
others"!

The use of protein sequence patterns (or motifs) to determine the
function(s) of proteins is rapidly becoming one of the essential tools of
sequence analysis. This has been recognized by many authors, as illustrated
by the following citations from two of the best-known experts in protein
sequence analysis, R.F. Doolittle and A.M. Lesk:

 "There are many short sequences that are often (but not always)
diagnostics of certain binding properties or active sites. These can be set
into a small subcollection and searched against your sequence (1)".

 "In some cases, the structure and function of an unknown protein which is
too distantly related to any protein of known structure to detect its
affinity by overall sequence alignment may be identified by its possession
of a particular cluster of residues types classified as a motifs. The
motifs, or templates, or fingerprints, arise because of particular
requirements of binding sites that impose very tight constraint on the
evolution of portions of a protein sequence (2)."

Based on these observations we decided, in 1988, to actively pursue the
development of a database of patterns which would be used to search against
sequences of unknown function. This database, called PROSITE, contains a
few patterns which have been published in the literature, but the majority
have been developed by the author over the last ten years. Originally this
dictionary was conceived as part of the author's doctoral dissertation as
well as an integral part of the PROSITE program in the PC/Gene sequence
analysis software package. But, as many people have expressed their
interest in this project, we have decided to make this work available on
computer media.

There are a number of protein families as well as functional or structural
domains that cannot be detected using patterns due to their extreme
sequence divergence; the use of techniques based on weight matrices (also
known as profiles) allows the detection of such proteins or domains. In
1994 we started a collaborative project with Philipp Bucher to introduce
profiles in PROSITE. Currently, most of the new PROSITE entries are
centered around profiles and are developed by the PROSITE collaborators at
the Swiss Institute of Bioinformatics in Geneva and Lausanne.

____________________

1) Doolittle R.F.
   (In) Of URFs and ORFs: a primer on how to analyze derived amino acid
   sequences., University Science Books, Mill Valley, California, (1986).
2) Lesk A.M.
   (In) Computational Molecular Biology, Lesk A.M., Ed., pp17-26, Oxford
   University Press, Oxford (1988).

  ------------------------------------------------------------------------

Citation

If you want to refer to PROSITE in a publication you can do so by citing:

     Hofmann K., Bucher P., Falquet L., Bairoch A.
     The PROSITE database, its status in 1999
     Nucleic Acids Res. 27:215-219(1999).

  ------------------------------------------------------------------------

Feedback

We welcome any feedback. If you find errors, omissions, or if you want to
suggest new sites or patterns to be added to this dictionary, please let us
know. You can contact us (by electronic mail preferably) at the address
listed above.

  ------------------------------------------------------------------------

                             Table of contents

1. Methodology
     1.1. Methodology for the development of pattern entries
          1.1.1. Introduction
          1.1.2. Patterns from the literature
          1.1.3. Steps in the development of a new pattern
     1.2. Methodology for the development of profile entries

2. Conventions used in the database
     2.1. General structure
     2.2. Data file structure
          2.2.1. Structure of an entry
          2.2.2. Example of a pattern entry
          2.2.3. Example of a profile (matrix) entry
     2.3. The different line types
          2.3.1. The ID line
          2.3.2. The AC line
          2.3.3. The DT line
          2.3.4. The DE line
          2.3.5. The PA line
          2.3.6. The MA line
          2.3.7. The RU line
          2.3.8. The NR line
          2.3.9. The CC line
          2.3.10. The DR line
          2.3.11. The 3D line
          2.3.12. The DO line
          2.3.13. The termination line
     2.4. Documentation file structure

  ------------------------------------------------------------------------

                              1. Methodology

     1.1. Methodology for the development of pattern entries

          1.1.1. Introduction

In this section we will explain how we selected or developed the signature
patterns described in this compilation. Our first and most important
criterion is that a good signature pattern must be as short as possible,
should detect all or most of the sequences it is designed to describe and
should not give too many false positive results. In other words it must
exhibit both high sensitivity and high specificity.

          1.1.2. Patterns from the literature

A number of the patterns described in this dictionary have been published.
We have tested those patterns on the Swiss-Prot knowledgebase to see if the
signature pattern was still specific to the group or family of proteins
since the paper was published. If this was the case we used the published
pattern as such, otherwise we updated the pattern using methods similar to
those used to develop a new pattern and which are described in the
following sub-section.

          1.1.3. Steps in the development of a new pattern

We generally start by studying review(s) on a group or family of proteins.
We build an alignment table of the proteins discussed in that review. If
necessary we add to this table new published sequences relevant to the
subject under consideration. Using such alignment tables we pay particular
attention to the residues and regions thought or proved to be important to
the biological function of that group of proteins. These biologically
significant regions or residues are generally:

- Enzyme catalytic sites.
- Prosthetic group attachment sites (heme, pyridoxal-phosphate, biotin,
  etc.).
- Amino acids involved in binding a metal ion.
- Cysteines involved in disulfide bonds.
- Regions involved in binding a molecule (ADP/ATP, GDP/GTP, calcium, DNA,
  etc.) or another protein.

We then try to find a short (not more than four or five residues long)
conserved sequence which is part of a region known to be important or which
includes biologically significant residue(s). We call the pattern(s) created
at this stage the 'core' pattern(s). The most recent version of the
Swiss-Prot knowledgebase is then scanned with these core pattern(s). If a
core pattern detects all the proteins under consideration and none (or
very few) of the other proteins, we can stop at this stage and use the core
pattern as a bona fide signature. In most cases we are not so lucky and we
pick up a lot of extra sequences which clearly do not belong to the group
of proteins under consideration. A further series of scans, involving a
gradual increase in the size of the pattern, is then necessary. In some
cases we never manage to find a good pattern and we have to retry with a
core pattern from a different part of the sequence. It must also be noted
that we take particular care to avoid 'false' patterns. We will use an
example to describe what we call a 'false' pattern:

Let us assume that we have a partial alignment of three sequences around an
active site residue (in this example a histidine whose position is marked
with an asterisk) as shown below:

                    *
             ALRDFATHDDF
             SMTAEATHDSI
             ECDQAATHEAS

Here we would start scanning with a core pattern with the sequence A-T-H-[D
or E]. This pattern is small and would probably pick up too many false
positive results. According to the procedure outlined above, we would then
have to extend the core pattern. But in this case, any extension would be
artificial and group together residues which have different properties and
which are represented only once in a given position of the alignment. For
example, we could scan with the pattern [R, T or D]-[D, A or Q]-[F, E or
A]-A-T-H-[D or E]. This pattern would probably only pick up the sequences
which are in the alignment, but it would be biologically meaningless; there
is no consensus in the first three positions of the pattern and the pattern
does not even group residues with identical physicochemical properties.
Consequently, this pattern would probably fail to detect a new sequence
containing the same active site but having a different N-terminal sequence.
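
The difference between the two patterns can be checked with a few lines of
Python. This is only an illustrative sketch: the regular expressions are a
direct transcription of the two patterns above, and the fourth sequence is
a made-up example of a new family member, not a real protein.

```python
import re

# The three aligned sequences from the example above.
alignment = ["ALRDFATHDDF", "SMTAEATHDSI", "ECDQAATHEAS"]

# Core pattern A-T-H-[D or E] written as a regular expression.
core = re.compile(r"ATH[DE]")

# Artificially extended pattern
# [R, T or D]-[D, A or Q]-[F, E or A]-A-T-H-[D or E].
extended = re.compile(r"[RTD][DAQ][FEA]ATH[DE]")

# Hypothetical new sequence (an assumption for illustration): same active
# site, but different residues on the N-terminal side.
new_sequence = "GKWLSATHDNV"

sequences = alignment + [new_sequence]
core_hits = [bool(core.search(s)) for s in sequences]
extended_hits = [bool(extended.search(s)) for s in sequences]

for seq, c, e in zip(sequences, core_hits, extended_hits):
    print(f"{seq}  core={c}  extended={e}")
```

The core pattern finds all four sequences, whereas the extended pattern
finds only the three sequences it was built from.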


     1.2. Methodology for the development of profile entries

A profile or weight matrix (the two terms are used synonymously here) is a
table of position-specific amino acid weights and gap costs. These numbers
(also referred to as scores) are used to calculate a similarity score for
any alignment between a profile and a sequence, or parts of a profile and a
sequence. An alignment with a similarity score higher than or equal to a
given cut-off value constitutes a motif occurrence. As with patterns, there
may be several matches to a profile in one sequence, but multiple
occurrences in the same sequence must be disjoint (non-overlapping)
according to a specific definition included in the profile.

The profile structure used in PROSITE is similar to but slightly more
general than the one introduced by Gribskov and co-workers (3). Additional
parameters allow representation of other motif descriptors, including the
currently popular hidden Markov models. A technical description of the
profile structure and of the corresponding motif search method is given in
the file PROFILE.TXT included in each PROSITE release.

Profiles can be constructed by a large variety of different techniques. The
classical method developed by Gribskov and co-workers (4) requires a
multiple sequence alignment as input and uses a symbol comparison table to
convert residue frequency distributions into weights. The profiles included
in the current PROSITE release were generated by this procedure applying
recent modifications described by Luethy and co-workers (5). In the future,
we intend to apply additional profile construction tools including
structure-based approaches and methods involving machine learning
techniques. We also consider the possibility of distributing published
profiles developed by others in PROSITE format along with locally produced
documentation entries.

Unlike patterns, profiles are usually not confined to small regions with
high sequence similarity. Rather they attempt to characterize a protein
family or domain over its entire length. This can lead to specific problems
not arising with PROSITE patterns. With a profile covering conserved as
well as divergent sequence regions, there is a chance to obtain a
significant similarity score even with a partially incorrect alignment.
This possibility is taken into account by our quality evaluation
procedures. In order to be acceptable, a profile must not only assign high
similarity scores to true motif occurrences and low scores to false
matches. In addition, it should correctly align those residues having
analogous functions or structural properties according to experimental
data.

Profiles are supposed to be more sensitive and more robust than patterns
because they provide discriminatory weights not only for the residues
already found at a given position of a motif but also for those not yet
found. The weights for those not yet found are extrapolated from the
observed amino acid compositions using empirical knowledge about amino acid
substitutability. The effect of such a procedure is exemplified below.

Shown are a short alignment without gaps and the corresponding weighting
table derived with our standard method.

                  F   K   L   L   S   H   C   L   L   V
                  F   K   A   F   G   Q   T   M   F   Q
                  Y   P   I   V   G   Q   E   L   L   G
                  F   P   V   V   K   E   A   I   L   K
                  F   K   V   L   A   A   V   I   A   D
                  L   E   F   I   S   E   C   I   I   Q
                  F   K   L   L   G   N   V   L   V   C

          A     -18 -10  -1  -8   8  -3   3 -10  -2  -8
          C     -22 -33 -18 -18 -22 -26  22 -24 -19  -7
          D     -35   0 -32 -33  -7   6 -17 -34 -31   0
          E     -27  15 -25 -26  -9  23  -9 -24 -23  -1
          F      60 -30  12  14 -26 -29 -15   4  12 -29
          G     -30 -20 -28 -32  28 -14 -23 -33 -27  -5
          H     -13 -12 -25 -25 -16  14 -22 -22 -23 -10
          I       3 -27  21  25 -29 -23  -8  33  19 -23
          K     -26  25 -25 -27  -6   4 -15 -27 -26   0
          L      14 -28  19  27 -27 -20  -9  33  26 -21
          M       3 -15  10  14 -17 -10  -9  25  12 -11
          N     -22  -6 -24 -27   1   8 -15 -24 -24  -4
          P     -30  24 -26 -28 -14 -10 -22 -24 -26 -18
          Q     -32   5 -25 -26  -9  24 -16 -17 -23   7
          R     -18   9 -22 -22 -10   0 -18 -23 -22  -4
          S     -22  -8 -16 -21  11   2  -1 -24 -19  -4
          T     -10 -10  -6  -7  -5  -8   2 -10  -7 -11
          V       0 -25  22  25 -19 -26   6  19  16 -16
          W       9 -25 -18 -19 -25 -27 -34 -20 -17 -28
          Y      34 -18  -1   1 -23 -12 -19   0   0 -18

Note that at certain positions, a residue not occurring in the alignment
receives a higher score than one occurring in the alignment, as a result of
other residues at that position. Thus A occurring in the third column has a
lower score (-1) than M (+10) not occurring there but physicochemically
similar to L, I, V, F found in the other sequences. Similar extrapolation
procedures are used to derive position-specific insertion and deletion
scores which further enhance the selectivity of the profile.
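
How such a table is used to score a sequence can be sketched as follows.
The code below is a simplified illustration only: it sums the weights of an
ungapped, full-length alignment and ignores the insertion and deletion
scores, the cut-off and the normalization that a real profile search
applies (see PROFILE.TXT). The function name ungapped_score is ours.

```python
# Weighting table from above: one row per residue, one column per position.
TABLE = """
A -18 -10  -1  -8   8  -3   3 -10  -2  -8
C -22 -33 -18 -18 -22 -26  22 -24 -19  -7
D -35   0 -32 -33  -7   6 -17 -34 -31   0
E -27  15 -25 -26  -9  23  -9 -24 -23  -1
F  60 -30  12  14 -26 -29 -15   4  12 -29
G -30 -20 -28 -32  28 -14 -23 -33 -27  -5
H -13 -12 -25 -25 -16  14 -22 -22 -23 -10
I   3 -27  21  25 -29 -23  -8  33  19 -23
K -26  25 -25 -27  -6   4 -15 -27 -26   0
L  14 -28  19  27 -27 -20  -9  33  26 -21
M   3 -15  10  14 -17 -10  -9  25  12 -11
N -22  -6 -24 -27   1   8 -15 -24 -24  -4
P -30  24 -26 -28 -14 -10 -22 -24 -26 -18
Q -32   5 -25 -26  -9  24 -16 -17 -23   7
R -18   9 -22 -22 -10   0 -18 -23 -22  -4
S -22  -8 -16 -21  11   2  -1 -24 -19  -4
T -10 -10  -6  -7  -5  -8   2 -10  -7 -11
V   0 -25  22  25 -19 -26   6  19  16 -16
W   9 -25 -18 -19 -25 -27 -34 -20 -17 -28
Y  34 -18  -1   1 -23 -12 -19   0   0 -18
"""

# Map each residue letter to its list of ten position-specific weights.
weights = {}
for row in TABLE.split("\n"):
    fields = row.split()
    if fields:
        weights[fields[0]] = [int(value) for value in fields[1:]]


def ungapped_score(segment):
    """Sum the position-specific weights for a 10-residue segment."""
    if len(segment) != 10:
        raise ValueError("segment must be exactly 10 residues long")
    return sum(weights[residue][pos] for pos, residue in enumerate(segment))


# Third column (index 2): A occurs in the alignment, M does not.
print(weights["A"][2], weights["M"][2])
print(ungapped_score("FKLLSHCLLV"))
```

With this table, the first sequence of the alignment (FKLLSHCLLV) obtains
an ungapped score of 221.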

____________________

3) Gribskov M., McLachlan AD, Eisenberg D.
   Proc. Natl. Acad. Sci. U.S.A. 84:4355-4358(1987).
4) Gribskov M., Luethy R., Eisenberg D.
   Meth. Enzymol. 183:146-159(1990).
5) Luethy R., Xenarios I., Bucher P.
   Protein Sci. 3:139-146(1994).


                    2. Conventions used in the database

     2.1. General structure

The PROSITE database is composed of two ASCII (text) files. The first file
(PROSITE.DAT) is a computer readable file that contains all the information
necessary to programs that will scan sequence(s) with patterns and/or
matrices. The second file (PROSITE.DOC) contains textual information that
fully documents each pattern and profile. We must point out that we
strongly urge software developers to build software tools that make use of
both files. A list of patterns or profiles present in a sequence is not
very useful to biologists without the relevant documentation.


     2.2. Data file structure

          2.2.1. Structure of an entry

The entries in the database data file (PROSITE.DAT) are structured so as to
be usable by human readers as well as by computer programs. Each entry in
the database is composed of lines. Different types of lines, each with its
own format, are used to record the various types of data which make up the
entry. The general structure of a line is the following:

   Characters   Content
   ----------   ----------------------------------------------------------
   1 to 2       Two-character line code. Indicates the type of information
                contained in the line.
   3 to 5       Blank
   6 up to 128  Data

The currently used line types, along with their respective line codes, are
listed below:

   ID  Identification                     (Begins each entry; 1 per entry)
   AC  Accession number                   (1 per entry)
   DT  Date                               (1 per entry)
   DE  Short description                  (1 per entry)
   PA  Pattern                            (>=0 per entry)
   MA  Matrix/profile                     (>=0 per entry)
   RU  Rule                               (>=0 per entry)
   NR  Numerical results                  (>=0 per entry)
   CC  Comments                           (>=0 per entry)
   DR  Cross-references to Swiss-Prot     (>=0 per entry)
   3D  Cross-references to PDB            (>=0 per entry)
   DO  Pointer to the documentation file  (1 per entry)
   //  Termination line                   (Ends each entry; 1 per entry)

The maximal line length in the file is currently set to 128 characters.
However, except for the "MA" lines, no line extends beyond 78 characters.
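
The line structure described above lends itself to a very simple reader.
The following is a minimal sketch, not a reference parser: the function
names split_line and read_entries are ours, and the sample is an
abbreviated copy of the pattern entry shown in section 2.2.2.

```python
from collections import defaultdict


def split_line(line):
    """Return (line code, data) for one line of the data file."""
    code = line[:2]            # characters 1 to 2: line code
    data = line[5:].rstrip()   # characters 6 onwards: data
    return code, data


def read_entries(lines):
    """Yield one dict per entry, mapping line code -> list of data strings."""
    entry = defaultdict(list)
    for line in lines:
        code, data = split_line(line.rstrip("\n"))
        if code == "//":       # termination line: the entry is complete
            yield dict(entry)
            entry = defaultdict(list)
        elif code.strip():     # skip blank lines
            entry[code].append(data)


# Abbreviated sample entry (only some of the lines of PS00155).
sample = """\
ID   CUTINASE_1; PATTERN.
AC   PS00155;
DE   Cutinase, serine active site.
PA   P-x-[STA]-x-[LIV]-[IVT]-x-[GS]-G-Y-S-[QL]-G.
DO   PDOC00140;
//
""".splitlines()

entries = list(read_entries(sample))
print(entries[0]["ID"], entries[0]["DO"])
```

Keeping a list per line code allows for line types that may occur more than
once in an entry (for example DR or MA).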

Each of the line types is described in section 2.3 of this document.


          2.2.2. Example of a pattern entry

ID   CUTINASE_1; PATTERN.
AC   PS00155;
DT   APR-1990 (CREATED); NOV-1997 (DATA UPDATE); JUL-1998 (INFO UPDATE).
DE   Cutinase, serine active site.
PA   P-x-[STA]-x-[LIV]-[IVT]-x-[GS]-G-Y-S-[QL]-G.
NR   /RELEASE=40.7,103373;
NR   /TOTAL=12(12); /POSITIVE=12(12); /UNKNOWN=0(0); /FALSE_POS=0(0);
NR   /FALSE_NEG=0; /PARTIAL=0;
CC   /TAXO-RANGE=??EP?; /MAX-REPEAT=1;
CC   /SITE=11,active_site;
DR   Q10837, CUT1_MYCTU, T; Q50664, CUT2_MYCTU, T; O06318, CUT3_MYCTU, T;
DR   P41744, CUTI_ALTBR, T; P29292, CUTI_ASCRA, T; P52956, CUTI_ASPOR, T;
DR   Q00298, CUTI_BOTCI, T; P10951, CUTI_COLCA, T; P11373, CUTI_COLGL, T;
DR   Q99174, CUTI_FUSSC, T; P00590, CUTI_FUSSO, T; P30272, CUTI_MAGGR, T;
3D   1AGY; 1CEX; 1CUA; 1CUB; 1CUC; 1CUD; 1CUE; 1CUF; 1CUG; 1CUH; 1CUS; 1CUU;
3D   1CUV; 1CUW; 1CUY; 1CUZ; 1FFA; 1FFB; 1FFC; 1FFD; 1FFE; 1OXM; 1XZA; 1XZB;
3D   1XZC; 1XZD; 1XZE; 1XZF; 1XZG; 1XZH; 1XZJ; 1XZK; 1XZL; 1XZM; 2CUT;
DO   PDOC00140;
//


          2.2.3. Example of a profile (matrix) entry

ID   HSP20; MATRIX.
AC   PS01031;
DT   JUN-1994 (CREATED); JUN-1994 (DATA UPDATE); NOV-1995 (INFO UPDATE).
DE   Heat shock hsp20 proteins family profile.
MA   /GENERAL_SPEC: ALPHABET='ACDEFGHIKLMNPQRSTVWY'; LENGTH=97;
MA   /DISJOINT: DEFINITION=PROTECT; N1=2; N2=96;
MA   /NORMALIZATION: MODE=1; FUNCTION=GLE_ZSCORE;
MA    R1=239.0; R2=-0.0036; R3=0.8341; R4=1.016; R5=0.169;
MA   /CUT_OFF: LEVEL=0; SCORE=400; N_SCORE=10.0; MODE=1;
MA   /DEFAULT: MI=-210; MD=-210; IM=0; DM=0; I=-20; D=-20;
MA   /M: SY='R'; M=-12,-44,-11,-13,-13,-22,-2,-7,18,-12,5,-3,-11,0,21,-6,-5,-11,-16,-34;
MA   /M: SY='D'; M=1,-41,17,16,-41,-3,3,-11,-1,-22,-12,8,-7,12,-7,0,-2,-19,-53,-36;
MA   /M: SY='D';  M=2,-37,15,13,-36,2,5,-15,-3,-26,-17,10,-6,7,-10,3,2,-17,-53,-28;
MA   /M: SY='P'; M=1,-41,6,8,-38,-4,2,-20,9,-30,-14,6,13,9,8,3,0,-22,-48,-45;
MA   /M: SY='D'; M=2,-43,23,20,-42,2,9,-18,2,-30,-18,14,-5,14,-6,2,0,-21,-57,-35;
MA   /M: SY='D'; M=4,-34,9,8,-34,6,0,-17,5,-29,-14,8,-1,5,1,5,2,-17,-47,-38;
MA   /M: SY='F'; M=-28,-32,-38,-38,50,-42,-1,2,-11,6,-6,-21,-35,-27,-27,-24,-23,-14,-3,47;
MA   /M: SY='Q'; M=0,-33,-2,-7,-26,-9,-4,1,1,-10,1,-1,-5,2,0,-2,1,0,-44,-37;
MA   /M: SY='L'; M=-13,-36,-34,-37,23,-31,-21,28,-15,29,24,-24,-25,-24,-27,-20,-10,22,-33,0;
MA   /M: SY='K'; M=-8,-32,-5,-5,-19,-16,3,-11,13,-19,-2,1,-9,2,12,-3,-3,-15,-32,-28;
MA   /M: SY='L'; M=-10,-39,-30,-32,15,-26,-20,20,-16,27,20,-21,-20,-21,-27,-17,-9,16,-32,-5;
MA   /M: SY='D'; M=3,-48,33,27,-51,4,6,-19,0,-35,-22,18,-10,13,-13,2,0,-16,-65,-41;
MA   /I: MI=-55; MD=-55; I=-5;
MA   /M: SY='V'; D=-5; M=-3,-33,-23,-32,-5,-19,-21,28,-16,26,30,-17,-14,-15,-19,-12,-1,30,-48,-28;
MA   /I: MI=-55; MD=-55; I=-5;
MA   /M: SY='P'; D=-5; M=1,-2,-1,0,-3,0,0,-1,-1,-2,-2,0,4,0,0,1,0,-1,-4,-4;
MA   /I: MI=-55; MD=-55; I=-5;
..
... Some lines omitted..
..
MA   /M:  SY='K'; M=-11,-52,1,-1,-1,-17,2,-18,43,-28,3,9,-10,8,33,-2,-1,-23,-33,-43;
MA   /I: MI=*; MD=*; I=0;
NR   /RELEASE=40.7,103373;
NR   /TOTAL=181(180); /POSITIVE=176(175); /UNKNOWN=5(5); /FALSE_POS=0(0);
NR   /FALSE_NEG=0; /PARTIAL=4;
CC   /MATRIX_TYPE=protein_domain;
CC   /SCALING_DB=reversed;
CC   /AUTHOR=P_Bucher;
CC   /TAXO-RANGE=A?EP?; /MAX-REPEAT=2;
DR   P30223, 14KD_MYCTU, T; P46729, 18K1_MYCAV, T; P46730, 18K1_MYCIT, T;
DR   P46731, 18K2_MYCAV, T; P46732, 18K2_MYCIT, T; P12809, 18KD_MYCLE, T;
DR   P80485, ASP1_STRTR, T; O30851, ASP2_STRTR, T; P02497, CRA2_MESAU, T;
DR   P24622, CRA2_MOUSE, T; P24623, CRA2_RAT  , T; P15990, CRA2_SPAEH, T;
..
... Some lines omitted..
..
DR   P96193, IBPB_AZOVI, T; P29210, IBPB_ECOLI, T; P29778, OV21_ONCVO, T;
DR   P29779, OV22_ONCVO, T; Q06823, SP21_STIAU, T; P34328, YKZ1_CAEEL, T;
DR   P12812, P40_SCHMA , T;
DR   P81083, HS11_PINPS, P; P81161, HS2M_LYCES, P; P30220, HS3E_XENLA, P;
DR   Q9QUK5, HSB7_RAT  , P;
DR   Q29438, ODFP_BOVIN, ?; Q14990, ODFP_HUMAN, ?; Q61999, ODFP_MOUSE, ?;
DR   Q29077, ODFP_PIG  , ?; P21769, ODFP_RAT  , ?;
DO   PDOC00791;
//


     2.3. The different line types

This section describes in detail the format of each type of line used in
the database data file (PROSITE.DAT).

          2.3.1. The ID line

The ID (IDentification) line is always the first line of an entry. The
general form of the ID line is:

ID   ENTRY_NAME; ENTRY_TYPE.

The first item on the ID line is the entry name. This name is a useful
means of identifying an entry. The entry name consists of 2 to 21
uppercase alphanumeric characters. The characters that are allowed in an
entry name are: A-Z, 0-9, and the underscore character "_".

The second item on the ID line indicates the type of PROSITE entry.
Currently this can be one of the following:

 PATTERN
 MATRIX
 RULE

Examples:

ID   ADH_ZINC; PATTERN.
ID   SULFATATION; RULE.
ID   SH3; MATRIX.
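
As an illustration, the ID line format can be checked with a regular
expression. This is a sketch based only on the rules stated above; the
names ID_LINE and parse_id_line are ours.

```python
import re

# Entry name: 2 to 21 characters from A-Z, 0-9 and "_";
# entry type: PATTERN, MATRIX or RULE; the line ends with a period.
ID_LINE = re.compile(r"^ID   ([A-Z0-9_]{2,21}); (PATTERN|MATRIX|RULE)\.$")


def parse_id_line(line):
    """Return (entry name, entry type) or raise ValueError."""
    match = ID_LINE.match(line)
    if match is None:
        raise ValueError(f"not a valid ID line: {line!r}")
    return match.group(1), match.group(2)


print(parse_id_line("ID   ADH_ZINC; PATTERN."))
```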


          2.3.2. The AC line

The AC (ACcession number) line lists the accession number associated with
an entry. It is always the second line of an entry. Accession numbers
provide a stable way of identifying entries from release to release. It is
sometimes necessary for reasons of consistency to change the names of the
entries between releases.

An accession number, however, never changes. Accession numbers allow
unambiguous citation of database entries. Researchers who wish to cite a
PROSITE entry in their publications should always cite the accession number
of that entry in order to ensure that readers can find the relevant data in
a subsequent release.

The format of the AC line is:

AC   PSnnnnn;

Where 'PS' stands for PROSITE and 'nnnnn' is a five digit number. Example:

AC   PS00123;
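
Since accession numbers are the stable key of an entry, a program will
typically extract them first. A minimal sketch (the name parse_ac_line is
ours):

```python
import re

# 'PS' followed by exactly five digits, terminated by a semicolon.
AC_LINE = re.compile(r"^AC   (PS\d{5});$")


def parse_ac_line(line):
    """Return the accession number of an AC line or raise ValueError."""
    match = AC_LINE.match(line)
    if match is None:
        raise ValueError(f"not a valid AC line: {line!r}")
    return match.group(1)


print(parse_ac_line("AC   PS00123;"))
```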


          2.3.3. The DT line

The DT (DaTe) line shows the date of entry or last modification of the
entry. It is always the third line of an entry. The format of the DT line
is:

DT   MMM-YYYY (CREATED); MMM-YYYY (DATA UPDATE); MMM-YYYY (INFO UPDATE).

where:

   * MMM is the month and YYYY the year.
   * The first date indicates when the entry first appeared in the
     database.
   * The second date indicates when the 'primary' data of the entry was
     last modified. By this we mean the data relevant to the pattern,
     matrix, or rule being described in that entry.
   * The third date indicates when any data other than the 'primary' data
     has been modified.

Example:

DT   APR-1990 (CREATED); JUL-1990 (DATA UPDATE); JUL-1998 (INFO UPDATE).
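
The three dates can be extracted as (year, month) pairs as sketched below.
The code assumes that months are always written as three-letter uppercase
English abbreviations, as in the examples of this manual; the names
MONTHS, DT_LINE and parse_dt_line are ours.

```python
import re

# Assumed month abbreviations (three uppercase letters, English).
MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]

DT_LINE = re.compile(
    r"^DT   ([A-Z]{3})-(\d{4}) \(CREATED\); "
    r"([A-Z]{3})-(\d{4}) \(DATA UPDATE\); "
    r"([A-Z]{3})-(\d{4}) \(INFO UPDATE\)\.$"
)


def parse_dt_line(line):
    """Return a dict of (year, month) tuples for the three dates."""
    match = DT_LINE.match(line)
    if match is None:
        raise ValueError(f"not a valid DT line: {line!r}")
    groups = match.groups()
    keys = ("created", "data_update", "info_update")
    return {
        key: (int(groups[2 * i + 1]), MONTHS.index(groups[2 * i]) + 1)
        for i, key in enumerate(keys)
    }


dates = parse_dt_line(
    "DT   APR-1990 (CREATED); JUL-1990 (DATA UPDATE); JUL-1998 (INFO UPDATE)."
)
print(dates)
```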


          2.3.4. The DE line

The DE (DEscription) line provides descriptive information about the
content of the entry. It is always the fourth line of an entry. The format
of the DE line is:

DE   Description.

The description is given in ordinary English and is free-format.

Examples:

DE   Myb DNA-binding domain repeat signature 1.
DE   Iron-containing alcohol dehydrogenases signature.
DE   Zinc finger, C2H2 type, domain.
DE   Globins profile.


          2.3.5. The PA line

The PA (PAttern) lines contain the definition of a PROSITE pattern. The
patterns are described using the following conventions:

   * The standard IUPAC one-letter codes for the amino acids are used.
   * The symbol 'x' is used for a position where any amino acid is
     accepted.
   * Ambiguities are indicated by listing the acceptable amino acids for a
     given position, between square brackets '[ ]'. For example: [ALT]
     stands for Ala or Leu or Thr.
   * Ambiguities are also indicated by listing between a pair of curly
     brackets '{ }' the amino acids that are not accepted at a given
     position. For example: {AM} stands for any amino acid except Ala and
     Met.
   * Each element in a pattern is separated from its neighbor by a '-'.
   * Repetition of an element of the pattern can be indicated by following
     that element with a numerical value or a numerical range between
     parentheses. Examples: x(3) corresponds to x-x-x, x(2,4) corresponds
     to x-x or x-x-x or x-x-x-x.
   * When a pattern is restricted to either the N- or C-terminal of a
     sequence, that pattern either starts with a '<' symbol or respectively
     ends with a '>' symbol. In some rare cases (e.g. PS00267 or PS00539),
     '>' can also occur inside square brackets for the C-terminal element.
     'F-[GSTV]-P-R-L-[G>]' means that either 'F-[GSTV]-P-R-L-G' or
     'F-[GSTV]-P-R-L>' are considered.
   * A period ends the pattern.

Examples:

PA   [AC]-x-V-x(4)-{ED}.

This pattern is translated as: [Ala or Cys]-any-Val-any-any-any-any-{any
but Glu or Asp}

PA   <A-x-[ST](2)-x(0,1)-V.

This pattern, which must be in the N-terminal of the sequence ('<'), is
translated as: Ala-any-[Ser or Thr]-[Ser or Thr]-(any or none)-Val
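
The conventions listed above map closely onto regular expressions. The
following sketch converts a pattern, given as a single string, into a
Python regular expression. It is an illustration of the syntax rules only,
not the scanning method used by any PROSITE program; the function name
prosite_to_regex is ours, and patterns spread over several PA lines would
have to be joined by the caller first.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE pattern string into a Python regex string."""
    pattern = pattern.strip()
    if pattern.endswith("."):          # a period ends the pattern
        pattern = pattern[:-1]
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):    # anchored at the N-terminal end
            prefix, element = "^", element[1:]
        if element.endswith(">"):      # anchored at the C-terminal end
            suffix, element = "$", element[:-1]
        # Separate the element from an optional repetition: (n) or (n,m).
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        if match is None:
            raise ValueError(f"empty pattern element in {pattern!r}")
        core, low, high = match.groups()
        if core == "x":                            # any amino acid
            regex = "."
        elif core.startswith("[") and core.endswith("]"):
            residues = core[1:-1]
            if ">" in residues:                    # e.g. [G>]: G or the end
                regex = "(?:[" + residues.replace(">", "") + "]|$)"
            else:
                regex = "[" + residues + "]"
        elif core.startswith("{") and core.endswith("}"):
            regex = "[^" + core[1:-1] + "]"        # excluded amino acids
        else:
            regex = core                           # a single amino acid
        if low is not None:
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + regex + suffix)
    return "".join(parts)


regex = prosite_to_regex("<A-x-[ST](2)-x(0,1)-V.")
match = re.search(regex, "AGSTV")
print(regex, bool(match))
```

A negated class such as [^ED] also matches characters that are not amino
acid codes, so input sequences should be validated separately.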


          2.3.6. The MA line

The MA (MAtrix) lines contain the definition of a PROSITE profile (or
matrix) entry. The exact format and content of these lines are fully
described in a specific document (PROFILE.TXT) which is part of the PROSITE
distribution files.


          2.3.7. The RU line

The RU (RUle) lines contain the definition of a PROSITE rule entry. The
format of the RU line is:

RU   Rule_Description.

The rule is described in ordinary English and is free-format.


          2.3.8. The NR line

The NR (Numerical Results) lines contain information relevant to the
results of the scan with a pattern on the complete Swiss-Prot
knowledgebase. The format of the NR line is:

NR   /QUALIFIER=data; /QUALIFIER=data; .......

The qualifiers that are currently defined are:

  /RELEASE      Swiss-Prot release number and total number
                of sequence entries in that release.
  /TOTAL        Total number of hits in Swiss-Prot.
  /POSITIVE     Number of hits on proteins that are known to
                belong to the set in consideration.
  /UNKNOWN      Number of hits on proteins that could
                possibly belong to the set in consideration.
  /FALSE_POS    Number of false hits (on unrelated
                proteins).
  /FALSE_NEG    Number of known missed hits.
  /PARTIAL      Number of partial sequences which belong to
                the set in consideration, but which are not
                hit by the pattern or profile because they
                are partial (fragment) sequences.

The syntax of the /RELEASE qualifier is:

/RELEASE=nn,seq_num;

where 'nn' is a Swiss-Prot release number and 'seq_num' the total number of
Swiss-Prot entries in that release.

For all other qualifiers the syntax is:

/QUALIFIER=x(y);

or

/QUALIFIER=y;

where 'x' represents the number of hits and 'y' the number of sequences. In
the majority of pattern entries 'x' will be equal to 'y', but for those
patterns that are designed to detect domains that can be repeated more than
once in a given sequence (for example: zinc-fingers, EF-hand regions,
kringle domain, etc.), 'x' can be larger than 'y'. Such a situation is
described in the following example:

NR   /RELEASE=40.7,103373;
NR   /TOTAL=123(56); /POSITIVE=115(51); /UNKNOWN=5(2); /FALSE_POS=3(3);
NR   /FALSE_NEG=3; /PARTIAL=2;

In the above example the scan for the pattern (or profile) was done on
release 40.7 of Swiss-Prot, which contained 103373 sequence entries. The
pattern (or profile) was found 123 times in 56 different sequences
(/TOTAL). Out of those 123 'hits', 115 were produced by 51 sequences that
belong to the set under consideration (/POSITIVE), 5 hits were produced by
two sequences which could possibly belong to the set (/UNKNOWN) and 3 hits
were produced by 3 other sequences (/FALSE_POS). That particular pattern
missed 3 sequences (/FALSE_NEG) and there were two partial sequences that
belong to the set under consideration but which do not include the region
that contains that pattern (or profile) (/PARTIAL).

Note: for some degenerate patterns (as for example the N-glycosylation
consensus pattern), the NR lines are not provided as they would not yield
any useful information.
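
The NR qualifiers can be read into a small dictionary as sketched below.
The code works on the data part of the NR lines (that is, without the line
code) and stores each count as a (hits, sequences) pair, with None for the
number of hits when only 'y' is given. The names used are ours.

```python
import re

QUALIFIER = re.compile(r"/([A-Z_]+)=([^;]+);")
COUNT = re.compile(r"(\d+)(?:\((\d+)\))?")


def parse_nr(data_lines):
    """Return a dict mapping qualifier name -> parsed value."""
    result = {}
    for data in data_lines:
        for name, value in QUALIFIER.findall(data):
            if name == "RELEASE":
                # /RELEASE=nn,seq_num;
                release, seq_num = value.split(",")
                result[name] = (release, int(seq_num))
                continue
            match = COUNT.fullmatch(value)
            if match is None:
                raise ValueError(f"unexpected value for /{name}: {value!r}")
            first, second = match.groups()
            if second is None:
                result[name] = (None, int(first))        # /QUALIFIER=y;
            else:
                result[name] = (int(first), int(second))  # /QUALIFIER=x(y);
    return result


nr = parse_nr([
    "/RELEASE=40.7,103373;",
    "/TOTAL=123(56); /POSITIVE=115(51); /UNKNOWN=5(2); /FALSE_POS=3(3);",
    "/FALSE_NEG=3; /PARTIAL=2;",
])
print(nr["TOTAL"], nr["FALSE_NEG"])
```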


          2.3.9. The CC line

The CC (Comments) lines contain various types of comments. The format of
the CC line is:

CC    /QUALIFIER=data; /QUALIFIER=data; .......

The qualifiers that are currently defined are:

/TAXO-RANGE     Taxonomic range.
/MAX-REPEAT     Maximum known number of repetitions of the
                pattern or profile in a single protein.

There are 2 qualifiers specific to pattern or rule entries:

/SITE           Indication of an `interesting' site in a pattern.
/SKIP-FLAG      Indication of an entry that can be, in some
                cases, ignored by a program (because it is too
                unspecific).

There are 5 qualifiers specific to profile entries:

/MATRIX_TYPE    Describes the region of the protein identified
                by the profile.
/SCALING_DB     Scaling database used to calibrate the profile.
/AUTHOR         Author of the profile.
/FT_KEY         Feature key to describe the region covered by
                the profile.
/FT_DESC        Feature description of the region covered by
                the profile.
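
As an illustration of the CC line format, here is a minimal Python sketch
that splits CC lines into (qualifier, data) pairs. The function name
parse_cc_lines is hypothetical, and the two input lines are assembled from
qualifier examples given in this manual rather than copied from a real
entry. A list of pairs is returned instead of a dictionary because a
qualifier such as /SITE may occur more than once.

```python
def parse_cc_lines(lines):
    """Return a list of (qualifier, data) pairs found in CC lines."""
    pairs = []
    for line in lines:
        if not line.startswith("CC"):
            continue
        # Qualifiers are separated by ';' and each starts with '/'.
        for field in line[2:].split(";"):
            field = field.strip()
            if not field.startswith("/") or "=" not in field:
                continue
            name, data = field[1:].split("=", 1)
            pairs.append((name, data))
    return pairs

# Illustrative input only (not a real PROSITE entry).
cc = parse_cc_lines([
    "CC   /TAXO-RANGE=A?E??; /MAX-REPEAT=8;",
    "CC   /SITE=1,heme; /SITE=4,heme; /SITE=5,heme_iron;",
])
sites = [data for name, data in cc if name == "SITE"]
print(cc[0], sites)
```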


            2.3.9.1. The /TAXO-RANGE qualifier

This qualifier is used to indicate the taxonomic range of a pattern or
matrix. The syntax of that qualifier is the following:

/TAXO-RANGE=ABEPV;

where:

   * 'A' stands for archaea
   * 'B' stands for bacteriophages
   * 'E' stands for eukaryotes
   * 'P' stands for prokaryotes (bacteria)
   * 'V' stands for eukaryotic viruses

When the pattern or matrix entry has no relevance to one of the above
taxonomic classes a question mark ('?') replaces the corresponding letter
symbol. Example:

/TAXO-RANGE=A?E??

would be used in an entry relevant to proteins of archaeal ('A') and
eukaryotic ('E') origin.

Note: the /TAXO-RANGE qualifier does not take into account false positive
hits. For example: if a pattern produces one or more false positive hit(s)
on bacteriophage protein(s) but no true positive results were obtained on
any bacteriophage proteins, a question mark will be present instead of the
'B' in the second position of the /TAXO-RANGE qualifier.
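
The positional encoding described above can be decoded as follows. This is
a minimal Python sketch; the names TAXO_CLASSES and decode_taxo_range are
hypothetical. It assumes that the value always has exactly five positions,
in the order A, B, E, P, V.

```python
# Letter expected at each position and the taxonomic class it stands for.
TAXO_CLASSES = [
    ("A", "archaea"),
    ("B", "bacteriophages"),
    ("E", "eukaryotes"),
    ("P", "prokaryotes (bacteria)"),
    ("V", "eukaryotic viruses"),
]

def decode_taxo_range(value):
    """Return the names of the taxonomic classes flagged in the value."""
    value = value.rstrip(";")
    if len(value) != len(TAXO_CLASSES):
        raise ValueError("expected 5 positions, got %r" % value)
    classes = []
    for char, (letter, name) in zip(value, TAXO_CLASSES):
        if char == letter:
            classes.append(name)
        elif char != "?":
            # Only the expected letter or '?' is valid at each position.
            raise ValueError("unexpected %r in %r" % (char, value))
    return classes

print(decode_taxo_range("A?E??"))    # ['archaea', 'eukaryotes']
print(decode_taxo_range("ABEPV;"))   # all five classes
```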


            2.3.9.2. The /MAX-REPEAT qualifier

This qualifier is used to indicate the maximum number of times a given
pattern or profile has been found in a single protein sequence. The syntax
of that qualifier is the following:

/MAX-REPEAT=nn;

For example, in the CC lines of the pattern entry to detect an EF-hand
calcium-binding domain we have:

/MAX-REPEAT=8

This indicates that up to 8 copies of the EF-hand domain are known to be
present in at least one protein sequence.

Note: one should not assume that the value indicated by this qualifier is
equivalent to the maximum number of hits that will be obtained by the
pattern or profile being described; it is not uncommon that a pattern or a
profile will not detect all occurrences of a repeated domain.


            2.3.9.3. The /SITE qualifier

This qualifier is used to indicate the position of an 'interesting' site in
a pattern or a profile. For example, if a pattern includes an active site
residue, the /SITE qualifier will be used to indicate the position of that
residue in the pattern. The syntax of this qualifier is the following:

/SITE=nn,text_description;

where 'nn' is the position in the pattern or the profile of the site being
described and 'text_description' a textual description of that site.
Examples:

/SITE=3,active_site;
/SITE=5,disulfide;

Notes:

For pattern entries, the position numbering is indicated in pattern element
units. For example if we want to indicate that the 'C' in the pattern
'<A-[ILMV]-x(2,4)-A-C-P' is involved in a disulfide bond we would indicate
'/SITE=5,disulfide;', the 'C' being the fifth element in the pattern.

For profile (matrix) entries, the position numbering relates to match
positions.

If necessary there can be more than one /SITE qualifier in the CC line(s)
of an entry. For example in the pattern entry specific to proteins of the
cytochrome c family, the pattern 'C-{CPWHF}-{CPWR}-C-H-{CFWY}' has the
following /SITE qualifiers in its CC lines:

/SITE=1,heme; /SITE=4,heme; /SITE=5,heme_iron;

This indicates that the two 'C's are the residues that bind the heme group
and that the 'H' is an axial ligand to the heme iron.

If the presence of a site is assumed, but experimental data is lacking, a
'(?)' is appended at the end of the text description. For example if we
have the pattern 'A-x(2)-C-R' and the cysteine in that pattern is thought
to be involved in a disulfide bond, it would be indicated as:

/SITE=3,disulfide(?);
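
The numbering in pattern element units can be illustrated with a minimal
Python sketch. The function names are hypothetical, and the splitting rule
is a simplification: it assumes that elements are separated by '-' and that
'<' and '>' appear only as anchors at the ends of the pattern.

```python
def pattern_elements(pattern):
    """Split a pattern into its elements, dropping '<' and '>' anchors."""
    return [element.strip("<>") for element in pattern.split("-")]

def parse_site(value):
    """Split 'nn,text_description' into (position, description, assumed).

    'assumed' is True when the description ends with '(?)'.
    """
    position, description = value.rstrip(";").split(",", 1)
    assumed = description.endswith("(?)")
    if assumed:
        description = description[:-3]
    return int(position), description, assumed

def site_element(pattern, site_value):
    """Return the pattern element that a /SITE value points to."""
    position, description, assumed = parse_site(site_value)
    element = pattern_elements(pattern)[position - 1]  # positions start at 1
    return element, description, assumed

print(site_element("<A-[ILMV]-x(2,4)-A-C-P", "5,disulfide"))
print(site_element("C-{CPWHF}-{CPWR}-C-H-{CFWY}", "5,heme_iron"))
print(site_element("A-x(2)-C-R", "3,disulfide(?)"))
```

In each of the three cases taken from the text above, the element returned
is the residue named in the description: 'C', 'H' and 'C' respectively.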


            2.3.9.4. The /SKIP-FLAG qualifier

Some PROSITE entries such as those describing commonly found
post-translational modifications (a typical example is N-glycosylation) are
found in the majority of known protein sequences. While it is generally
useful to note their presence, some programs may want, in some cases, to
ignore those entries. For this purpose these entries are indicated with the
following qualifier in their CC lines:

/SKIP-FLAG=TRUE;


            2.3.9.5. The /MATRIX_TYPE qualifier

This qualifier describes the region in the protein identified by the
profile. Example:

/MATRIX_TYPE=protein_domain;

The matrix type can be protein_domain, repeat_region, localization_signal
or composition where:

   Protein_domain         Describes a profile directed against
                          a conserved region of a protein.
   Repeat_region          Describes a profile directed against
                          a run of repeat units.
   Localization_signal    Describes a profile directed against
                          a region important for the
                          localization of protein in the cell.
   Composition            Describes a profile directed against
                          a region of low complexity or
                          enriched in a given amino acid.


            2.3.9.6. The /SCALING_DB qualifier

This qualifier indicates which database was used to calibrate the profile.
Example:

/SCALING_DB=window20_shuffled;

Scaling databases currently used are:

   reversed             Is a protein database, randomized by
                        taking the reverse sequence of each
                        individual entry.
   window20             Is a protein database, locally shuffled
                        in windows of 20 residues.
   window20_shuffled    Is a small version of a window20
                        protein database.
   db_global            Is a protein database, globally
                        shuffled in windows of 20 residues.


            2.3.9.7. The /AUTHOR qualifier

This qualifier is used to indicate the author who created or updated the
profile. Example:

/AUTHOR=K_Hofmann, P_Bucher;

The first name is the author of the profile, the second one the author of
the last update.


            2.3.9.8. The /FT_KEY and /FT_DESC qualifiers

These qualifiers are used to give a computer-readable short description of
the region identified by the profile. They are based on the Swiss-Prot
Feature Table key and Feature Table description currently used to define
the region identified by the profile. Example:

/FT_KEY=DOMAIN; /FT_DESC=KRINGLE.

FT_KEY can be NP_BIND, MOTIF, DOMAIN, REPEAT, DNA_BIND or ZN_FING. More
details on feature keys and feature descriptions can be found in the
Swiss-Prot user manual.


          2.3.10. The DR line

The DR (Database Reference) lines are used as pointers to the Swiss-Prot
entries that are picked up (or missed) by the pattern being described in
the entry. The format of the DR line is:

DR   AC_NB, ENTRY_NAME, C; AC_NB, ENTRY_NAME, C; AC_NB, ENTRY_NAME, C;

where:

   * 'AC_NB' is the Swiss-Prot primary accession number of the entry to
     which reference is being made.
   * 'ENTRY_NAME' is the Swiss-Prot entry name.
   * 'C' is a one character flag that can be one of the following:

T   For a true positive.
N   For a false negative; a sequence which belongs to the
    set under consideration, but which has not been picked
    up by the pattern or profile.
P   For a 'potential' hit; a sequence that belongs to the
    set under consideration, but which was not picked up
    because the region(s) that are used as a 'fingerprint'
    (pattern or profile) is not yet available in the
    database (partial sequence).
?   For an unknown; a sequence which possibly could belong
    to the set under consideration.
F   For a false positive; a sequence which does not belong
    to the set under consideration.

Example:

DR   P10807, ADH_DROLE , T; P07162, ADH_DROMA , T; P00334, ADH_DROME , T;
DR   P09370, ADH1_DROMO, T; P09369, ADH2_DROMO, T; P07160, ADH2_DROMU, T;
DR   P12854, ADH1_DRONA, T; P07159, ADH_DROOR , T; P07158, ADH_DROPS , T;
DR   P07163, ADH_DROSI , T; P08074, AP27_MOUSE, T; P08088, BEN5_PSEPU, T;
DR   P07772, BEND_ACICA, T; P08694, BPHB_PSEPS, T; P14061, DHES_HUMAN, T;
DR   P12310, DHG_BACSU , T; P10528, DHGA_BACME, T; P07999, DHGB_BACME, T;
DR   P16232, DHII_RAT  , T; P15047, ENTA_ECOLI, T; P05406, FIXR_BRAJA, T;
DR   P05707, GUTD_ECOLI, T; P06234, NODG_RHIME, T; P06235, NODG_RHIMS, T;
DR   P15428, PGDH_HUMAN, T; P14697, PHBB_ALCEU, T; P00335, RIDH_KLEAE, T;
DR   P13859, TODD_PSEPU, T;
DR   P13203, DHG_THEAC , P;
DR   P14802, YRTP_BACSU, ?;
DR   P07161, ADH1_DROMU, N;
DR   P00805, ASPG_ECOLI, F; P13226, GALX_STRLI, F; P14373, RFP_HUMAN , F;
DR   P02788, TRFL_HUMAN, F; P08071, TRFL_MOUSE, F;

In the above example, we have pointers to 28 Swiss-Prot sequences which are
true positives ('T'), one which is a potential hit ('P'), one for a
sequence that may belong to the set under consideration ('?'), one which
has been missed ('N'), and five sequences that are false positives ('F').
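
Such counts can be obtained with a short program. The following minimal
Python sketch (the function name parse_dr_lines is hypothetical) assumes
that each reference consists of three comma-separated fields and ends with
a semicolon. Only the last six DR lines of the example are used as input.

```python
from collections import Counter

def parse_dr_lines(lines):
    """Return a list of (accession, entry_name, flag) tuples from DR lines."""
    refs = []
    for line in lines:
        if not line.startswith("DR"):
            continue
        for item in line[2:].split(";"):
            item = item.strip()
            if not item:
                continue
            # Entry names may be padded with blanks before the comma.
            accession, entry_name, flag = (f.strip() for f in item.split(","))
            refs.append((accession, entry_name, flag))
    return refs

refs = parse_dr_lines([
    "DR   P13859, TODD_PSEPU, T;",
    "DR   P13203, DHG_THEAC , P;",
    "DR   P14802, YRTP_BACSU, ?;",
    "DR   P07161, ADH1_DROMU, N;",
    "DR   P00805, ASPG_ECOLI, F; P13226, GALX_STRLI, F; P14373, RFP_HUMAN , F;",
    "DR   P02788, TRFL_HUMAN, F; P08071, TRFL_MOUSE, F;",
])
flag_counts = Counter(flag for _, _, flag in refs)
print(flag_counts)
```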


          2.3.11. The 3D line

The 3D (3D-structure) line is used to list the code(s) of the Protein Data
Bank (PDB) entries that contain structural data corresponding to the
sequence region described in a PROSITE entry. The format of the 3D line is:

3D   name; [name2;...]

Example:

3D   7WGA; 9WGA; 1WGC; 2WGC;


          2.3.12. The DO line

The DO (DOcumentation) line contains a pointer to the entry in the PROSITE
documentation file that describes the entry. The format of the DO line is:

DO   PDOCnnnnn;

where 'PDOC' stands for PROSITE DOCumentation and 'nnnnn' is a five digit
number. Example:

DO   PDOC00128;


          2.3.13. The termination line

The // (terminator) line contains no data or comments. It designates the
end of an entry.


     2.4. Documentation file structure

The PROSITE documentation file is an ASCII file. The maximum line length
has been set to 78 characters. The general format of a documentation entry
is the following:

  {PDOCnnnnn}
  {PSmmmmm; ENTRY_NAME}
  ..
  {BEGIN}
  Documentation text lines
  .
  ..
  {END}


   * The first line '{PDOCnnnnn}', where 'nnnnn' is a five digit number, is
     the documentation entry accession number.
   * The following lines '{PSmmmmm; ENTRY_NAME}' list the accession number
     and entry name of the PROSITE data file entries that correspond to
     the documentation entry.
   * The documentation text lines are in ordinary English and are
     free-format. The only restriction is that they do not start with the
     character '{'.
   * References to other PROSITE documentation entries are indicated as
     follows:

(see <PDOC00100>)

   * References to PDB entries are indicated as follows:

(see <PDB:1A4B>)

     or

(see  <PDB:1J5E; M>)

     where M is the name of a chain.

As an example, we show here a section of the documentation file that
contains two entries.

{PDOC00082}
{PS00087; SOD_CU_ZN_1}
{PS00332; SOD_CU_ZN_2}
{BEGIN}
***********************************************
* Copper/Zinc superoxide dismutase signatures *
***********************************************

Copper/Zinc superoxide dismutase (EC 1.15.1.1) (SODC) [1] is  one of the three
forms of an enzyme that catalyzes the dismutation of superoxide radicals. SODC
binds one atom each  of zinc and copper.  Various forms  of  SODC are known: a
cytoplasmic  form in  eukaryotes, an additional chloroplast form in plants, an
extracellular form in some  eukaryotes, and a periplasmic form in prokaryotes.
The metal binding sites are conserved in all the known SODC sequences [2].

We derived two signature  patterns for this family of enzymes:  the  first one
contains two  histidine residues that  bind the copper atom; the second one is
located in the C-terminal section of  SODC  and  contains a  cysteine which is
involved in a disulfide bond.

-Consensus pattern: [GA]-[IMFAT]-H-[LIVF]-H-x(2)-[GP]-[SDG]-x-[STAGDE]
                    [The two H's are copper ligands]
-Sequences known to belong to this class detected by the pattern: ALL.
-Other sequence(s) detected in Swiss-Prot: 5.

-Consensus pattern: G-[GN]-[SGA]-G-x-R-x-[SGA]-C-x(2)-[IV]
                    [C is involved in a disulfide bond]
-Sequences known to belong to this class detected by the pattern: ALL.
-Other sequence(s) detected in Swiss-Prot: NONE.

-Note: these patterns will not detect proteins related to SODC, but which have
 lost their catalytic activity, such as Vaccinia virus protein A45.

-Last update: July 1999 / Patterns and text revised.

[ 1] Bannister J.V., Bannister W.H., Rotilio G.
     CRC Crit. Rev. Biochem. 22:111-154(1987).
[ 2] Smith M.W., Doolittle R.F.
     J. Mol. Evol. 34:175-184(1992).
{END}
{PDOC00083}
{PS00088; SOD_MN}
{BEGIN}
******************************************************
* Manganese and iron superoxide dismutases signature *
******************************************************

Manganese  superoxide dismutase (EC 1.15.1.1) (SODM)  [1] is  one of the three
forms of an enzyme that catalyzes the dismutation  of superoxide radicals. The
four  ligands of  the manganese atom  are  conserved in  all  the  known  SODM
sequences.  These metal ligands are also conserved in the related iron form of
superoxide  dismutases [2,3].  We selected, as  a signature, a short conserved
region which includes two of the four ligands: an aspartate and a histidine.

-Consensus pattern: D-x-W-E-H-[STA]-[FY](2)
                    [D and H are manganese/iron ligands]
-Sequences known to belong to this class detected by the pattern: ALL.
-Other sequence(s) detected in Swiss-Prot: NONE.
-Last update: June 1992 / Text revised.

[ 1] Bannister J.V., Bannister W.H., Rotilio G.
     CRC Crit. Rev. Biochem. 22:111-154(1987).
[ 2] Parker M.W., Blake C.C.F.
     FEBS Lett. 229:377-382(1988).
[ 3] Smith M.W., Doolittle R.F.
     J. Mol. Evol. 34:175-184(1992).
{END}
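
The documentation file structure described in this section can be read with
a few lines of code. The following is a minimal Python sketch, not an
official parser; the names parse_doc_entries and find_references are
hypothetical, and the sample text is an abbreviated, illustrative entry.

```python
import re

def parse_doc_entries(lines):
    """Split documentation file lines into a list of entry dicts."""
    entries, current, in_text = [], None, False
    for raw in lines:
        line = raw.rstrip("\n")
        if line == "{END}" and current is not None:
            entries.append(current)
            current, in_text = None, False
        elif line == "{BEGIN}" and current is not None:
            in_text = True
        elif in_text:
            current["text"].append(line)   # free-format text line
        elif re.fullmatch(r"\{PDOC\d{5}\}", line):
            current = {"accession": line[1:-1], "data_entries": [], "text": []}
        elif current is not None:
            match = re.fullmatch(r"\{(PS\d{5}); (\S+)\}", line)
            if match:
                current["data_entries"].append(match.groups())
    return entries

def find_references(text):
    """Return (PDOC accessions, (PDB code, chain) pairs) cited in text."""
    pdoc = re.findall(r"<(PDOC\d{5})>", text)
    pdb = re.findall(r"<PDB:(\w{4})(?:;\s*(\w+))?>", text)
    return pdoc, pdb

# Abbreviated, illustrative entry (the text lines are not the real ones).
sample = """{PDOC00083}
{PS00088; SOD_MN}
{BEGIN}
Manganese superoxide dismutase text (see <PDOC00100>).
Structure (see  <PDB:1J5E; M>) and (see <PDB:1A4B>).
{END}""".splitlines()

entries = parse_doc_entries(sample)
pdoc_refs, pdb_refs = find_references("\n".join(entries[0]["text"]))
print(entries[0]["accession"], entries[0]["data_entries"], pdoc_refs, pdb_refs)
```

A PDB reference without a chain name is returned with an empty string in
place of the chain.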

  ------------------------------------------------------------------------