UniProtKB/Swiss-Prot Protein Knowledgebase
Swiss-Prot headline

Release 57.15 of 02-Mar-2010

Bacillus subtilis, a Gram-positive model bacterium fully annotated in UniProtKB/Swiss-Prot

We are all aware of the importance of model bacterial systems. Escherichia coli K12 is the paradigm for Gram-negative bacteria, but what of Gram-positive bacteria? There is a large variety of these bacteria that serve us, are neutral or infect us, and model systems for them are in demand.

Bacillus subtilis, a rod-shaped, soil- and water-dwelling bacterium originally described as Vibrio subtilis in 1835 by Ehrenberg and renamed in 1872 by Cohn, has served this role for over a century. B.subtilis differentiates to produce endospores, can be made naturally competent for DNA uptake and is a bacteriophage host. In the wild it has been seen to produce over 2 dozen different antibiotics. These characteristics make it an obvious choice as a model system for bacterial differentiation and genetics, as well as a model for other - often more dangerous - bacteria such as Bacillus anthracis, Mycobacterium tuberculosis or Staphylococcus aureus. Additionally, it is used for the production of various industrially interesting enzymes such as amylases and proteases. A substrain, B.subtilis natto, is used to prepare natto, a traditional Japanese dish made from fermented soybeans. Although B.subtilis is not considered pathogenic for any known organism, it has been isolated from patients suffering from various illnesses such as endocarditis and pneumonia, and also occasionally from spoiled food, where it might be responsible for cases of food poisoning.

The genome of B.subtilis 168, a widely used laboratory strain, was sequenced by a large international consortium in 1997 - the 6th bacterium to be fully sequenced. The sequence was updated and reannotated in 2009 by the Institut Pasteur and the Génoscope. In coordination with them we have annotated the complete proteome, providing all 4'192 B.subtilis proteins in UniProtKB/Swiss-Prot, each of which has a cross-reference to the dedicated B.subtilis database SubtiList/GenoList as well as other databases. A list of all B.subtilis UniProtKB/Swiss-Prot entries is available in the bacsu.txt file. This of course provides a snapshot of the knowledge about this first fully manually annotated Gram-positive model organism and will date easily. Despite having been so intently studied for so long, there are many B.subtilis proteins about which we know very little. There will be work for years to come for the B.subtilis (and larger scientific) community as these proteins and their homologues are characterized.

All B.subtilis entries can be retrieved from UniProtKB/Swiss-Prot combining the organism name "Bacillus subtilis" (or the taxonomy identifier 1423) with the keyword 'Complete proteome' (organism:"Bacillus subtilis" AND keyword:"Complete proteome" or organism:1423 AND keyword:181).
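The two equivalent queries quoted above can be assembled programmatically. The following is a minimal Python sketch: the helper names `clause` and `all_of` and the `BASE_URL` value are hypothetical and chosen only for illustration; only the field names, the values and the AND operator are taken from the text above.

```python
from urllib.parse import urlencode


def clause(field: str, value: object) -> str:
    """Build one field:value clause, quoting values that contain spaces."""
    text = str(value)
    return f'{field}:"{text}"' if " " in text else f"{field}:{text}"


def all_of(*clauses: str) -> str:
    """Combine clauses with the AND operator used in the text."""
    return " AND ".join(clauses)


# Query by organism name and keyword name.
by_name = all_of(clause("organism", "Bacillus subtilis"),
                 clause("keyword", "Complete proteome"))

# Equivalent query by taxonomy identifier and keyword identifier.
by_id = all_of(clause("organism", 1423), clause("keyword", 181))

# Hypothetical endpoint: replace with the real search address before use.
BASE_URL = "https://example.org/uniprot/"

print(by_name)
print(by_id)
print(BASE_URL + "?" + urlencode({"query": by_name}))
```

The sketch only builds and URL-encodes the query strings; it does not contact any server.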

Release 57.14 of 09-Feb-2010

Bornavirus: another viral stowaway in the human genome

Analysis of the human genome sequence has revealed that our 'book of life' is multi-authored. About 0.5% of human genes are derived from bacteria and 8% of our total genetic material results from viral infections (see also release 2.1 headline). These genomic viral "fossils" are ancient retroviruses, which are known to insert their genetic information into host chromosomal DNA. They do so by producing a DNA copy from their RNA genome by use of a viral enzyme, called reverse transcriptase. The viral DNA then integrates into the host genome, becoming a permanent part of the cell.

A recent Japanese study has unveiled another viral stowaway in the human gene pool. Several copies of the bornavirus N gene turn out to be part of the human genome and of other mammalian genomes, including chimpanzees, gorillas and African elephants. These genes are remnants of a bornavirus which presumably infected proto-hominids, and other species, some forty million years ago. This ancient virus has disappeared and nowadays bornaviruses are known to infect mainly horses, inducing neurological diseases.

This discovery came as a surprise since the bornaviral RNA genome is not known to be retrocopied into DNA at any stage of the viral replication cycle and never integrates into the host genome. This unusual integration into our ancestors' genome may have helped them survive against a pathogenic virus or may have played a role in primate evolution. As often in evolutionary biology, there are many more questions than answers, but this serves as a useful reminder that human evolution does not rely only on our own intrinsic potential, but also on a tight interaction with other living species in our environment.

A bornavirus-derived gene is actually expressed in human cells. It is called 'Endogenous Borna-like N element' (EBLN-1) and can be retrieved from UniProtKB/Swiss-Prot using the accession number Q6P2I7.

Release 57.13 of 19-Jan-2010

XMRV complete proteome in UniProtKB/Swiss-Prot

Despite the 118 human pathogenic viruses identified so far, our knowledge of these pathogens is still incomplete. Several human pathologies are suspected to be induced by unknown viruses. In this context, a new virus was isolated from human prostate in 2006 and was named 'Xenotropic Moloney murine leukemia virus-Related Virus' (XMRV). This retrovirus is the first representative of the gammaretrovirus genus to be isolated in humans. These retroviruses are known to induce various cancers in their host and a causal link with prostate cancer was suspected. Experimental support for this link has been reported but also contested, and it remains a matter of debate. The same virus has been recently associated with chronic fatigue syndrome (CFS): XMRV has been isolated in 4% of healthy subjects, and in 67% of CFS patients. Large scale epidemiological studies must be performed to establish with certainty whether these correlations are relevant.

Where did XMRV come from? Retroviruses identified in patients with CFS or prostate cancer are highly related (more than 90% DNA sequence identity) to a group of mouse viruses called xenotropic murine leukemia virus (MLV). Xenotropic MLVs are endogenous retroviruses, i.e. the viral DNA is stably integrated in the mouse genome. Mice produce low levels of the virus - a few infectious particles per ml of blood - but the virus cannot reinfect mouse tissues. Instead it spreads to other species, such as humans, which is the reason for the term 'xenotropic', meaning the virus can grow in species other than the species of origin. Therefore it makes sense to hypothesize that XMRV is a xenotropic MLV that crossed from mice to humans.

The mode of transmission of XMRV is largely unknown. It could be via transfusion, intravenous drug use, or by other blood-borne routes, but other modes of transmission (respiratory, sexual, etc.) cannot be excluded.

It will take time to answer the numerous questions raised by the discovery of XMRV. In terms of treatment, the good news is that some of the anti-retroviral drugs used for treating AIDS can immediately be tested for their efficacy against CFS. Indeed, susceptibility of XMRV to AZT has recently been demonstrated.

The complete proteome of XMRV has been annotated along with that of the well-studied MLV which is 65% (env) to 85% (gag-pol) identical and has served as a model for XMRV functional annotation.

Release 57.12 of 15-Dec-2009

Through the Looking-Glass

All amino acids but glycine can exist in either of two optical isomers, called L- or D-amino acids, which are mirror images of each other. However, we have been taught for decades that proteins that occur in nature are made out of L-forms. There are some well-known exceptions, of course, but restricted to prokaryotes. Indeed, D-forms are abundant components of the peptidoglycan cell walls of bacteria, and are also observed in bacterial natural antibiotics, such as actinomycin D or bacitracin. These latter are quite unusual peptides that are synthesized by multienzyme complexes in a stepwise fashion without the participation of mRNA. It has also been observed that the mammalian brain contains high levels of free D-serine which appears to be a physiological coagonist of N-methyl D-aspartate receptors (NMDARs) and, as such, may act as a neurotransmitter in the brain, but this activity is carried out by the amino acid itself and does not occur within the context of a polypeptide. The isolation, in the 1980s, of naturally occurring animal peptides containing D-amino acids challenged the dogma, leading to the discovery of a new post-translational modification (PTM): L- to D-isomerization.

In 1981, Montecucchi et al., looking for enkephalin-related peptides in various amphibia, isolated dermorphin from the skin of Phyllomedusa sauvagei. Dermorphin is produced by 2 different precursors: cleavage of Dermorphin-1 gives rise to 4 mature dermorphins and that of Dermorphin-2 to 5 mature peptides, all of which have the identical sequence: YAFGYPS. This heptapeptide binds with high affinity and selectivity to mu-type opioid receptors and appears to be a thousand times more potent than morphine in inducing deep long-lasting analgesia when injected into mice or rats. Interestingly, the second amino acid of dermorphin is D-alanine. A synthetic isomer, containing L-alanine at that position, is virtually devoid of biological activity.

This discovery was followed by many others. Deltorphins, another class of frog opioid peptides, also characterized by a D-amino acid at position 2, were isolated. Another amphibian, Bombina variegata, was shown to express antimicrobial D-amino acid-containing peptides, called bombinins H, on its skin. Arthropoda, such as spiders, lobsters and crayfish, and Mollusca entered the game. Cone snail peptide toxins have been extensively studied in this context and they currently represent 60% of all animal D-amino acid-containing proteins annotated in UniProtKB/Swiss-Prot. A single mammal appears on the list: platypus with 2 peptides, C-type natriuretic peptide 39 and Defensin-like peptide 2/4, expressed in its venom gland.

Animal D-amino acid-containing proteins are synthesized on ribosomes following a classical mRNA template; unusual codons have not been observed. In addition, some of them have been isolated from their biological source with both L- and D-amino acid at the appropriate position. These observations suggested that L- to D-amino acid isomerization is a bona fide PTM. An enzyme catalyzing the conversion of an Omega-agatoxin-Aa4b serine (at position 46 of the mature peptide, 81 in the precursor) from L- to D-form has been isolated from the funnel-web spider Agelenopsis aperta and its partial sequence is available in UniProtKB/TrEMBL. A similar mammalian activity has been characterized from platypus venom.

L- to D-amino acid isomerization presents significant advantages. The modified peptides become more resistant to protease degradation and hence much more stable. In addition, X-ray crystallography studies have shown that the isomerization creates new structures, such as peculiar beta-turns. The creation of these new structural elements seems crucial for interaction with specific partners, opiate receptors for instance, and may act as a switch that turns on protein activity.

L- to D-amino acid isomerization could be more frequent than initially thought. It cannot be predicted by software tools and is not detectable by any of the standard techniques used in proteomics. It was only discovered when a synthetic peptide with the same sequence of L-amino acids appeared to be biologically inactive. We could be facing a novel strategy of multicellular organisms to circumvent stereochemical limitations imposed by the genetic code in an effort to increase molecular diversity.

In UniProtKB, all D-amino acid-containing proteins can be retrieved using the keyword 'D-amino acid'. To restrict the search to animal proteins, add 'Metazoa' to the taxonomy field.

Release 57.11 of 24-Nov-2009

Why do we keep dubious sequences in UniProtKB? How to discard them from a protein set?

More than 99% of the protein sequences provided by UniProtKB come from the translations of coding sequences (CDS) submitted to the EMBL-Bank/GenBank/DDBJ nucleotide sequence resources. These CDS are either generated by the application of gene prediction programs to genomic DNA sequences or via the hypothetical translation of cloned cDNAs (see FAQ 37). These methods themselves provide varying degrees of support for the existence of a protein, which may be further supplemented in some cases by other types of evidence (such as mass spectrometry data or evidence from direct protein sequencing).

In July 2007, a new topic was introduced into UniProtKB to indicate the evidence for the existence of a given protein, called 'Protein existence' (PE). 5 levels of evidence have been defined: 1. evidence at protein level (e.g. clear identification by mass spectrometry), 2. evidence at transcript level (e.g. the existence of a putative coding cDNA), 3. inferred by homology (a predicted protein which has been assigned membership of a defined protein family in UniProtKB), 4. predicted (a predicted protein which has not yet been assigned membership of a defined protein family in UniProtKB) and 5. uncertain (e.g. dubious sequences, such as those derived from the erroneous translation of a pseudogene or non-coding RNA). Currently in UniProtKB/Swiss-Prot, the vast majority (71%) of the entries are found in the PE3 category. PE1 and PE2 each represent approximately 13% of the total number of entries, PE4 3% and PE5 only 0.3%.

Entries that are attributed an existence level of 5 (PE5) are also tagged with the term "Putative" in the 'Protein names' section (see for example Putative annexin A2-like protein) and, in the 'General annotation (Comments)' section, with a 'Caution' subsection warning the user of a possible problem. The 'Caution' subsections accompanying a PE5 entry usually are of the type: "Could be the product of a pseudogene", "Product of a dubious CDS prediction" or "Product of a dubious gene prediction".

The PE section is included in the UniProtKB search engine. It is thus possible to retrieve all entries corresponding to a defined PE level - and thereby exclude all PE5 proteins. For human proteins this can be achieved by searching for: (organism:"Homo sapiens (Human) [9606]" AND reviewed:yes) NOT existence:uncertain. This search allows the retrieval of 19'835 entries, indicating that "uncertain" proteins represent 2.4% of the total human entries. Currently PE5 entries represent only 0.3% of all UniProtKB/Swiss-Prot. The higher proportion of sequences identified as uncertain or dubious in Homo sapiens may be a product of the continuous manual curation and review of these sequences by groups of the CCDS consortium, such as HAVANA, as well as UniProt curators.
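The figures quoted above allow a rough back-of-the-envelope estimate of how many human entries the exclusion removes. The Python sketch below assumes that the 19'835 retained entries are the remaining 97.6% of the reviewed human set; the total and the number of PE5 entries it derives are estimates based on rounded percentages and are not figures stated in this headline.

```python
# Figures quoted in the text.
retained_entries = 19_835   # human entries left after "NOT existence:uncertain"
uncertain_share = 0.024     # "uncertain" proteins as a share of all human entries

# Assumption: retained entries = (1 - uncertain_share) * total human entries.
estimated_total = retained_entries / (1 - uncertain_share)
estimated_uncertain = estimated_total - retained_entries

print(f"estimated reviewed human entries: {estimated_total:.0f}")
print(f"estimated PE5 human entries:      {estimated_uncertain:.0f}")
```

Because 2.4% is itself a rounded value, the result should be read as an order of magnitude (a few hundred PE5 human entries) rather than as an exact count.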

One may ask the question: why not delete PE5 sequences from UniProtKB and provide only the most reliable sequences? As stated above, UniProt is continuously reviewing all protein sequences. This process can result in both the removal of some PE5 entries (for which evidence of pseudogenization is overwhelming, for instance) as well as the upgrade of other PE5 entries (such as the putative E.coli pseudogene ymiA, which has now been found to produce a protein product and has acquired a PE of 1, or the human mitochondrial ATP synthase subunit epsilon-like protein). However, many putative pseudogene sequences may be expected to remain in UniProtKB for some time as it can be difficult to prove the non-existence of a protein, and for certain loci some doubts may always persist. To give our users the opportunity to work on the most complete protein set, we have chosen to keep all PE5 sequences with the appropriate 'Caution' comments, leaving to the users the final decision whether to retrieve them or not (using the exclusion mechanism described above). Note that the sequences which are removed from UniProtKB can subsequently be retrieved from the UniParc archive if so desired.

Finally, please remember that the PE assignment is made at the level of the UniProtKB entry and not at the level of individual isoform sequences; hence, dubious alternative isoform sequences cannot be excluded from a protein set by the UniProtKB search engine. However, comments about the evidence supporting the existence of any given isoform can be found in the 'Note:' for that isoform in the 'Alternative products' section (which lists all protein isoforms for each entry). For instance, isoforms that have been identified only once through large scale sequencing are tagged with the comment "No experimental confirmation available". Note that UniProt may include isoforms that contain retained introns (as these may be physiologically relevant) as well as isoforms that contain a premature stop codon and thus could be the target for nonsense-mediated mRNA decay (NMD). The mechanism of NMD involves a first round of translation before the premature stop codon is detected (often referred to as "pioneer translation"), and so at least one protein is synthesized from each NMD target mRNA. In addition, some of the predicted NMD targets appear to be the most abundant isoforms in certain tissues (see for instance the human GABA-B receptor 1 isoform 1E).

For additional information, see the document describing the criteria used to assign the PE level of entries and the UniProtKB user manual.

Release 57.10 of 03-Nov-2009

What are UniProt 'Complete proteome' sets? How to retrieve them?

The need for users to access and download complete proteomes is unquestionable and the role of a database like UniProtKB is to meet this demand. The issue looks quite simple: there are more and more fully sequenced genomes. These genomes should contain at least minimal annotation, such as gene predictions, and translation of the predicted coding regions (CDSs) should provide a global perspective of the likely proteome of a given organism. The situation is actually more complex. The development of new sequencing techniques is generating a flood of data, which are often left as they have been produced. Databases have to deal with this ever-growing amount of data. The aim of this headline is to provide you with some tips on how we currently approach the problem, keeping in mind that the situation is rapidly evolving.

In order to give our users access to the proteomes of organisms whose genome has been fully sequenced, we have created the 'Complete proteomes' pages. Currently the proteomes of 1'428 organisms are available from these pages: 60% are bacteria, 30% viruses, 5.5% eukaryota and 4.5% archaea. Note that the term 'organism' is used in a broad sense and also includes strains or subspecies. Indeed, each completely sequenced strain is assigned a separate taxonomic identifier and is processed like an independent organism. A striking example of this approach is provided by Escherichia coli for which no fewer than 24 strain-specific proteomes can be downloaded separately.

A minority of the UniProt proteomes have been entirely manually reviewed and are found in UniProtKB/Swiss-Prot. These include 8 microbial (Methanocaldococcus jannaschii, 3 subspecies of Buchnera aphidicola, Escherichia coli (strain K12), Haemophilus influenzae, Mycoplasma genitalium and Mycoplasma pneumoniae) and 3 eukaryotic sets (Saccharomyces cerevisiae, Schizosaccharomyces pombe, and last, but not least Homo sapiens). The current sets are as stable as new discoveries allow. New proteins may be identified and will have to be annotated.

However, most proteome sets comprise 2 components, i.e. a manually reviewed protein set (Swiss-Prot) and an automatically annotated one (TrEMBL), and both are automatically combined to generate a non-redundant proteome. The proportion of Swiss-Prot versus TrEMBL entries is variable and depends upon the organism. For instance, 93% of the Bacillus subtilis proteome has been manually reviewed, while the reverse is true for Bacillus cereus for which 93% of the proteome is only automatically annotated and found in the TrEMBL section of UniProtKB. Note that the B.subtilis proteome will be fully in the Swiss-Prot section by the end of the year.

A third category of proteomes exists for organisms whose genomes have submission/annotation problems that prevent the production of a non-redundant protein set or have problems regarding the gene model predictions. These proteomes can be downloaded from Integr8 using the direct link provided on the 'Complete proteomes' pages. This concerns 38 organisms, including some important model organisms, such as Danio rerio (Zebrafish) and Chlamydomonas reinhardtii.

To be included in the 'Complete proteomes' pages, an organism must have a completely sequenced genome, i.e. fully closed and exhibiting either good gene prediction models or good quality transcriptome/proteome data. That is why for bacterial and archaeal genomes, whole-genome shotguns (WGS) and draft sequences are not included. However, we have to adapt to data availability, thus for fungi, WGS sequences are taken into consideration, as they often are the only available ones.

Another requirement is that all proteins in the set are mapped to the genome. The notable exception is the human proteome, which is as yet only partially mapped. It should be noted, however, that all human protein entries have been manually reviewed, thus ensuring they meet the UniProtKB/Swiss-Prot quality standards, and are continuously updated, allowing us to progressively increase the mapping to the genome (and to add many other interesting annotations).

All complete proteomes are available from the UniProt taxonomy resource. A direct link is provided from the UniProt homepage. In addition to providing the taxonomic information about a given species, these pages offer several options, such as the retrieval of all UniProtKB entries for a taxon (a set that may contain redundant entries) or the retrieval of the non-redundant complete proteome set (see for example the Dictyostelium discoideum (Slime mold) page), including the sets provided by the Integr8 resource. For the 1'390 complete proteomes entirely stored in UniProtKB, all entries have been tagged with the keyword 'Complete proteome' allowing their easy retrieval directly from the database, bypassing the taxonomy pages.
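The counts given in this headline can be cross-checked with a few lines of arithmetic. The Python sketch below uses only the figures quoted above (1'428 proteomes, 38 available through Integr8 only, and the percentage breakdown by kingdom); the per-kingdom counts it prints are approximations derived from rounded percentages and are not official figures.

```python
total_proteomes = 1_428   # proteomes listed on the 'Complete proteomes' pages
integr8_only = 38         # proteomes downloadable from Integr8 only

# Proteomes stored entirely in UniProtKB and tagged 'Complete proteome'.
tagged_in_uniprotkb = total_proteomes - integr8_only

# Percentage breakdown quoted in the text.
shares_percent = {
    "bacteria": 60.0,
    "viruses": 30.0,
    "eukaryota": 5.5,
    "archaea": 4.5,
}

# Approximate counts (an estimate: the percentages are rounded).
approx_counts = {
    kingdom: round(total_proteomes * pct / 100)
    for kingdom, pct in shares_percent.items()
}

print("tagged in UniProtKB:", tagged_in_uniprotkb)
for kingdom, count in approx_counts.items():
    print(f"{kingdom}: about {count}")
```

The difference between the two totals matches the 1'390 proteomes mentioned above as being entirely stored in UniProtKB.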

For complementary information, see FAQ.

If you have questions on that subject - or any other - do not hesitate to contact us.

Release 57.9 of 13-Oct-2009

Trichophyton tonsurans: an uninvited guest at the World Judo Championships

The World Judo Championships 2009 took place a few weeks ago in Rotterdam, the Netherlands. A sword of Damocles was hanging over this competition. Its name: Trichophyton tonsurans, a fungal parasite. Japanese academics have raised the alarm: the national sports of sumo and judo may decline because of the rapid spread of this skin-eating fungus. The infection is similar to athlete's foot. It is highly infectious and difficult to treat. It causes itchy red patches on the neck, face and upper body. It often affects the scalp and eventually attacks hair follicles, causing baldness. This distribution is consistent with areas of contact during the grappling that is at the heart of sumo and judo, suggesting that the fungus spreads by direct skin-to-skin contact.

Trichophyton tonsurans is just one member of a large family of fungal parasites, called dermatophytes. Dermatophytes are not opportunists, but true pathogens that infect nonliving, cornified layers of the skin, hair and nail in warm and moist environments suitable for proliferation. They are the most common agents of superficial mycoses.

The virulence of dermatophytes is largely due to the secretion of many different proteolytic enzymes. Their genomes encode dozens of secreted proteases. To improve the digestion efficiency of infected tissues, the pathogens secrete proteolytic "cocktails" composed of endo- and exoproteases.

During keratin degradation and digestion, dermatophytes also excrete sulphite through an efflux pump (SSU1). Sulphite reduces cystines, which are abundant in keratins, into cysteine and S-sulphocysteine. As a result, the proteins become more prone to hydrolysis by the secreted "protease cocktail". SSU1 may also play an additional role. Indeed, living in a cyst(e)ine-rich environment, such as the epidermal stratum corneum, hair and nails, may have the fatal drawback of sulphur toxicity. Thus, by excreting excess sulphur as sulphate and sulphite, the pump may also protect dermatophytes from poisoning.

Two large families of secreted endoproteases have been identified in dermatophytes: the subtilisin-like endoproteases SUB1 through SUB7 and the metalloproteinases, also called fungalysins. The exoproteases comprise dipeptidylpeptidases, such as DPP4 and DPP5, aminopeptidases, such as LAP1 and LAP2, as well as carboxypeptidases, such as MCPA, MCPB, SCPA and SCPB. All these proteins have been manually annotated and integrated into UniProtKB/Swiss-Prot.

Orthologous proteins have also been identified in Trichophyton rubrum, the predominant causative agent for superficial dermatomycosis, Arthroderma benhamiae, another dermatophyte triggering severe inflammatory responses in humans, Trichophyton equinum causing ringworm in horses, Nannizzia otae, also known as Microsporum canis, a common zoophilic fungal parasite, and several other less studied dermatophyte species.

In addition to virulence factors, the complete proteome of Nannizzia otae is now available in UniProtKB.

Dermatophytes are fascinating examples of evolutionary adaptation. These fungi have developed sophisticated weapons at our expense to achieve their goal: survival. Like David against Goliath, they have a good probability of winning the battle and sumo wrestlers may well lose their top-knots.

As of this release, 110 dermatophyte virulence factors have been manually annotated and integrated into UniProtKB/Swiss-Prot.

Release 57.8 of 22-Sep-2009

300'000 HAMAP cross-references in UniProtKB/Swiss-Prot

Bacteria and archaea can live in pretty much every environmental niche we know of. From the bottom of the ocean floor to arctic ice, from wastewater treatment sludge to animal- and plant-associated environments, bacteria and archaea are everywhere. As this diversity is explored, the number of bacterial (and to a lesser extent archaeal) genomes being sequenced is rising practically exponentially, giving rise to huge numbers of protein sequences that are annotated to varying degrees of quality. To use these data appropriately, however, quality annotation is essential. To supply it, we started the HAMAP project (High-quality Automatic and Manual Annotation of microbial Proteins) in 2000. In this project, proteins from complete bacterial and archaeal proteomes, together with related plastid proteins, are automatically annotated based on manually created annotation templates for complete protein annotation, with template-based feature propagation. The annotation templates and much more are available on the HAMAP website. As of January 2008 the sequences annotated by the HAMAP pipeline that fulfill all of its stringent criteria have been entering automatically into the Swiss-Prot section of UniProtKB (see release 54.7 news).

There are now 304'013 UniProtKB/Swiss-Prot entries with a HAMAP cross-reference line to at least one of the 1'595 HAMAP families; 278'635 are bacterial, 14'601 are archaeal and 10'777 are encoded in plastids. Note that some of these entries are the templates for their families (see for example P31120); they include extra information not propagated to all members (for example biophysical chemical characterization, mutagenesis experiments, 3D structures, induction and so on) that has allowed their use as models to annotate further entries (compare entry B6I1P9 containing propagated annotation based on the family rule MF_01554 with model entry P31120).
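The breakdown quoted above can be checked against the stated total. This short Python sketch uses only the three counts given in the text; the variable names are illustrative.

```python
# Entries with a HAMAP cross-reference, by origin, as quoted in the text.
hamap_entries = {
    "bacterial": 278_635,
    "archaeal": 14_601,
    "plastid-encoded": 10_777,
}

# The three categories should add up to the 304'013 entries reported.
total_hamap = sum(hamap_entries.values())
print("total:", total_hamap)

# Share of each category within the HAMAP-annotated entries.
for origin, count in hamap_entries.items():
    print(f"{origin}: {count / total_hamap:.1%}")
```

The sum confirms that the three categories account for all 304'013 entries, with bacterial entries making up the overwhelming majority.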

This large number of semi-automatically annotated entries means that nearly 60% of UniProtKB/Swiss-Prot consists of HAMAP entries; add to this the approximately 30'000 other bacterial and archaeal entries that are not members of a HAMAP family and you find that the total number of bacterial and archaeal entries in UniProtKB/Swiss-Prot begins to reflect their preponderance in nature...

Release 57.7 of 01-Sep-2009

Formyl peptide receptors: the missing link between olfaction and immune system

Olfaction plays a major role in the social life of many animals, including mammals, and in their interaction with the environment. In most mammals, the olfactory system has 2 components. 1) The main system is located in the nasal olfactory epithelium (OE) and detects environmental odors, such as those emitted by food and predators. 2) The accessory system is located in the vomeronasal organ (VNO) and detects pheromones. The VNO is linked directly to the brain's emotional centers, such as amygdala and hypothalamus, which control basic drives, hormonal levels, and instinctive behaviours, while OE signals are sent to higher cortical and limbic areas. As a result, signals conveyed by VNO trigger immediate reactions.

Recently, a new family of vomeronasal chemoreceptors has been identified, termed the formyl peptide receptors. In the mouse, 5 formyl peptide receptors are expressed in VNO: Fpr-rs1 (also called Fpr3), Fpr-rs3, Fpr-rs4, Fpr-rs6, and Fpr-rs7. Fpr-rs1, as well as another member of the family, Fpr1, have been previously shown to be expressed within granulocytes, monocytes and macrophages of the immune system. Their ligands include N-formyl-methionyl peptides (fMLP) released by Gram-negative bacteria, HIV-derived peptides, the antimicrobial peptide CRAMP, lipoxin A4, etc. Upon ligand recognition, these chemoreceptors stimulate chemotaxis of the immune cells to the site of infection or tissue damage.

Interestingly, VNO Fpr-rs respond to various degrees to most of the stimuli that affect their relatives in the immune system. Sensitivity to disease/inflammation-related ligands presents major advantages, such as the detection of spoiled food. Although Fpr-rs agonists are mostly produced in tissues and serum after inflammation, they are also present in some bodily fluids, such as urine. This could allow their olfactory detection by conspecifics, leading to the rapid isolation of sick individuals and hence minimizing the risk of disease spreading within a community.

As of this release, all 5 VNO Fpr-rs are annotated and available from UniProtKB/Swiss-Prot.

Release 57.6 of 28-Jul-2009

Microsporidian polar tube: a molecular syringe in UniProtKB/Swiss-Prot

Microsporidia are ubiquitous, obligate intracellular spore-forming fungal parasites which infect a wide range of invertebrates and vertebrates. They are common pathogens responsible for opportunistic infections in immunodeficient humans, such as HIV-infected patients or patients being treated with immunosuppressive drugs. The most common microsporidian associated with AIDS is Enterocytozoon bieneusi which induces chronic diarrhea in HIV-infected individuals. However, since no animal model for E.bieneusi is available, most of the experimental studies on microsporidia have been carried out on Encephalitozoon cuniculi. This microsporidium, which commonly infects rodents, has also been reported to infect humans. Its complete proteome is available in UniProtKB.

Microsporidia are primitive organisms lacking fundamental organelles found in other eukaryotes, such as stacked Golgi apparatus, peroxisomes or mitochondria. However, they have a mitochondrial relic organelle called the mitosome which does not contain any DNA. As a result, to persist in the environment, they have to parasitize the cells of higher organisms.

How do they achieve their goal? The microsporidian intracellular developmental cycle leads to a terminal sporogenic phase producing small spores which are critical for their host-to-host transmission. The unicellular spores have a resistant wall protecting a mononucleate or binucleate sporoplasm (the infectious apparatus of the spore) and an extrusion apparatus consisting of a single polar tube with an anterior attachment complex. Once the target cell is recognized, the polar tube acts as a syringe: it pierces the host cell membrane and rapidly "injects" the sporoplasm into the host cell.

3 polar tube proteins have been identified: PTP1, PTP2, and PTP3. The major polar tube protein, PTP1, accounts for at least 70% of the mass of the polar tube. Before the polar tube can act, the spore has to recognize the host cell and stick to its surface. This role is played by EnP1, which is involved in the adhesion of spores to host cell surface glycosaminoglycans. Orthologous proteins have been identified: EnP1, PTP1 and PTP2 in Encephalitozoon intestinalis and PTP1 and PTP2 in Encephalitozoon hellem. These 2 microsporidian species infect man and cause intestinal infections, keratoconjunctivitis and respiratory infections.

All these infectious proteins are available in UniProtKB with the following accession numbers:

Release 57.5 of 07-Jul-2009

New insights into drug development with Polyketide synthases

Polyketides are secondary metabolites produced by numerous organisms, from bacteria, fungi and plants to animals. Polyketides are structurally very diverse: thousands of different polyketides have already been discovered, and they possess a wealth of biological activities, including antimicrobial, antifungal and antiparasitic functions. They endow their producing organism with increased fitness in an environment full of competitors. And not only the producers! We also take advantage of these compounds and many are in commercial use as natural insecticides, cholesterol-lowering agents, antitumor drugs or immunosuppressors.

Polyketides are synthesized by an important family of enzymes, called polyketide synthases (PKSs). PKSs are large multifunctional proteins, with an average length of over 2'500 amino acids, up to almost 5'000 amino acids, often bearing several different catalytic activities. Polyketide biosynthesis proceeds by the assembly of simple blocks, such as propionyl-CoA, butyryl-CoA or acetyl-CoA, in a process that closely parallels fatty acid biosynthesis. The fascinating diversity of polyketides arises through various mechanisms: use of different starter molecules, different chain extension substrates, generation of chiral centers, functional group modifications, such as cyclization, etc.

The social amoeba Dictyostelium discoideum lives in the soil and feeds on a variety of bacteria and fungi. In its natural habitat, D. discoideum has several rivals, such as bacteria, nematodes, and Dictyostelium caveatum. However, this slime mold is not defenseless and it has been shown, for instance, to be able to repel nematodes "by secreting compounds". It appears today that D. discoideum has at least 40 functional PKS genes and 5 probable pseudogenes. This is the largest number of PKSs of all known genomes.

These proteins are very interesting. Understanding the exact enzymatic mechanisms in play and the role of each PKS module in the generation of polyketide diversity may allow us to engineer new PKSs that could produce new active compounds. The way may be paved for the discovery of fundamentally new types of drugs!

As of this release, all D. discoideum PKSs can be retrieved from UniProtKB/Swiss-Prot.

Release 57.4 of 16-Jun-2009

Dioxygenases: from antigenic variation to myeloid malignancies

Beta-D-glucopyranosyloxymethyluracil, also called base J, was the first hypermodified base to be identified in eukaryotic DNA, in 1993, in the nucleus of Trypanosoma brucei. Base J has been shown to be present in all kinetoplastids analyzed, in the related marine flagellate Diplonema and in Euglena gracilis, a unicellular alga closely related to the Kinetoplastida (see review). Not only is base J absent from a variety of other protozoa, fungi and vertebrates, but most organisms lacking base J contain DNA glycosylases attacking hydroxymethyldeoxyuridine (HOMedU), thus actively preventing the appearance of this intermediate of base J synthesis. Mammals even contain a highly active dedicated HOMedU glycosylase.

The biosynthesis of base J has been characterized. It requires 2 dioxygenases: JBP1 and JBP2. But the precise function of this DNA modification is not clear. It seems to play a role in Trypanosoma or Leishmania antigenic variation. It was first suspected to be involved in gene silencing, but this hypothesis lacks support. A current idea is that it may regulate homologous recombination at telomeres where most genes encoding Variant Surface Glycoproteins (VSG) are located.

Although the existence of such a complex DNA modification seemed unlikely in vertebrates, Tahiliani et al. (2009) performed a computational search and found JBP homologs throughout metazoans, including man where 3 homologs - TET1, TET2, TET3 - were identified. Human TET1 was unambiguously shown to be able to catalyze the conversion from 5-methylcytosine to 5-hydroxymethylcytosine (hmC). Was it just a pure exercise in style? Actually not. HmC is present in mouse embryonic stem cells. Moreover it appears to be quite abundant in mouse brain, where it constitutes up to 0.6% of total nucleotides in Purkinje cells and 0.2% in granule cells (Kriaucionis and Heintz, 2009).

Human TET1 has been known since 2002 to be involved in some acute leukemias, where it plays the role of the fusion partner of MLL in the translocation t(10;11)(q22;q23) (Ono et al., 2002). In the first months of 2009, several articles pointed at TET2 mutations that contribute to pathogenesis of a wide spectrum of myeloid malignancies, including myelodysplastic syndromes, myeloproliferative disorders, acute myeloid and chronic myelomonocytic leukemias.

A new exciting area of investigation is now open to understand the physiological function of TET/JBP family members, which may be quite crucial in view of the dramatic consequences of their mutation. As of this release, the manually annotated protein sequences of these enzymes are available from UniProtKB/Swiss-Prot: JBP1, including Trypanosoma cruzi isoenzymes JBP1A and JBP1B, JBP2, TET1, TET2 and TET3.

Release 57.3 of 26-May-2009

Rotavirus: a serial killer in UniProtKB/Swiss-Prot

Rotaviruses can infect humans, as well as other vertebrates. They cause severe diarrheal disease and dehydration of infants in both developed and developing countries. An estimated 0.6-0.8 million children aged 5 and under die from rotavirus-induced severe dehydrating diarrhea each year. Although mortality due to rotavirus infection is much higher in developing than in developed countries, infection frequency is remarkably similar. In temperate climates, rotavirus disease is seasonal, peaking in winter. Rotavirus gastroenteritis is transmitted by the fecal-oral route and characterized by watery stools, vomiting and fever. Commercially available vaccines are effective in preventing infection.

The virus infects the mature enterocytes of the small intestine and induces structural changes in the intestinal epithelium, secretion of a viral enterotoxin by infected cells, impaired absorption, cellular and tight junction damage and stimulation of intestinal motility, leading to watery diarrhea.

Rotaviruses have a segmented double-stranded RNA genome (dsRNA) protected by a three-layered capsid resistant to the acidic pH of the stomach. The genome is composed of 11 segments coding for about 12 proteins. One of its features is that its dsRNA genome is never completely uncoated during replication. Only the outermost layer is lost following entry into the host cell. Replication of the viral genome thus occurs within a protective shell to avoid detection and degradation by the host cell.

Seven different species of rotavirus have been described: A, B, C, D, E, F and G. Humans are primarily infected by species A, but also by species B and C. All seven species cause disease in other vertebrates. As of this release, sequences representative of all currently known rotavirus A, B, and C species have been annotated in UniProtKB/Swiss-Prot. This represents 480 entries, from 100 distinct strains, 40 of which are of human origin.

In addition to manually annotated sequences and functional information, we paid special attention to viral taxonomy. We decided to follow the recent recommendations for genome-based classification, in addition to the older antigenic classification system. This allows us to better reflect the frequent rearrangements (segment exchanges) that occur between strains. As a result, a detailed taxonomy is provided for each RNA segment/protein (see for instance Q3ZK61).

For more detailed information on rotaviruses, see the ViralZone portal.

Release 57.2 of 05-May-2009

Fission yeast: the third eukaryotic complete proteome in UniProtKB/Swiss-Prot

Schizosaccharomyces pombe, the fission yeast, was isolated in 1893 by P. Lindner from East African millet beer, for which it was named, 'pombe' meaning 'beer' in Swahili. The genus name reflects both its relationship to budding yeast (-saccharomyces), and the most striking feature that distinguishes it from other yeast species, i.e. reproduction by fission (Schizo-). Although both S.pombe and Saccharomyces cerevisiae are yeasts, they are genetically as divergent from each other as both are from man. Unlike S.cerevisiae, S.pombe did not acquire its fame for its beer-making talents - beer made with S.pombe seems to have quite an unsavoury acidic taste - but for the great scientific achievements its study permitted.

As mentioned above, fission yeast divides not by budding, but by medial fission, a process that resembles higher eukaryotic cell division. The organism grows exclusively through its cell tips and divides upon reaching the appropriate size, producing 2 daughter cells of equal size. Thus a simple measure of its length gives an estimate of which cell cycle phase the cell is in. This approach allows the isolation of cell cycle mutants (cdc), based on the presence of elongated cells due to continuous cell growth in the absence of cell division. This feature makes S.pombe a first-rate model organism to study cell division. The characterization of cdc mutants led to the discovery of cyclin-dependent kinases, which was eventually recognized by the 2001 Nobel Prize in Physiology or Medicine.

In 2002, S.pombe was the 6th eukaryotic organism to have its genome fully sequenced. As of this release, it is the 3rd eukaryotic organism, after S.cerevisiae and Homo sapiens, for which the complete proteome is available in UniProtKB/Swiss-Prot. This set represents 4'957 manually curated protein sequence entries, containing data from the scientific literature and numerous cross-references, including links to GeneDB_Spombe, the fission yeast community database. A list of all S.pombe UniProtKB/Swiss-Prot entries is available in the pombe.txt file.

The S.pombe proteome set we provide today is not a static one. We will keep revisiting and updating the entries as the science develops further. Analysis of S.pombe and S.cerevisiae proteins, coupled with phylogenetic studies, will allow the identification and annotation of homologous proteins in other organisms.

Release 57.1 of 14-Apr-2009

Hepatitis Delta virus, a living fossil virus of the old RNA world?

Hepatitis delta virus (HDV) is unique in virology, and has continued to fascinate since its discovery 30 years ago. HDV is a defective virus parasitizing hepatitis B virus (HBV) infected cells. Clinically, HDV infection results in more severe acute and chronic liver disease than that caused by HBV alone.

Only 1'680 nucleotides long, the HDV genome is the smallest known to infect man. The virus comprises one single gene, encoding the small Hepatitis Delta Antigen (S-HDAg). To compensate for this limited protein-coding capacity, HDV relies on a unique molecular mechanism to hijack host functions and the extraordinary dynamics of its RNA genome.

All known RNA viruses code for an RNA-dependent RNA polymerase to replicate/transcribe their genome, since eukaryotic host cells are unable to replicate RNA genomes. All but HDV; surprisingly, the S-HDAg seems to modify the activity of human DNA-dependent RNA polymerase II, turning it into an RNA-dependent RNA polymerase! Not only is this activity unique in molecular biology, but it also has many implications in the field of molecular evolution: life is thought to have started as RNA. HDV highlights the potential ability of human RNA polymerase II to switch back to an activity presumably forgotten for hundreds of millions of years.

HDV genome replication pushes this nostalgia for the ancient RNA world even further. Rolling circle genome replication produces a ssRNA composed of numerous repeats of the viral genome. All viruses known to use rolling circle replication rely on proteins to cleave the genome concatemer. All but HDV; cleavage occurs via an autocatalytic ribozyme activity encoded in the RNA genome.

HDV needs HBV co-infection only to borrow its capsid and budding mechanism. This function is carried out by a longer isoform of HDAg with an additional 19 to 20 amino acids (L-HDAg). Again HDV relies on a unique mechanism to produce this isoform; the genomic RNA is edited at one specific site by a human RNA adenosine deaminase (ADAR1). Somehow, edited genomes are unable to replicate, ensuring that the unedited version remains predominant.

The lesson from this quite unusual virus is that evolution does not always result in the creation of new tools, but sometimes it allows an existing tool to learn old and long forgotten tricks.

Release 57.0 of 24-Mar-2009

A New major release is available (57.0)

Release 57.0 of 24-Mar-09 of UniProtKB/Swiss-Prot contains 428'650 sequence entries, comprising 154'416'236 amino acids abstracted from 177'584 references. 36'053 sequences have been added since release 56.0, the sequence data of 2'010 existing entries have been updated and the annotations of 368'500 entries have been revised.

The following improvements were carried out in the last 8 months:

Release 57.0 and TrEMBL release 40.0 are included in UniProt Knowledgebase release 15.0.

Release 56.9 of 03-Mar-2009

Hush, Little Fly...

Most organisms slumber and so do flies. As in humans, caffeine or amphetamines keep them awake, while antihistamines make them fall asleep. Not surprisingly, prolonged sleep deprivation can lead to lethality. These and other similarities prompted researchers to use Drosophila melanogaster as a model organism to study the genetic basis of sleep.

Sleep is regulated by two main processes: circadian and homeostatic. The first says it is time to sleep, the second signals the need to rest, independently of the hour of the day. In July 2008, Koh et al. showed that in Drosophila, mutations in a single protein, Quiver, aptly renamed Sleepless by the authors, deeply perturb the homeostatic control. Loss of this protein causes an extreme reduction in sleep (>80%). About 9% of the flies don't sleep at all. Although the mutants had a shortened lifespan, they were still capable of flying and mating!

Quiver is thought to act through the regulation of the Shaker K+ channel, lowering membrane excitability by modulating its expression and activity. It could thus be a signaling molecule that links homeostatic sleep drive to neuronal excitability.

Although Quiver is well-conserved in other insect species and a potential ortholog has been identified in C.elegans, there are no obvious homologs in vertebrates. However, many members of the Shaker potassium channel family are known from yeast to humans.

If there is indeed a common mechanism for sleep control between humans and flies, one might envision relieving some forms of insomnia by acting on K+ channels. In the meantime, we advise you to keep counting sheep, or flies, when you can't get to sleep...

Quiver is now available in UniProtKB/Swiss-Prot and the first Protein Spotlight issue of this year has been devoted to this protein.

Release 56.8 of 10-Feb-2009

The UniProtKB/Swiss-Prot bronze medal is awarded to the plant Arabidopsis thaliana

With 7'764 manually annotated entries, Arabidopsis thaliana is now the third most represented species in UniProtKB/Swiss-Prot, behind Homo sapiens (human) and Mus musculus (mouse). This corresponds to about 25% of the complete proteome of A.thaliana, which can be retrieved from UniProtKB using the keyword 'Complete proteome'.

The members of the Plant Proteome Annotation Program (PPAP) are very proud of this third position. As shown in a study on Olympic medalists by Medvec et al. (1995), competitors who won the bronze medal are significantly happier with their award than those who won the silver medal. The silver medalists tend to be frustrated at having missed out on the gold, while the bronze medalists are simply happy to have received any honor at all.

Release 56.7 of 20-Jan-2009

UniPathway, a metabolic door to UniProtKB/Swiss-Prot

Due to the importance of using standardized nomenclature, annotations in UniProtKB/Swiss-Prot are progressively moving towards structured controlled vocabularies. In this context, the UniPathway project (a collaborative project involving the SIB and INRIA) aims at providing an extra resource dedicated to the exploration of metabolism using a structured controlled vocabulary for concisely describing the role of a protein in metabolism.

The metabolism of living organisms can be understood as a network of biochemical reactions, generally catalyzed by enzymes. Dealing with this network as a whole is a complex task and a classical approach is to divide it into more manageable segments, called pathways. This approach is always somewhat arbitrary and depends upon the final usage. Usually, a first level of segmentation is achieved on the basis of biological criteria. For instance, one could isolate the sub-network of all reactions involved in amino-acid biosynthesis or, more specifically, in L-lysine biosynthesis only, or even more specifically, in L-lysine biosynthesis via the AAA pathway. This results in a series of coarse- to fine-grained divisions (the coarsest is called a 'super-pathway').

Whenever possible, we further refine this first-level segmentation to a second-level one, in order to split the pathways into linear segments (i.e. sub-networks without branches) called 'sub-pathways'. Such a fine-grained segmentation allows representation of pathway variants. Indeed, depending on the organism (or set of organisms), the chemical route from one compound to another can differ. It is important to represent these variations within the same pathway since UniProtKB covers a large number of species. In addition, it offers a convenient way to label the enzymatic reactions that constitute a metabolic pathway by their relative position ('step') in the sub-pathway.

The role of a protein in metabolism is described in the 'Pathway' subsection of the 'General annotation (Comments)' section. The syntax is 'super-pathway; pathway; sub-pathway: step n/m'. For examples of metabolic pathway annotations, see: P49367, P38998 and P11454. In this last example, the biochemical reactions of the pathway are not yet known. P11454 was therefore only annotated at the level of the pathway.
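
The syntax described above lends itself to simple parsing. The following is a minimal Python sketch, not an official UniProt tool: the function name, the field names and the two example strings are illustrative assumptions, and real entries may contain variants that this sketch does not handle.

```python
import re
from typing import NamedTuple, Optional


class PathwayAnnotation(NamedTuple):
    """Fields of one 'super-pathway; pathway; sub-pathway: step n/m' annotation."""
    super_pathway: str
    pathway: Optional[str]
    sub_pathway: Optional[str]
    step: Optional[int]
    total_steps: Optional[int]


# Matches the last field, e.g. "some sub-pathway: step 3/5".
_STEP = re.compile(r"^(?P<sub>.*?):\s*step\s+(?P<n>\d+)/(?P<m>\d+)$")


def parse_pathway(text: str) -> PathwayAnnotation:
    """Split a pathway annotation into its levels (hypothetical helper)."""
    parts = [p.strip() for p in text.strip().rstrip(".").split(";") if p.strip()]
    if not parts or len(parts) > 3:
        raise ValueError(f"unexpected pathway annotation: {text!r}")

    super_pathway = parts[0]
    pathway = parts[1] if len(parts) > 1 else None
    sub_pathway = step = total = None

    if len(parts) == 3:
        match = _STEP.match(parts[2])
        if match:
            sub_pathway = match.group("sub").strip()
            step, total = int(match.group("n")), int(match.group("m"))
        else:
            # Sub-pathway given without a step position.
            sub_pathway = parts[2]

    return PathwayAnnotation(super_pathway, pathway, sub_pathway, step, total)


# Illustrative strings only; they are not quoted from specific entries.
full = parse_pathway(
    "Amino-acid biosynthesis; L-lysine biosynthesis via AAA pathway; "
    "L-alpha-aminoadipate from 2-oxoglutarate: step 3/5."
)
pathway_only = parse_pathway("Siderophore biosynthesis; enterobactin biosynthesis.")
```

An annotation made only at the pathway level, as in the last example mentioned above, simply leaves the sub-pathway and step fields empty.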

In the current version of UniProtKB/Swiss-Prot, close to 81'500 entries are annotated with the UniPathway controlled vocabulary. The UniProt web site supplies direct links to the UniPathway web server that provides more detailed information on pathways, sub-pathways and biochemical reactions.

Release 56.6 of 16-Dec-2008

GeneCards: yet another means to get human gene chromosomal location

UniProtKB aims to be a central hub for biological information on proteins. While the protein sequence is described in depth at the residue level in the 'Sequence annotation (Features)' section of UniProtKB/Swiss-Prot entries, the general context in which the protein exists and functions (mostly provided in the 'General annotation (Comments)' section) is kept at a general interest level. Users interested in more detailed information are invited to deepen their knowledge by looking into the original publications (in the 'References' section) and making use of the numerous cross-references, mostly found in the 'Cross-references' section that is becoming larger and larger with each release.

In the current release, we have added cross-references to GeneCards. This database focuses on human genes. The information provided by GeneCards is automatically extracted from more than 50 databases, some of which are manually annotated, such as OMIM and UniProtKB/Swiss-Prot. While much of the information provided by GeneCards overlaps with that found in UniProtKB/Swiss-Prot, it also contains additional data which complement our annotations.

GeneCards indicates very precisely the chromosomal location of each gene, not only at the chromosome (sub)bands, but also at the level of base pairs, clearly indicating from which end of the chromosome the position is calculated (see for instance ATP10A). This type of information is not currently provided directly in UniProtKB/Swiss-Prot entries, but can be accessed through links to other databases, such as Ensembl and now GeneCards. Note, however, that we provide a complete list of all human proteins, chromosome by chromosome, on the 'human-centric' page on the ExPASy server. For each chromosome, the list can be downloaded from the UniProt ftp site (see for instance all proteins encoded on chromosome 1).

Release 56.5 of 25-Nov-2008

The plastid: the most important organelle!

The world is full of plastids. Most of us know the green photosynthetic chloroplast which houses the machinery that fixes CO2 (with O2 as a "mere" by-product) and synthesizes sugars, lipids, amino acids, etc.; in short, the basis of our food chain. Found in plants and algae, chloroplasts are absolutely essential to life as we know it.

Plastids contain DNA; they are the remnants of a cyanobacterium that was engulfed by a eukaryotic heterotroph which had previously engulfed an alphaproteobacterium which eventually became the mitochondrion. These are primary endosymbiotic events; the organism that was taken up by the host was not digested but survived in the cytoplasm, eventually transferring genes to the host nucleus and being in effect enslaved. Most of these transferred gene products are imported back into their respective organelles using transit peptides. Plastid genomes now contain between 28 and 250 protein-coding genes. The primary plastid endosymbiosis gave rise to 3 lineages: green algae, red algae and the glaucophytes. Subsequent engulfment of green or red algae by other eukaryotes has given rise to secondary endosymbionts, which in some cases have been engulfed again, sometimes with plastid replacement, to give an array of tertiary endosymbionts. These secondary and tertiary events gave rise to (among others) cryptophytes, diatoms, heterokont algae and apicomplexa, organisms such as Plasmodium that are no longer photosynthetic. To further complicate matters, it was thought that there were only 2 primary endosymbiotic events; recent work, however, on a thecate amoeba, Paulinella chromatophora, has cast doubt on this assumption.

Due to their small size, plastids are easily sequenced. A list of fully sequenced plastid genomes, their genes and the nomenclature of known plastid-encoded proteins can be found in our document plastid.txt.

In UniProtKB, we indicate whether a protein is encoded by plastid, mitochondrial or plasmid DNA in the 'Names and origin' section, 'Encoded on' subsection (OG line in the flat file). 6 categories have been created for plastids:

Currently, in UniProtKB/Swiss-Prot, there are close to 11'000 entries encoded by a plastid genome; 10'130 by chloroplasts, 145 by cyanelles, 142 by non-photosynthetic plastids, 18 by apicoplasts, 22 by chromatophores and 165 by unspecified types of plastids.
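
A tally of this kind can be approximated from the OG lines of a flat file. The Python sketch below assumes that plastid OG lines take the form 'OG   Plastid; <type>.' and that an unspecified plastid appears as 'OG   Plastid.'; both the exact line vocabulary and the sample records are assumptions made for illustration, and the function names are hypothetical.

```python
from collections import Counter
from typing import Optional


def plastid_category(line: str) -> Optional[str]:
    """Return the plastid type named on an OG line, or None for any other line.

    Assumes OG lines such as 'OG   Plastid; Chloroplast.' (format is an assumption).
    """
    if not line.startswith("OG "):
        return None
    fields = [f.strip() for f in line[2:].strip().rstrip(".").split(";")]
    if fields[0] != "Plastid":
        return None  # e.g. mitochondrion- or plasmid-encoded
    if len(fields) > 1 and fields[1]:
        return fields[1]
    return "Unspecified plastid"


def count_plastid_categories(flat_file_text: str) -> Counter:
    """Count plastid OG lines per category (one count per OG line, not per entry)."""
    counts = Counter()
    for line in flat_file_text.splitlines():
        category = plastid_category(line)
        if category is not None:
            counts[category] += 1
    return counts


# Hypothetical, heavily abbreviated records used only to exercise the functions.
SAMPLE = """\
ID   HYPOTHETICAL_1
OG   Plastid; Chloroplast.
ID   HYPOTHETICAL_2
OG   Plastid; Cyanelle.
ID   HYPOTHETICAL_3
OG   Mitochondrion.
ID   HYPOTHETICAL_4
OG   Plastid.
ID   HYPOTHETICAL_5
OG   Plastid; Chloroplast.
"""

counts = count_plastid_categories(SAMPLE)
```

On a complete flat file, the per-category counts would be expected to resemble the figures quoted above, provided the assumed line format holds.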

Release 56.4 of 04-Nov-2008

One thousand legs and a few toxins

Have you ever faced an elongated and dorso-ventrally flattened arthropod? If yes, it could have been a scolopendra or one of its cousins of the "numberless feet" family, i.e. the Myriapoda subphylum. If you were lucky enough not to be stung, you avoided intense local or radiating pain, redness, edema, local hyperthermia, superficial necrosis, or even systemic symptoms such as nausea, emesis, sudoresis, anxiety and depression.

What is the cause of these symptoms? Information about scolopendra venom composition is very limited, probably due to the lack of severe systemic symptoms and fatalities in adults. However, in 2007, a group of researchers studied the neglected group of scolopenders (see Rates et al., 2007), using a structure-to-function proteomic approach in order to better understand the complexity of the venoms of two Brazilian scolopendra species: Scolopendra viridicornis nigra and Scolopendra angulata. 23 proteins have been characterized and their N-termini sequenced. As of this release, they are all available in UniProtKB/Swiss-Prot.

Release 56.3 of 14-Oct-2008

The Swiss Institute of Bioinformatics celebrates its 10th anniversary

The Swiss Institute of Bioinformatics (SIB), one of the 3 founder members of the UniProt Consortium, was established 10 years ago, on 30th March 1998, thanks to the enthusiasm and dedication of a small number of outstanding Swiss scientists. In these past 10 years, the SIB has evolved into a federation of 25 research and service groups based in 5 locations in Switzerland: Basel, Berne, Geneva, Lausanne and Zurich. It comprises a total of close to 300 members affiliated to the best universities and institutes of Switzerland.

The SIB has 3 main missions: research, services and training. It develops and maintains databases, such as UniProtKB/Swiss-Prot (in collaboration with the EBI and PIR), PROSITE, SWISS-2DPAGE, CleanEx, SWISS-MODEL Repository and STRING (in collaboration with the EMBL). It also creates and supplies software for the global life science research community, such as Melanie, MSight and SWISS-MODEL. It manages several bioinformatics core facilities that provide informatics and statistical support, services or advice to life scientists, thus enabling them to conduct their research projects and analyse the resulting data. The SIB is also responsible for a number of bioinformatics courses, which are part of the undergraduate curriculum of Swiss universities, as well as a Doctoral School open to graduate students.

The SIB 10th anniversary was celebrated during the whole year with various events throughout Switzerland, such as conferences and exhibitions for the public at large. However, a landmark was reached on September 24th with a one-day conference, followed by a gala dinner peppered with music and speeches. Last, but not least, Zoltán Kutalik has been awarded the first annual SIB Young Bioinformatician Award. The award is for the "Ping Pong" code, published in Nature Biotechnology (May 2008), which makes it possible to virtually check human cell lines for their sensitivity to thousands of drugs.

Happy anniversary SIB and many happy returns!

Release 56.2 of 23-Sep-2008

Additional bibliography information in UniProtKB

As a comprehensive and high-quality resource of protein sequence and functional information, UniProtKB strives to provide thorough literature citations associated with protein sequences and their characterization. Currently about two thirds of the UniProtKB PubMed citations are found in UniProtKB/Swiss-Prot, as a result of active integration in the course of manual curation.

In order to keep up with the explosive growth of literature and to give our users access to additional publications, we decided to integrate additional sources of literature from other annotated databases into UniProtKB. For this purpose we selected 5 external databases: Entrez Gene (GeneRIFs), SGD, MGI, GAD and PDB, and extracted citations that were mapped to UniProtKB entries. This additional bibliography is available from the 'References' section by clicking on 'Additional computationally mapped references'.

This procedure allowed the addition of about 283'000 PubMed citations in close to 110'000 UniProtKB entries. 85% of these references did not exist previously in UniProtKB.

In the future, we plan to apply this pipeline to more databases that could be used as sources of protein bibliography, including model organism databases, such as FlyBase and WormBase. We believe this additional protein bibliography information will allow our users to better explore the existing knowledge of their proteins of interest.

Release 56.1 of 02-Sep-2008

First draft of the complete human proteome available in UniProtKB/Swiss-Prot

The UniProt consortium is pleased to announce that a manually annotated representation of all the currently known human protein-coding genes is available in this release of UniProtKB/Swiss-Prot. This represents 20,325 entries. More than a third of these contain additional sequences representing isoforms generated by alternative splicing, alternative promoter usage and/or alternative translation initiation, resulting in close to 34,000 human protein sequences. Approximately 46,000 single amino acid polymorphisms (SAPs), mostly disease-linked, are also described, as well as 60,000 post-translational modifications (PTMs) (for additional statistics, click here).

It is not the first time that UniProtKB/Swiss-Prot has provided a fully annotated proteome set for a model organism (for example E.coli or S.cerevisiae) and there are many more planned in the near and more distant future (A.thaliana, B.subtilis, D.discoideum, mouse, rice, S.aureus, S.pombe, etc.). But we do not expect that there will ever be anything as important as this proteome. For the first time, we can present to the life sciences community a clean set of what we believe to be a full (although still imperfect!) representation of human proteins. It is the ultimate goal of the life sciences to fully understand Homo sapiens at the molecular level and we hope this set will significantly contribute to this extraordinary adventure.

There are still many challenging tasks in front of us. We will create entries for newly discovered human proteins, review and update the existing set, increase the number of splice variants, explore the full range of PTMs and continue to build a comprehensive view of protein variation in the human population. The characterization at the molecular level will need to be placed in its physiological context: subcellular location, tissue expression, protein/protein interaction, etc. And last but not least, we all want to understand the role of all these actors of our life processes.

The way is paved, but the road will be long before we fully understand life at a molecular level.

Release 56.0 of 22-Jul-2008

A New major release is available (56.0)

Release 56.0 of 22-Jul-08 of UniProtKB/Swiss-Prot contains 392'667 sequence entries, comprising 141'217'034 amino acids abstracted from 172'036 references. 36'631 sequences have been added since release 55.0, the sequence data of 605 existing entries has been updated and the annotations of 356'036 entries have been revised.

The following improvements were carried out in the last 5 months:

Release 56.0 and TrEMBL release 39.0 are included in UniProt Knowledgebase release 14.0.

After almost one year of beta testing, the UniProt consortium is proud to announce the release of its new official unified website: a new interface, a new search engine and many new options to serve you better. The content of the various databases we provide is unchanged, except for all the improvements we keep carrying out with each new release. Many documents are available on the Documentation/help page, including FAQs. However, don't hesitate to contact us for any further questions, remarks or update requests.

Release 55.6 of 01-Jul-2008

Transient pleasures of the mind

Symmetry and round objects, including round numbers, easily fascinate the human mind. Thus, UniProtKB is happy to announce that we have a double set of round numbers to celebrate: UniProtKB/Swiss-Prot now contains over 50'000 cross-references to PDB and over 5'000 mammalian entries with experimental 3D-structures.

It is deeply satisfying to see the 3D-structure of a protein. 3D-structures show the interactions between proteins and other macromolecules, and between proteins and small ligands, such as metal ions, substrates and inhibitors. Determining the 3D-structure is an important step for elucidating the mode of action of a well-characterized protein, and it provides a starting point for the classification of an uncharacterized protein and the prediction of its physiological role.

UniProtKB provides access to protein 3D-structures via cross-references to PDB (see for example P00734). The number of structures is constantly increasing, and quite frequently several structures have been determined for a given protein. Thus, the 50'000 cross-references to PDB in UniProtKB/Swiss-Prot correspond to more than 12'700 individual entries. Over 5'000 of these (about 40%) are from mammalian model organisms, including close to 3'300 human entries, while bacteria and archaea account for over 4'500 of the entries with links to PDB. Escherichia coli strain K12 is currently the best studied organism at the structural level, with 1'035 out of its 4'339 proteins (almost 25%) having at least one link to a PDB entry. Close to 6'000 additional links to PDB are in UniProtKB/TrEMBL, corresponding to another 3'500 entries.

Thanks to the efforts of individual laboratories and structural proteomics groups, the number of experimental 3D-structures is rapidly increasing, and so the symmetrical roundness of the present numbers is a very transient phenomenon. Soon for every new protein there may be a family member with an experimental 3D-structure, even for membrane proteins. That is definitely something to look forward to.

Release 55.5 of 10-Jun-2008

Over 100 cross-references in UniProtKB/Swiss-Prot

UniProtKB/Swiss-Prot was the first biomolecular database to include cross-references in its entries. As of this release, we provide our users with 101 explicit links (stored in the various distributed file formats, flat text, XML and RDF/XML) and 23 implicit links (available only from web servers, such as UniProt and ExPASy). Most cross-references can be found in the 'Cross-references' section of the entry (see for example Q9FK25); some are in the 'Sequence annotation' section (the Feature table in the flat file) (see for example cross-references to dbSNP in Q969T7). The dbxref.txt document provides a list of the databases cross-referenced in UniProtKB/Swiss-Prot. This document is available on the UniProt website and by ftp.

Additional links pepper almost every section of a UniProtKB/Swiss-Prot entry. They include cross-references to PubMed, which are located in the 'References' section (see for example P0A790), and cross-references to the ENZYME database, available through the EC numbers in the 'Names and origin' section (see for example Q00955). Moreover, the 'Web resources' section is dedicated to databases or web pages that are specific for a single protein (see for example P04637). Note that the dbxref.txt document does not list these 'special' links.

Historically, a 'hundred' was a geographic division referring to the amount of land sufficient to sustain one hundred families. With over 120 cross-references, we hope to sustain many more research groups in quest of protein information.

Release 55.4 of 20-May-2008

Swiss-Prot in the Wonderland of protein names

Successful basic research requires various skills from scientists, not only creativity, but also precision, critical analysis of experimental results, reconsideration of the starting hypotheses, continuous controls and days, nights and weekends of - sometimes tedious - work in the lab. Thus when proteins are eventually purified, genes are cloned and a nice story is wrapped around the data, one of the rewards is to name the proteins/genes. There lies the fun.

Telling names can be useful for remembering a function or a phenotype. Interaction of Drosophila Cleopatra mutants with the asp gene product is lethal. Indeed, Cleopatra, Ancient Egypt's queen, allegedly committed suicide by way of an asp bite. Groucho mutants have more bristles than the norm on their face, much like Groucho Marx. Ken and Barbie protein mutants lack external genitalia... In Arabidopsis thaliana, Superman mutants have extra stamens (male genitals) in their flowers, and fans of the famous cartoon will not be surprised to learn that Kryptonite protein suppresses the function of Superman.

Acronyms are another part of the naming game. You would expect the RING1 protein to have a specific 3D structure related to its name, round, for instance. Actually, RING stands for "Really Interesting New Gene". In the same vein, you would not expect POSH to be any ordinary protein and yet all it contains are "Plenty Of SH3" domains! JAK1 kinase has two phosphate-transferring domains and was named after Janus, the Roman god of gates, usually depicted with two heads looking in opposite directions. However, JAK is also said to be 'Just Another Kinase', one among the hundreds of essential kinases described so far. And last, but not least, the Drosophila INDY protein refers to the movie "Monty Python and the Holy Grail", in which a live person about to be buried rightly protests: 'I'm Not Dead Yet!', which is hardly surprising since mutations in this gene result in a near doubling of the average adult life-span. For more amazing protein/gene names, see the excellent website established by Mikael Niku and Mikko Taipale.

Scientific creativity can be somewhat hampered by economic actors. The Pokemon oncogene for instance - which stands for POK erythroid myeloid ontogenic factor - had to be withdrawn after the US branch of the Japanese video-game franchise Pokémon threatened researchers with legal action. The protein ended up with the far more sober - not to say boring - name of 'Zinc finger and BTB domain-containing protein 7A' (ZBTB7A).

Much ink has been spilled over the lack of standardization of protein names. Inconsistency among orthologs, family members and so on makes the systematic search through the literature a complicated task. UniProt provides a few guidelines for protein naming. Such a document should help to improve consistency, keeping a given protein's 'hypokeimenon', while not curbing creativity!

Release 55.3 of 29-Apr-2008

6 million entries in UniProtKB

Once upon a long, long time... This is how all fairy tales start, but was it really so long ago? No, it was December 2003 - just 4 and a half years ago, but it seems like ages - that UniProtKB was born. It was a beautiful baby, 1,220,020 entries fat and well supported by its 2 legs: the large TrEMBL and the small but knowledgeable Swiss-Prot. And the baby put on weight: on average 1,500 protein sequence entries per day in 2004 and during the first half of 2005. The more you have, the more you want, and from the middle of 2005 up to the beginning of 2007, UniProtKB was integrating about 3,500 new entries per day. And it hasn't stopped since: currently we are integrating approximately 5,000 entries per day and this number keeps growing. As a result, we are happy to announce that UniProtKB has reached the significant milestone of 6,074,524 entries. Note that this tremendous growth is not due to the submission of environmental samples, which are stored in another UniProt database: UniMES.

May we all live happily ever after and extract knowledge from this flood of data!

Release 55.2 of 08-Apr-2008

Dictyostelium discoideum on the move

Dictyostelium discoideum is a social amoeba known for its ability to alternate between unicellular and multicellular forms. Thanks to the availability of powerful molecular genetic tools, it is a convenient model to study fundamental cellular processes, such as cytokinesis, motility, phagocytosis, chemotaxis, signal transduction and aspects of development, including cell sorting, pattern formation and cell-type determination. It is one of 9 nonmammalian model organisms recognized by the National Institutes of Health (NIH) for their utility in the study of fundamental molecular processes of medical importance.

The 34 Mb genome of Dictyostelium discoideum was sequenced and assembled by an international consortium in 2005. Its gene-dense chromosomes encode approximately 12,500 predicted proteins, a high proportion of which have long, repetitive amino acid tracts.

In order to improve the coverage of functional annotation of Dictyostelium discoideum proteins, the UniProt consortium and dictyBase jointly organized a one-week Dictyostelium discoideum protein annotation jamboree at the Swiss Institute of Bioinformatics in Geneva last month. During this special event, more than 1,000 proteins were annotated by UniProtKB curators and about 30 gene models were corrected by dictyBase curators. In addition, more than 300 gene and protein names were standardized.

The close collaboration between UniProtKB and dictyBase will continue until the completion of Dictyostelium discoideum proteome annotation, planned for 2010.

The current UniProtKB/Swiss-Prot release contains 1,803 fully annotated Dictyostelium discoideum entries, which represent about 15% of the complete proteome. A complete non-redundant set of Dictyostelium discoideum proteins can be retrieved from UniProtKB with the keyword 'Complete proteome'.

Release 55.1 of 18-Mar-2008

A small but deadly pathogen: Hepatitis B virus

The Hepatitis B virus (HBV) causes transient and chronic infections of the liver and constitutes a major cause of human disease. It is estimated that more than 5% of the global population carries the virus, and deaths from liver cancer caused by HBV probably exceed one million per year (see WHO factsheet). An effective vaccine has been available for nearly 20 years, but its high cost still hampers disease control in the developing world.

This killer virus has a surprisingly small genome, about 3.2 kb, which nevertheless encodes 5 proteins through overlapping open reading frames. It replicates by reverse-transcribing genomic RNA to partial dsDNA through a unique mechanism, and thus belongs to a particular family: the Hepadnaviridae.

The virus specifically infects hepatocytes, and most symptoms in an acute infection result from the killing of infected cells by the host immune system. In a few cases, the virus manages to down-regulate the host immunity and establishes a chronic infection. A viral protein secreted in blood is suspected to be involved in chronicity: the HBeAg protein may specifically deplete T-helper lymphocytes, thereby suppressing the ability to mount a strong cytotoxic response against infected hepatocytes.

Our current knowledge of the virus is rather poor due to the lack of cell culture systems allowing in vitro viral propagation. Much of what we know is derived from the study of other closely related Hepadnaviridae, such as the woodchuck hepatitis virus (WHV) and the ground squirrel hepatitis virus (GSHV).

In the current UniProtKB/Swiss-Prot release, all hepatitis B virus entries have been updated, and 51 strains representative of the 8 genotypes infecting humans have been annotated. Animal hepatitis B virus entries have also been revisited, notably WHV and GSHV.

Release 55.0 of 26-Feb-2008

New major release is available (55.0)

Release 55.0 of 26-Feb-08 of UniProtKB/Swiss-Prot contains 356'194 sequence entries, comprising 127'836'513 amino acids abstracted from 165'776 references. 80'183 sequences have been added since release 54.0, the sequence data of 1'411 existing entries has been updated and the annotations of 262'009 entries have been revised.

The following improvements were carried out in the last 7 months:

UniProt Knowledgebase release 13.0 includes Swiss-Prot release 55.0 and TrEMBL release 38.0.

Release 54.8 of 05-Feb-2008

Over 20,000 fungal proteins manually annotated in UniProtKB/Swiss-Prot

Almost exactly one year after the integration of the complete proteome of Saccharomyces cerevisiae into UniProtKB/Swiss-Prot (see news), we have increased the number of manually annotated fungal entries to more than 20,000.

The fungal kingdom includes very diverse organisms, from unicellular to multicellular, from microscopic to macroscopic. Fungi have essential roles in many ecological processes. They are required for nutrient cycling within ecosystems, since they recycle dead organic matter into useful nutrients. Many plants would not survive without symbiotic fungi called mycorrhizae, which live in their roots and supply essential nutrients. They are also economically important as they provide numerous drugs (such as penicillin), food (such as mushrooms) and are used for their ability to ferment different sugars to produce bread, wine, beer and even soy sauce.

Fungi are also responsible for a great number of severe plant and animal diseases. Fungal infections, also called mycotic infections, may affect the skin or the internal organs of the body. Severe mycotic infections, such as histoplasmosis and candidiasis, are potentially life-threatening. Fungal diseases are very difficult to treat since fungi are eukaryotic organisms that share many properties with animal or human cells. Plant diseases caused by fungi include rusts and smuts, as well as leaf, root, and stem rot. They can cause severe damage to crop production.

Moreover, many fungi are important model organisms for studying the genetics and molecular biology of eukaryotes.

It is therefore not surprising that many fungi were targeted for complete genome sequencing. No fewer than 32 complete fungal genomes have been submitted to public sequence databases to date. Using the S. cerevisiae and Schizosaccharomyces pombe fully annotated proteomes as templates, we are progressively annotating orthologous proteins in these newcomers, in order to provide our users with a high-quality fungal protein dataset that will better reflect the diversity of this kingdom.

Release 54.7 of 15-Jan-2008

Addition of more than 40'000 microbial entries derived from automated annotation

Thanks to genome sequencing efforts, there has been a tremendous rise in the number of submitted protein sequences. And this is only the beginning, as faster and cheaper sequencing methods will greatly increase the rate at which new genomes are sequenced.

Semi-automated annotation methods are necessary in order to provide the users with a maximum number of annotated protein sequences. The approach used by UniProtKB/Swiss-Prot differs from most other automated methods as the bulk of the annotation procedure is still performed manually, since we want to make sure that we produce high quality annotation with a minimal amount of incorrect inferences.

Our first automatic annotation project is called HAMAP, which stands for High-quality Automated and Manual Annotation of microbial Proteomes. In the context of this project, proteins from complete bacterial and archaeal proteomes, together with the related plastid proteins, are automatically annotated based on manually created family rules for complete protein annotation, with template-based feature propagation. We are very aware of the danger posed by automatic annotation procedures and have been extremely careful in the implementation of the pipeline, establishing many checks and conditional propagation in order to ensure that automatic annotation will produce data of a quality equal to that of manual curation.

At this release, we have begun the procedure to integrate automatically into UniProtKB/Swiss-Prot the entries annotated by the HAMAP automated pipeline; over 40'000 bacterial and archaeal entries were integrated. This is the largest number of entries ever integrated at one release.

It must be noted that the planned introduction of 'evidence tags' should allow us to unambiguously flag whether an information item has been derived manually or automatically. For the time being, all entries annotated by the HAMAP pipeline have a cross-reference to HAMAP (for an example see entry Q02JM4).

Release 54.6 of 04-Dec-2007

Complete proteome for Arabidopsis thaliana in UniProtKB

Arabidopsis thaliana was the first plant to have its genome completely sequenced. A first round of annotation was performed in 2001 by the Arabidopsis Genome Initiative. The genome was later reannotated and is now maintained by The Arabidopsis Information Resource (TAIR) which assumes primary responsibility for Arabidopsis genome annotation.

As the genome sequencing was being completed, Swiss-Prot initiated the Plant Proteome Annotation Program (PPAP) whose main focus is the annotation of Arabidopsis (and rice) plant-specific proteins and protein families.

This ongoing program has so far produced more than 6'200 manually annotated Arabidopsis protein sequences in UniProtKB/Swiss-Prot. In addition, close to 44'000 Arabidopsis entries are available in UniProtKB/TrEMBL with a certain level of redundancy. Thus, the total number of protein sequences in UniProtKB for this model plant is much higher than the current estimate of 27'029 protein-encoding genes (see TAIR7 release of April 2007). To get around this problem, a non-redundant set of Arabidopsis proteins, including nuclear, mitochondrial and chloroplastic proteins, was created as of this release and the selected entries have been labelled with the keyword 'Complete proteome' to allow easy retrieval.

The current complete proteome set contains a total of 29'315 entries: 6'241 Arabidopsis thaliana in UniProtKB/Swiss-Prot and 23'074 in UniProtKB/TrEMBL.

Arabidopsis thaliana is the third 'green plant' (Viridiplantae) for which a complete nonredundant protein set has been created in UniProtKB. The other two are the unicellular green algae Ostreococcus tauri and Ostreococcus lucimarinus.

Release 54.5 of 13-Nov-2007

Acanthamoeba polyphaga mimivirus, a "giant" virus in UniProtKB/Swiss-Prot

Mimivirus (for mimicking microbe) is a new viral genus containing a single identified species, Acanthamoeba polyphaga mimivirus (APMV), discovered in 1992 by Didier Raoult's lab within the amoeba Acanthamoeba polyphaga during work on legionellosis. The virion has a non-enveloped, icosahedral capsid with a diameter of 400 nm and protein filaments projecting from its surface. The capsid contains the internal core surrounded by an internal lipid layer. Its linear, double-stranded DNA genome is roughly 1.2 million bp in length, the largest viral genome known so far. Its replication cycle, genome and capsid structure place it among the nucleocytoplasmic large DNA viruses (NCLDVs), which include amongst others the poxviruses and iridoviruses.

This virus is amazing in many ways. It is the largest virus ever isolated, with a genome size and complexity comparable to that of a small bacterium. A thorough bioinformatics analysis carried out by the group of Jean-Michel Claverie uncovered 909 potential protein-coding genes. Some of these proteins belong to families that are shared with all or some NCLDVs, many have eukaryotic counterparts and there are quite a number of ORFans (no sequence similarity to proteins from other genomes). It was a surprise to find an appreciable number of genes coding for proteins involved in metabolism, DNA repair pathways and, most surprising, genes encoding a partially functional protein translation apparatus. Mimivirus does indeed encode four aminoacyl-tRNA synthetases (ArgRS, CysRS, MetRS, TyrRS), as well as various translation initiation, elongation and termination factors. It is very intriguing to find, in a virus, genes corresponding to central components of the protein translation machinery, a biochemical process widely thought to be an exclusive signature of cellular organisms.

The discovery of this amazing virus has led to the concept of the "giant" virus and implies that there is an overlap in terms of particle dimension, genome size, and genetic complexity between the viral and cellular organism worlds.

A special effort has been made in the UniProtKB/Swiss-Prot database to provide the complete, fully annotated mimivirus proteome. We have also integrated all proteomics and structural information that has been made available by the groups of Jean-Michel Claverie and Chantal Abergel.

To get all UniProtKB mimivirus entries, click here.

Release 54.4 of 23-Oct-2007

More controlled vocabulary in the 'Subcellular location' subsection

Over 160'000 UniProtKB/Swiss-Prot entries (56%) contain a subcellular location description in the General Annotation section (CC lines in the flat file). We have standardized the content of these comments with the concomitant creation of a controlled vocabulary and a new, parsable flat-file format.

The subcellular location controlled vocabularies are stored in a new document (subcell.txt) which provides, for each individual UniProtKB location, topology or orientation term, the corresponding definition, as well as other relevant information, such as synonyms, hierarchies or mapped GO terms.

The format of the subcellular location subtopic has changed from free text to a more structured format. When required for the accurate description of a complex biological situation, free text is still used in the 'Note' (see for example O43918). In addition, since release 53.0, this subsection can occur more than once per entry, allowing specific annotation for each isoform, chain or peptide in separate subsections.

Release 54.3 of 02-Oct-2007

Oryza sativa (rice) species separated into japonica and indica subspecies in UniProtKB/Swiss-Prot entries

Although it has been a rule in UniProtKB/Swiss-Prot to merge all protein sequences encoded by the same gene in one species into a single record to avoid redundancy, this rule sometimes has to be adapted to specific cases. For example, this rule was applied to rice entries, causing sequences from various rice cultivars to be merged and entries to be tagged with the unique taxonomic identifier (ID) for the Oryza sativa species: 4530.

However, O.sativa comprises 2 subspecies: japonica and indica. A classification at subspecies level is already effective in several databases, including UniProtKB/TrEMBL, and most scientists use it when submitting new sequences. In EMBL/DDBJ/GenBank, there are over 1.2 million japonica and almost 360,000 indica sequences, coming mainly from large scale genome, cDNA or EST sequencing projects. The completion of both japonica and indica genomes and the analysis of multiple sets of subspecies-specific transcripts revealed a significant number of sequence variations and a divergence of expression pattern between japonica and indica subspecies. In order to provide clear information to its users, UniProtKB/Swiss-Prot had to adopt this classification and separate indica and japonica subspecies in rice entries.

Most rice entries contained exclusively japonica sequences and were quickly updated with the appropriate taxonomic ID. But over 220 rice entries contained merged sequences of japonica and indica subspecies and had to be "de-merged". This task was undertaken by the PPAP (Plant Proteome Annotation Program) team. Common information was kept in both japonica and indica entries, while expression patterns or other subspecies-specific experimental evidence were transferred to the entry where they belong. Today all rice entries are classified into either japonica or indica subspecies, with the exception of very few entries where the subspecies was not specified. When available, cultivars are indicated in the reference section. Each entry also provides cross-references to either japonica (cultivar nipponbare) or indica (cultivar 93-11) genomic sequences.

The gene nomenclature system ('Os' code) defined by RAP-DB and/or TIGR for the japonica cultivar nipponbare can be found in japonica entries in the gene names subsection (Ordered Locus Names). RAP-DB locus identifiers are listed in the rice.txt file.

To get all UniProtKB Japonica entries, click here.

To get all UniProtKB Indica entries, click here.

To get all UniProtKB rice entries, click here.

The mnemonic species identification code in the entry name makes it possible to quickly identify the subspecies to which the protein belongs: ORYSJ is the code for japonica, ORYSI for indica and the old ORYSA code indicates that the subspecies is not specified. The list of rice cultivars can be found in the strains.txt file.
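As an illustration of how these codes can be used, here is a minimal Python sketch that reads the subspecies from an entry name. It assumes that the species code is the part of the entry name after the last underscore; the function name and the example entry names are hypothetical and serve only as placeholders.

```python
# Rice species codes as given in the text above.
RICE_CODES = {
    "ORYSJ": "japonica",
    "ORYSI": "indica",
    "ORYSA": "subspecies not specified",
}


def rice_subspecies(entry_name):
    """Return the rice subspecies implied by an entry name, or None.

    Assumption: the species code follows the last underscore of the
    entry name. Names without an underscore, or with a non-rice code,
    yield None.
    """
    _protein_part, separator, species_code = entry_name.rpartition("_")
    if not separator:
        return None
    return RICE_CODES.get(species_code.upper())


# Hypothetical entry names, used only to show the call.
for name in ("EXMPL_ORYSJ", "EXMPL_ORYSI", "EXMPL_ORYSA", "EXMPL_HUMAN"):
    print(name, "->", rice_subspecies(name))
```

A lookup table is used rather than string tests so that the three codes stay in one place.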

Release 54.2 of 11-Sep-2007

Yeast PDR5: the first adopted protein in UniProtKB/Swiss-Prot

While progress in laboratory techniques allows the production of an ever-increasing flood of data, these data are still insufficiently exploited. One reason for this bottleneck is the lack of efficient integration into databases, making data more difficult, sometimes almost impossible, to access. The current information flow consists of two steps. First, scientists providing knowledge encode it in the format of a given journal. Then database curators have to decode and standardize it to make it computer-parsable and usable for further research.

In order to reduce this time-consuming and error-prone process and to make the most of expert scientists, UniProtKB/Swiss-Prot proposes a new strategy called 'Adopt a Protein', where researchers can adopt one or more specific proteins. 'Foster parents' make sure that the information concerning their favourite protein(s) is up-to-date. UniProtKB/Swiss-Prot provides them with a draft with the correct sequence, up-to-date sequence analysis predictions and a description of the main topics that require annotation, such as protein names, bibliographic references, comments and protein features. The input of 'foster parents' is acknowledged in the entry.

The yeast Saccharomyces cerevisiae is a popular model organism used in hundreds of laboratories around the world and its genome has been fully sequenced and extensively studied over the past decade. Moreover, the yeast community has a long tradition of sharing information. Therefore, the yeast proteome has been chosen as a test platform to initiate the 'Adopt a Protein' scheme.

This release contains the first fully annotated adopted protein: PDR5. PDR5 is a 160-kDa yeast pleiotropic ABC efflux transporter of multiple drugs localized in the plasma membrane. It belongs to the ABC (ATP-binding cassette) transporter family, PDR subfamily. The PDR subfamily is specific to fungi and plants and exhibits distinctive structural features, such as an unusual alternation of nucleotide binding and membrane domains, a pair of extended extracellular loops and a degenerate ATP binding domain. Yeast strains lacking PDR5 are used for toxicity tests, whereas those overexpressing PDR5 are used for screening antifungal sensitizers.

PDR5 has been adopted by Professor André Goffeau from the Catholic University of Louvain (Belgium). We are grateful to him for committing precious time to help produce an annotation useful to the whole community. We hope that PDR5 is only the first member of a big adopted family! If you want to become a 'foster parent', please contact the UniProtKB/Swiss-Prot Fungal Proteome Annotation Program (FPAP).

Release 54.1 of 21-Aug-2007

More than 18'500 phosphorylation sites identified by mass spectrometry in UniProtKB/Swiss-Prot

Phosphorylation is a key reversible modification that regulates protein function, subcellular localization, stability, and interactions. It is believed that up to 30% of all eukaryotic proteins may be phosphorylated.

During the last few years, phosphoproteomics has greatly improved due to the optimization of enrichment protocols for phosphoproteins and phosphopeptides, better fractionation techniques using chromatography, and improvement of mass spectrometry instrumentation. Thanks to these developments, it is now possible to analyze entire phosphorylation sets rapidly. However, protein and phosphorylation site identification by mass spectrometry is crucially dependent on the quality and completeness of the biological resource used for analysis.

In UniProtKB/Swiss-Prot, we make a special effort to document post-translational modifications and especially phosphorylation sites, using data from the literature.

We have incorporated data from 38 high-quality phosphoproteomics studies which have allowed us to annotate or confirm 18'556 phosphorylation sites in 6'493 protein entries, mainly from human (45%), mouse (27%) and yeast (25%), but also from rat, Arabidopsis thaliana and bacteria. These high-throughput studies can be easily recognized among other UniProtKB references through the [LARGE SCALE ANALYSIS] tag appearing in the RP line.
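For users working with the flat file, the tag can be picked out with a few lines of Python. The sketch below is only an illustration: it assumes that the tag appears on a single line beginning with the RP line code, and the sample lines are invented for the example rather than copied from a real entry.

```python
LARGE_SCALE_TAG = "[LARGE SCALE ANALYSIS]"


def large_scale_references(flat_file_text):
    """Return the content of RP lines that carry the large-scale tag.

    Assumption: the tag is not split across two RP lines.
    """
    tagged = []
    for line in flat_file_text.splitlines():
        if line.startswith("RP ") and LARGE_SCALE_TAG in line:
            # Drop the two-letter line code and the padding after it.
            tagged.append(line[2:].strip())
    return tagged


# Invented sample lines in flat-file style.
SAMPLE = """\
RN   [1]
RP   NUCLEOTIDE SEQUENCE [MRNA].
RN   [2]
RP   PHOSPHORYLATION [LARGE SCALE ANALYSIS] AT SER-10, AND MASS SPECTROMETRY.
"""

for rp_line in large_scale_references(SAMPLE):
    print(rp_line)
```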

Click here to obtain the complete list of UniProtKB/Swiss-Prot entries having at least one phosphorylation site found in proteomic studies.

Release 54.0 of 24-Jul-2007

New major release is available (54.0)

Release 54.0 of 24-Jul-07 of UniProtKB/Swiss-Prot contains 276'256 sequence entries, comprising 101'466'206 amino acids abstracted from 158'294 references. 7'104 sequences have been added since release 53.0: this represents a 3% increase. In addition, the sequence data of 690 existing entries have been updated and the annotations of 269'152 entries have been revised.
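The percentage quoted above can be checked with a short calculation. This Python sketch assumes that the increase is measured as the number of added sequences relative to the entry count of the previous release (269'293 for release 53.0, as reported in that release's announcement); the helper names are illustrative only.

```python
def parse_count(text):
    """Convert a count written with apostrophe separators, e.g. 7'104, to int."""
    return int(text.replace("'", ""))


def percent_increase(added, previous_total):
    """Added sequences as a percentage of the previous release's entries."""
    return 100.0 * parse_count(added) / parse_count(previous_total)


# Release 54.0: 7'104 sequences added to the 269'293 entries of release 53.0.
growth = percent_increase("7'104", "269'293")
print(f"{growth:.1f}%")
```

The result is a little under 3%, consistent with the rounded figure given in the announcement.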

The following improvements were carried out in the last 2 months:

UniProt Knowledgebase release 12.0 includes Swiss-Prot release 54.0 and TrEMBL release 37.0.

Release 53.3 of 10-Jul-2007

Knottins or how to knit in the protein world

Knottins (also called inhibitor cystine knots or ICKs) are small disulfide-rich proteins characterized by a special "disulfide through disulfide knot". This knot is obtained when one disulfide bridge crosses the macrocycle formed by two other disulfides and the interconnecting backbone (disulfide 3-6 goes through disulfides 1-4 and 2-5).
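The connectivity described above can be written down in a few lines of Python. The sketch numbers the six cysteines in sequence order and pairs them 1-4, 2-5 and 3-6; the function names and the residue positions in the example are hypothetical.

```python
def knottin_bridges(cys_positions):
    """Pair six cysteine positions as 1-4, 2-5 and 3-6 (sequence order)."""
    if len(cys_positions) != 6:
        raise ValueError("expected exactly six cysteine positions")
    c = sorted(cys_positions)
    return [(c[0], c[3]), (c[1], c[4]), (c[2], c[5])]


def knot_forming_bridge(cys_positions):
    """Return the 3-6 bridge, the one that crosses the macrocycle
    formed by the 1-4 and 2-5 bridges and the backbone between them."""
    return knottin_bridges(cys_positions)[2]


# Hypothetical cysteine positions in a small peptide.
positions = [3, 10, 16, 20, 22, 28]
print(knottin_bridges(positions))
print(knot_forming_bridge(positions))
```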

The knottin structure is found in many unrelated families, such as plant protease inhibitors, cyclotides, toxins from cone snails, spiders, insects, horseshoe crabs and scorpions, gurmarin-like peptides, agouti-related proteins, and antimicrobial peptides.

In collaboration with Laurent Chiche (CNRS, Montpellier), about 450 UniProtKB/Swiss-Prot entries have been updated with knottin structural information. They can be retrieved with the newly introduced keyword Knottin.

Examples:

Release 53.2 of 26-Jun-2007

Obesity in the spotlight

Over the last 40 years, overweight and obesity have become a central health issue in a growing number of countries. Obesity comorbidities are severe and include cardiovascular diseases, diabetes, musculoskeletal disorders and some cancers. The two fundamental causes of obesity are clearly identified as an increased intake of high-fat and energy-dense diets and a decrease of physical activity. However, there is growing evidence that certain gene products have a direct or indirect influence on body mass.

In 1999, the mouse Fto gene was cloned and called Fatso, because of its large size (at least 250 kb). By a curious coincidence, the human orthologous protein was recently shown to predispose to childhood and adult obesity. The main culprits are intronic variations in the FTO gene. Carriers of one (or two) inherited copy (copies) of the variants have an increased risk of obesity of 30% or 70%, respectively. The function of Fatso is not yet known. This protein, along with other proteins involved in the development of obesity, can be retrieved from UniProtKB/Swiss-Prot using the keyword Obesity.

Release 53.1 of 12-Jun-2007

4'000 bovine entries in UniProtKB/Swiss-Prot

UniProtKB/Swiss-Prot is happy to announce the annotation of over 4'000 entries of a very popular animal in Switzerland, almost a national symbol: Bos taurus, in other words the cow.

Those of you who have visited the Swiss Alps know that their gorgeous scenery is definitely associated with the sound of cowbells in summer pasture. Similarly, the modern biology landscape would be poorer without bovine sequences, obviously not in a decorative role, but as a key element for our understanding of human biology.

The domesticated cow is extensively used in biomedical research, as an animal model and also as a source of biological material. Remember that bovine insulin was the first sequenced protein and was used for decades to treat diabetes. The first draft of the bovine genome sequence was released in October 2004 by the Human Genome Sequencing Center of the Baylor College of Medicine. The human and bovine genomes are more similarly organized than when either is compared to the mouse. Despite its interest, only a few large scale cDNA sequencing projects have been initiated. Currently more than 70% of the UniProtKB/Swiss-Prot bovine sequences come from translation of cDNA sequences produced by the NIH Mammalian Gene Collection and the Agricultural Research Service, US Department of Agriculture.

Release 53.0 of 29-May-2007

New major release is available (53.0)

Release 53.0 of 29-May-07 of UniProtKB/Swiss-Prot contains 269'293 sequence entries, comprising 98'902'758 amino acids abstracted from 156'204 references. 9'228 sequences have been added since release 52.0: this represents a 3.5% increase. In addition, the sequence data of 734 existing entries have been updated and the annotations of 210'454 entries have been revised.

The following improvements were carried out in the last 3 months:

UniProt Knowledgebase release 11.0 includes Swiss-Prot release 53.0 and TrEMBL release 36.0.

Until now, metagenomic and environmental sequences were missing from UniProt. This gap is now filled with the introduction of a new ftp directory, UniMES, which allows the download and subsequent analysis of these increasingly important sequences.

Release 52.5 of 15-May-2007

Links to Wikipedia

While UniProt is a central resource for biologists, some specialized information is beyond the scope of our database. Therefore we link UniProtKB entries to more specialized resources:

We recently added links to the free encyclopedia Wikipedia in the web resource section. Proteins with a link to Wikipedia are mainly of medical or pharmaceutical interest. Wikipedia articles may describe the discovery of the protein and its use in medicine.

Examples:

Release 52.4 of 01-May-2007

T Rex and us

We have introduced the oldest fossil protein sequence to date into UniProtKB/Swiss-Prot, i.e. several peptides from collagen (P0C2W2, P0C2W3, P0C2W4) which were extracted from a 68 million year-old dinosaur: Tyrannosaurus rex. These collagen sequences were obtained by mass spectrometry analysis directly from soft tissue that remained in fossilized bones, which were unearthed from rocks in the Hell Creek Formation of eastern Montana, US.

Interestingly, Tyrannosaurus rex collagen is similar to chicken collagen, and similarities have also been found with frog and newt protein. The finding is consistent with the idea that we can trace a direct evolutionary line between birds and dinosaurs (for more information: PMID 17431180.)

The discovery of protein in the soft tissue of dinosaur bone is a surprise: it was not thought that such organic material could survive this long. "The pathways of cellular decay are well known for modern organisms. And extrapolations predict that all organic matter vanishes within 100,000 years, maximum" (BBC news).

Until now, the oldest fossil protein sequence in UniProtKB/Swiss-Prot was a RuBisCO large subunit from a fossil leaf of a Miocene (17-20 million years old) Magnolia, P30828 (see headline release 43.1 of 13-Apr-2004).

You can get all these aged proteins by clicking on the keyword Extinct organism protein.

Other reference:

Protein Spotlight (May 2004) Small blast from the past

Release 52.3 of 17-April-2007

More than 630 F-box proteins from Arabidopsis thaliana in UniProtKB/Swiss-Prot

F-box proteins play a major role in the ubiquitin conjugation pathway, where they are involved in the third step. Most F-box proteins contain a conserved F-box domain near the N-terminus and a variable region. The F-box domain can interact with Cullin and one of the SKP1 proteins to form an E3 SCF (SKP1/Cullin/F-box) ubiquitin ligase complex. The variable region interacts with a specific protein, which is, in turn, ubiquitinated and thus targeted for degradation. This variable region confers the specificity of the SCF complex.

The whole set of Arabidopsis thaliana F-box proteins, more than 630 sequences, has been manually reviewed and integrated into UniProtKB/Swiss-Prot. About 120 wrong gene model predictions have been corrected, including 26 F-box proteins obtained by splitting erroneous gene predictions covering more than one gene. This represents one of the largest protein families of a given species ever integrated into UniProtKB/Swiss-Prot.

In A. thaliana, almost half of the F-box proteins contain a combination of different domains, which is used to define subgroups:

Number  Subgroup  Domain composition
>300    FB        F-box alone
91      FBL       F-box associated with LRR-repeat
124     FBK       F-box associated with Kelch-repeat
30      FBD       F-box associated with FBD
41      FDL       F-box associated with FBD and LRR-repeat
4       FBLK      F-box associated with LRR-repeat and Kelch-repeat
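
The subgroup definitions listed above amount to a lookup from the set of domains found alongside the F-box to a subgroup code. A minimal Python sketch of that lookup follows; the function name and the domain labels ("LRR", "Kelch", "FBD") are illustrative assumptions and not part of any UniProt tool.

```python
# Subgroup codes from the list above, keyed by the set of domains that
# accompany the F-box domain. The domain labels are illustrative only.
SUBGROUPS = {
    frozenset(): "FB",
    frozenset({"LRR"}): "FBL",
    frozenset({"Kelch"}): "FBK",
    frozenset({"FBD"}): "FBD",
    frozenset({"FBD", "LRR"}): "FDL",
    frozenset({"LRR", "Kelch"}): "FBLK",
}


def fbox_subgroup(extra_domains):
    """Return the subgroup code for an F-box protein.

    `extra_domains` is any iterable of domain labels found in addition to
    the F-box. Returns None when the combination is not one of the six
    subgroups listed in the headline.
    """
    return SUBGROUPS.get(frozenset(extra_domains))


print(fbox_subgroup([]))              # F-box alone
print(fbox_subgroup(["LRR", "FBD"]))  # F-box with FBD and LRR-repeat
```

Using a frozenset as the key makes the lookup independent of the order in which the domains are listed.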

Among this large protein family, fewer than 30 members have been characterized: their functions are varied and include flowering, circadian cycle, hormone signaling, and plant defense.

Related entries:

Release 52.2 of 03-April-2007

Update of a spider dermonecrotic toxin family

Loxosceles is the genus of spiders that includes the infamous brown recluse spider Loxosceles reclusa. These spiders, also called violin spiders or fiddleback spiders because of violin-like marks on their cephalothorax, are brownish-yellow in color, and spin small, irregular webs under rocks, or in nooks and crannies of your house. These spiders are found in the USA, South America, Europe and Africa. Their most characteristic feature is actually their eyes: most spiders have eight eyes, but Loxosceles have six, arranged in three pairs, or dyads, that sit side-by-side.

The bite of a Loxosceles spider is not deadly, but it is very unpleasant - the venom is necrotoxic, causing tissue to die and fall off. Pain usually doesn't begin until 6-12 hours after the bite occurs. Loxosceles' necrotoxic venom is cytotoxic and hemolytic. It contains at least 8 enzymes. The enzyme thought to be responsible for most of the destructive effects is called Sphingomyelinase D. This enzyme catalyzes the hydrolysis of sphingomyelin and causes hemolysis and dermonecrosis.

The annotation of this family of toxins has just been updated in UniProtKB/Swiss-Prot (e.g. Q8I914 and P83045).

Release 52.1 of 20-March-2007

Koala genome invaded by a new retrovirus

Endogenous retroviruses are vestiges of ancestral viral infections that were incorporated into a host's genome a long time ago. Surprisingly, 8% of the human genome is composed of such "fossil" viruses (1). The most recent endogenization event is a porcine virus that entered its host approximately 5,000 years ago.

Recently a new endogenous retrovirus was identified in Australian koala populations.

Koalas were largely exterminated on mainland southern Australia in the late nineteenth century. Populations were established on a small number of islands in the early 1900s and have remained isolated since the 1920s. These populations have since been used to restock the mainland.

The new Koala retrovirus (KoRV) has only been found in mainland populations, suggesting that this virus entered the koala species in the last 100 years (2). This retrovirus is both endogenous and fully functional, meaning that it spreads both by contact and by heredity, and is still in the process of invading the koala genome. KoRV is very similar to Gibbon Ape Leukemia Virus (GALV), and these two retroviruses are thought to have diverged very recently. This suggests a scenario in which a monkey retrovirus has crossed species to enter a newly established koala population and has started to colonize the koala genome.

The KoRV is unique in that we are observing the initial entry of a new family of endogenous retrovirus into a wild host genome. The dynamic interaction between this virus and its new host provides a unique opportunity to study the process of endogenization and its impact on species development and evolution.

Related entries

References

1. Griffiths D.J.
Endogenous retroviruses in the human genome sequence
Genome Biology 2:reviews1017.1-1017.5 (2001).

2. Tarlinton R.E., Meers J., Young P.R.
Retroviral invasion of the koala genome
Nature 442:79-81 (2006).

Release 52.0 of 06-March-2007

New major release is available (52.0)

UniProt Knowledgebase release 10.0 includes Swiss-Prot release 52.0 and TrEMBL release 35.0.

Release 52.0 of 06-Mar-07 of UniProtKB/Swiss-Prot contains 260'175 sequence entries, comprising 95'002'661 amino acids abstracted from 152'564 references. 18'986 sequences have been added since release 51.0: this represents an increase of 7.3 %. In addition, the annotations of 190'910 entries have been revised.

Many improvements were carried out in the last 4 months:

UniProtKB/Swiss-Prot (flat file version) reached 1 gigabyte (GB) in size with this major release! For comparison, the human genome contains 0.791175 GB of data (3.1647×10^9 base pairs, each represented by 2 bits) (Wikipedia).
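
The genome figure quoted above can be checked with a few lines of arithmetic. The Python sketch below assumes a decimal gigabyte (10^9 bytes) and 2 bits per base pair; the function name is made up for the illustration.

```python
BITS_PER_BYTE = 8
BYTES_PER_GB = 10**9  # assumption: decimal gigabyte


def genome_size_gb(base_pairs, bits_per_base=2):
    """Size of a genome in GB when each base pair is stored in `bits_per_base` bits."""
    total_bits = base_pairs * bits_per_base
    return total_bits / BITS_PER_BYTE / BYTES_PER_GB


# 3.1647 x 10^9 base pairs, the figure quoted in the headline
print(genome_size_gb(3_164_700_000))
```

At 2 bits per base, four base pairs fit in one byte, so the result is simply the number of base pairs divided by 4 x 10^9.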

Release 51.7 of 20-Feb-2007

Complete human kinome in UniProtKB/Swiss-Prot

Phosphorylation by protein kinases is a universal and fundamental cell-signalling process in eukaryotic cells. A comprehensive catalog of predicted human kinases was published in 2002 (Manning et al.).

We have annotated the 518 protein kinases predicted to exist and, when necessary, revised their sequences. The human kinome as defined by Manning et al. is now complete in UniProtKB/Swiss-Prot!

These protein kinases are subdivided into 10 groups

In addition to these 518 protein kinases, there is currently one family of lipid kinases which is being fully characterized: the phosphatidylinositol 3-kinase (PI3 kinase) family (PI3 kinome). This emerging family appears to also include the phosphatidylinositol 4-kinases (PI4 kinases). PI4 kinases and PI3 kinases share the same catalytic kinase domain. However, this domain is only distantly related to the catalytic domain of the protein kinases, and as a consequence they belong to a separate family. This lipid kinase family will soon be integrated into UniProtKB/Swiss-Prot.

Mouse kinase orthologs are all in the process of being integrated into UniProtKB/Swiss-Prot. By providing annotated and up-to-date human and mouse kinomes to the scientific community, our knowledgebase becomes a central reference portal for kinases.

Release 51.6 of 06-Feb-2007

One million comment lines in UniProtKB/Swiss-Prot!

Annotation is the focal point of our effort to maintain and develop UniProtKB/Swiss-Prot. Much of our manual annotation is found in the comment lines, which aim to provide a summary of what is known about a protein. There are 27 different types of comment line, which are arranged according to what we designate as 'topics'.

Recently, we reached the milestone of 1 million CC topic lines. About 97% of the UniProtKB/Swiss-Prot entries contain at least one CC topic line and, currently, there is an average of 4 different CC topic lines per entry.
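
As a rough illustration of how such counts can be obtained, the Python sketch below lists the distinct CC topics in one flat-file entry. It assumes the usual flat-file layout in which each comment block starts with "CC   -!- TOPIC:" and continuation lines are indented; the function name and the sample text are invented for the example.

```python
import re

# A comment block starts with "CC   -!- TOPIC: free text".
TOPIC_LINE = re.compile(r"^CC   -!- ([A-Z][A-Z ]*?):")


def cc_topics(entry_text):
    """Return the distinct CC topics of one entry, in order of first appearance."""
    topics = []
    for line in entry_text.splitlines():
        match = TOPIC_LINE.match(line)
        if match and match.group(1) not in topics:
            topics.append(match.group(1))
    return topics


# Made-up entry fragment, for illustration only.
SAMPLE = """\
CC   -!- FUNCTION: Hypothetical example text that runs over
CC       two lines.
CC   -!- SUBCELLULAR LOCATION: Cytoplasm.
CC   -!- SIMILARITY: Belongs to an example family.
CC   -!- SIMILARITY: Contains 1 example domain.
"""

print(cc_topics(SAMPLE))
```

Continuation lines do not match the pattern, so a comment that wraps over several lines is counted once, and a topic that occurs twice in an entry (SIMILARITY in the sample) is reported a single time.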

Comment lines are mainly free text, but we have already set up a standardised format as well as the use of controlled vocabularies for several topics (ALTERNATIVE PRODUCTS, BIOPHYSICOCHEMICAL PROPERTIES, CATALYTIC ACTIVITY, DISEASE, INTERACTION, MASS SPECTROMETRY, PATHWAY, RNA EDITING, SIMILARITY, TOXIC DOSE...). Standardisation of two further topics, SUBCELLULAR LOCATION and CAUTION, is also on its way (more: Forthcoming changes).

The most represented CC topics in UniProtKB/Swiss-Prot are:

Such a distribution reflects the type of experimental biological data which is available for a protein sequence nowadays in the scientific literature.

The data found in UniProtKB/Swiss-Prot are continuously updated and, since annotators are constantly improving their skills in literature-based information retrieval, the 'depth' of manual annotation is always increasing. This is highlighted by the fact that we have increased the average number of CC topics per entry from 3.5 to 4 since March 2004 (see also the release statistics).

Release 51.5 of 23-Jan-2007

Reintroduction of the initiator methionine

In UniProtKB/Swiss-Prot, the sequence data corresponds to the precursor form of a protein, i.e. before post-translational modifications such as cleavage of the signal peptide or other processing. However, for historical reasons, a notable exception was made: when the initiator methionine was post-translationally removed, the sequence stored in UniProtKB/Swiss-Prot did not include the methionine and instead started with the second residue.

As a consequence, our sequence data differed from that shown in other sequence databases where the initiator methionine is usually not removed. This discrepancy was confusing for users and was the subject of one of the most frequently asked questions to UniProtKB/Swiss-Prot.

This is no longer the case. With this release, the initiator methionine has been reintroduced into all UniProtKB/Swiss-Prot entries (over 10'000) from which it is cleaved. This was a major change, since all amino acid positions described in these entries have been updated to reflect the new sequence numbering.

The cleavage of the initiator methionine is still indicated by the INIT_MET line in the feature table, but the sequence position is now 1 instead of 0. We also added the comment 'Removed' in the description field of the INIT_MET line to indicate that the initiator methionine is indeed removed post-translationally.
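
The effect of the change on an entry can be sketched in a few lines of Python: every feature position moves up by one, the INIT_MET feature receives the 'Removed.' description, and the methionine is prepended to the sequence. This is only an illustration of the renumbering, not the procedure used to convert the knowledgebase; the function name and the tuple layout are assumptions.

```python
def reintroduce_init_met(features, sequence):
    """Convert an entry from the previous numbering to the new one.

    `features` is a list of (key, start, end, description) tuples in the
    previous numbering, where a removed initiator methionine sat at
    position 0. Returns the renumbered features and the new sequence.
    """
    updated = []
    for key, start, end, description in features:
        if key == "INIT_MET":
            description = "Removed."
        # Every position shifts by one residue once the methionine is restored.
        updated.append((key, start + 1, end + 1, description))
    return updated, "M" + sequence


old_features = [("INIT_MET", 0, 0, ""), ("CHAIN", 1, 400, "Phosrestin-1.")]
new_features, new_sequence = reintroduce_init_met(old_features, "VVSVKVFKK")
print(new_features)
print(new_sequence)
```

Only the first nine residues of the sequence are used here to keep the example short.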

Example P51487:

Previous format:

FT   INIT_MET      0      0       
FT   CHAIN         1    400       Phosrestin-1.
...
SQ   SEQUENCE   400 AA;  44781 MW;  DA786D7E9FFB4A29 CRC64;
      VVSVKVFKK ATPNGKVTFY LGRRHFIDHF DYIDPVDGVI VVDPDYLKNR KVFAQLATIY

New format:

FT   INIT_MET      1      1       Removed.
FT   CHAIN         2    401       Phosrestin-1.
...
SQ   SEQUENCE   401 AA;  44912 MW;  1212C2422CD35A94 CRC64;
     MVVSVKVFKK ATPNGKVTFY LGRRHFIDHF DYIDPVDGVI VVDPDYLKNR KVFAQLATIY

Release 51.4 of 10-Jan-2007

Complete yeast proteome in UniProtKB/Swiss-Prot

Brewer's yeast or baker's yeast are two common names for the species Saccharomyces cerevisiae, for which the scientifically correct name was first applied to a strain observed in malt circa 1837. These common names neatly reflect the major interests this organism holds for the majority of people. It is one of the earliest "domesticated" organisms, and while initially appreciated for its alcohol producing or dough leavening capabilities, the simple yeast soon became an important organism for research too.

The ease with which yeast can be cultivated and genetically manipulated made it a useful tool in the early days of biotechnological and biomedical research, where it was utilized for the production of pharmaceuticals and enzymes (a name that originates from the Greek 'en zymi' = in yeast). S.cerevisiae has subsequently proven to be an extremely useful experimental model system for the study of the basic biological structures and processes of the eukaryotic cell. It is therefore not surprising that it was one of the first eukaryotic species targeted by large-scale sequencing efforts, and in 1996, researchers were able to celebrate the completion of the first eukaryotic genome sequence.

One decade later, and coincident with the 20th anniversary of Swiss-Prot, yeast is again in the headlines, representing the first complete eukaryotic proteome integrated into Swiss-Prot, the manually curated section of the UniProt knowledgebase. In the current release of UniProtKB/Swiss-Prot there are more than 6'000 yeast entries containing every gene of the yeast genome believed to code for a protein. Each entry contains literature-curated annotations and numerous cross-references, the locus identifier, which maps a protein to its corresponding genomic locus, and a cross-reference to the Saccharomyces Genome Database (SGD), the community-designated repository for the reference genome sequence. A summary of all yeast entries including these references is listed in the file yeast.txt.

In the 10 years since the initial release of the S.cerevisiae genome, the annotation of protein encoding genes has continually evolved. New open reading frames have been identified and existing predicted ORFs have been revised or retired. In collaboration with SGD we have revisited and updated all entries for which the protein sequence has been changed since the initial release in order to provide users with a set of yeast proteins that corresponds to the most current view of the yeast proteome.

Ten years of post-genomic research have yielded a wealth of information on yeast proteins and we will continually revisit yeast entries to update their functional annotation. S.cerevisiae continues to be at the forefront of experimental molecular biology, particularly in the field of proteomics, and the availability of the complete proteome in UniProtKB/Swiss-Prot will facilitate the mapping and integration of results from large-scale proteomic studies. S.cerevisiae will also serve in the future as one of the model systems for functional annotation in UniProtKB/Swiss-Prot. As one of the best-characterized of the eukaryotic organisms, its proteins will provide many templates for the creation and annotation of fungal-specific or broader eukaryotic protein families.

Release 51.3 of 12-Dec-2006

Major update of a re-emerging pathogen: Dengue virus

Dengue is a mosquito-borne virus found in tropical and sub-tropical regions around the world, predominantly in urban and semi-urban areas in Southeast Asia, Africa, and South America. Dengue virus is transmitted through the bite of Aedes aegypti mosquitoes.

In the 1970s, the disease had receded thanks to an active vector control program. But since the 1980s, both the virus and its vector have re-emerged and spread even more widely than before: the disease is now found in more than 100 countries. The reasons for this re-emergence might be the growing extension of urban areas and the discontinuation of the vector control program.

The virus is transmitted to humans by mosquito bite and replicates in skin dendritic cells before infecting lymph nodes and blood cells. The symptoms are fever and pain that can last for up to 7 days. In rare cases, human infection leads to dengue haemorrhagic fever (DHF), a potentially lethal complication. Today DHF affects most Asian countries and has become a leading cause of hospitalisation and death among children in several of them.

Some 2500 million people, two fifths of the world's population, are now at risk from dengue. WHO currently estimates there may be 50 million cases of dengue infection worldwide every year. The mild autumn of 2006 favoured prolonged spread of the vector and was responsible for a major outbreak of dengue in India, with many cases in New Delhi.

The growing number of dengue virus sequences (more than 3400 in UniProtKB/TrEMBL) and the absence of a taxonomic nomenclature do not facilitate the identification of medical samples.

In the current UniProtKB/Swiss-Prot release, a systematic nomenclature has been adopted for 28 representative dengue strains, indicating the country and the year of isolation besides the strain name.

Example: Dengue virus type 2 (strain TH-36)
becomes: Dengue virus type 2 (strain Thailand/TH-36/1958)
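
Judging from this single example, the systematic name appears to follow the pattern "country/strain/year" inside the strain qualifier. The Python sketch below builds and parses names on that assumption; the function names and the regular expression are illustrative and are not taken from any UniProt specification.

```python
import re


def systematic_strain_name(species, strain, country, year):
    """Build a name of the form 'species (strain country/strain/year)'."""
    return f"{species} (strain {country}/{strain}/{year})"


# Assumed layout: country first, four-digit year of isolation last.
SYSTEMATIC_NAME = re.compile(
    r"^(?P<species>.+) \(strain (?P<country>[^/]+)/(?P<strain>.+)/(?P<year>\d{4})\)$"
)


def parse_strain_name(name):
    """Split a systematic name into its parts, or return None if it does not match."""
    match = SYSTEMATIC_NAME.match(name)
    return match.groupdict() if match else None


name = systematic_strain_name("Dengue virus type 2", "TH-36", "Thailand", 1958)
print(name)
print(parse_strain_name(name))
```

A name in the old style, such as "Dengue virus type 2 (strain TH-36)", does not match the pattern and yields None, which gives a simple way to spot entries that have not yet been renamed.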

The virus (+)RNA genome codes for a single polyprotein, cleaved into more than 12 products. 32 representative dengue virus polyproteins have been annotated and are available from UniProtKB/Swiss-Prot (e.g. P33478).

Release 51.2 of 28-Nov-2006

All known human G protein-coupled receptor proteins in UniProtKB/Swiss-Prot

The Human Proteome Initiative (HPI) aims to annotate all known human protein sequences, as well as their mammalian orthologs. The G protein-coupled receptor proteins (GPCRs), also known as seven transmembrane receptors (7TM receptors), form one of the largest protein families in mammalian genomes. These proteins are involved in all types of stimulus-response pathways, from intercellular communication to physiological senses, including taste, smell, and vision (opsin receptors). Many diseases are linked to GPCRs, and half of the drugs produced by the pharmaceutical industry target GPCRs. Special emphasis has been given to this family in the HPI project.

In the current release, all known and potential human G protein-coupled receptor proteins are annotated and integrated in UniProtKB/Swiss-Prot. 775 human GPCRs are now available in our knowledgebase. About half of all GPCRs are presumed to be involved in the sense of smell. For the remaining half, the active ligand has been documented when available, but about 20% of human GPCRs are still orphans. Most mouse and rat orthologs have also been annotated.

All G protein-coupled receptor proteins annotated in UniProtKB/Swiss-Prot are classified by family and listed in the file 7tmrlist.txt.

Release 51.1 of 14-Nov-2006

CD antigens: molecular markers of cell differentiation

The CD nomenclature was proposed and established in 1982 at the first International Workshop and Conference on Human Leukocyte Differentiation Antigens (HLDA). This nomenclature system was intended for the classification of monoclonal antibodies (mAbs), generated in many laboratories around the world, against various cell surface molecules on leukocytes (white blood cells). The data were collated and analyzed by the statistical procedure of 'cluster analysis'. This analytical method identified clusters of antibodies with very similar patterns of binding to leukocytes at various stages of differentiation: hence the use of the abbreviation 'CD' for 'cluster of differentiation'. CD antibodies are used widely for research, differential diagnosis, monitoring and treatment of disease.

The HLDA workshops assign each CD on the basis of the reactivity of at least two mAbs to one human antigen; the provisional indicator 'w' (for example CDw293) is sometimes given to an imperfectly characterized cluster or to a cluster represented by only one mAb.

Gradually the use of the CD nomenclature has expanded to many other cell types such as endothelial and stromal cells. Therefore the 8th HLDA conference (HLDA8) decided in 2004 that the acronym HLDA would be succeeded by HCDM for "Human Cell Differentiation Molecules".

All currently defined human CD antigens (a total of 361 in this release) are annotated and integrated in UniProtKB/Swiss-Prot. A CD antigen appears in an entry as a synonym of the protein name (e.g. CD305 antigen for Leukocyte-associated immunoglobulin-like receptor 1). The CD name is also propagated to all orthologous mammalian proteins, so that human CD antigens and their orthologs in other mammals can be retrieved easily.

Release 51.0 of 31-Oct-2006

New major release is available (51.0)

Release 51.0 of 31-Oct-06 of UniProtKB/Swiss-Prot contains 241'242 sequence entries, comprising 88'541'632 amino acids abstracted from 148'048 references.

19'061 sequences have been added since release 50.0, the sequence data of 1'336 existing entries has been updated and the annotations of 222'181 entries have been revised.

Many improvements were carried out in the last 5 months:

All the recent changes to the UniProt Knowledgebase format are described in detail in the continuously updated document:

http://www.expasy.org/sprot/relnotes/sp_news.html

UniProt Knowledgebase release 9.0 includes Swiss-Prot release 51.0 and TrEMBL release 34.0. For more information you can also read the release notes for the UniProt Knowledgebase, i.e. Swiss-Prot and TrEMBL.

Release 50.9 of 17-Oct-2006

Human polymorphisms: juggling with health and disease

Recent advances in genomics and proteomics promise to give new insights into the molecular mechanisms of diseases and hopefully will lead to the discovery of novel treatments. The integration of phenotype descriptions along with sequence data, genetic information, as well as physiological, biochemical and structural knowledge may help understand the chain of events leading from a molecular defect to a pathology. In this context, UniProtKB/Swiss-Prot provides the scientific community with a wealth of information on genetic diseases, disease-linked variants and polymorphisms.

In the current release, over 2'000 human entries contain a disease description in the comment section under the topic DISEASE. The disease description is short, but it is supplemented with links to the OMIM database, allowing the retrieval of more detailed information about genetic disorders. Additional links to gene-specific databases can be found in the 'WEB RESOURCE' topic.

At the sequence level, close to 28'500 human single amino acid polymorphisms (SAPs) are described, more than half of which are associated with a disease state and about 30% of which are linked to the Single Nucleotide Polymorphism database (dbSNP). SAPs are described in the feature table and characterized by a unique identifier (FTId), which gives access to the variant web pages. These pages display a synopsis of relevant information for a given variant, including references, sequence context, as well as residue conservation throughout evolution and structural data, when available (for an example click here). Mutations that cause major changes to a protein sequence (as is the case for most frameshift mutations) are not and will not be considered to be relevant to UniProtKB/Swiss-Prot, as their deleterious effect on a given protein function is usually obvious.

Finally, our medical annotation effort also consists of the creation of keywords to allow easy retrieval of proteins involved in complex disorders and genetically heterogeneous diseases. The top 10 UniProtKB/Swiss-Prot keywords describing a disease are: deafness (105 entries), obesity (57 entries), retinitis pigmentosa (40 entries), diabetes mellitus (39 entries), cardiomyopathy (36 entries), cataract (34 entries), epilepsy (33 entries), dwarfism (32 entries), albinism (25 entries) and Charcot-Marie-Tooth disease (18 entries). Currently about 100 "medical" keywords have been created and the list is growing.

Release 50.8 of 03-Oct-2006

Rice harvest 2006: over 1'000 rice proteins annotated

Rice (Oryza sativa) is the most important food crop in the world and part of the daily diet of over half of the human population. It is grown in 114 countries worldwide and provides 50-80% of the calorie consumption in a number of Southeast Asian countries (see world rice statistics).

In the current release, over 1'000 rice entries have been completed in UniProtKB/Swiss-Prot. How?

Following the completion of the first genome sequence of the model plant Arabidopsis thaliana, in 2001 the Swiss-Prot group initiated the Plant Proteome Annotation Program, which focuses on the annotation of plant-specific proteins and protein families. Our major effort was directed towards Arabidopsis, but the completion of the Oryza sativa (cultivar Nipponbare) genome sequence by the IRGSP prompted us to broaden our focus.

Each manually annotated rice entry already contains the TIGR locus identifiers - which map each protein to the corresponding gene in the rice genome - and will soon also include RAP loci. Amongst the numerous cross-references in rice entries is the link to Gramene which gives access to comparative grass genomics. We also plan to link our entries to RAP-DB in the near future, which will provide links to genomic data and genome annotation.

We are currently concentrating on the annotation of well-characterized proteins for which experimental data are available. The function of a number of rice proteins reflects physiological trait adaptation and grain property evolution owing to centuries of selection by farmers (over 100'000 rice varieties exist throughout the world).

As an example, large areas of Southeast Asia are flooded during the monsoon season. Deepwater rice copes with this by way of rapid internode elongation (up to 25 cm/day), and expansin A4 contributes by causing the cell walls to slacken and expand.

What is more, a primary factor that decreases rice crop yield is coastal salinity and the accumulation of salts in irrigated land. Pokkali, an indica variety of lowland rice, is classified as highly tolerant, because it contains a specific potassium-sodium cotransporter (HKT2), which mediates increased potassium uptake with external sodium accumulation.

Finally, the grain texture of cooked rice is essential in various food cultures. A generic classification exists between long grain, medium grain and short grain rice, where the first is separate and fluffy and the last more moist, sticky and tender. The proportion of long chain amylopectin is correlated with firmer cooked rice. A starch synthase (SSII-3), which synthesizes long chain amylopectin, is barely active in the sticky cultivar japonica Nipponbare. However, a variation of 4 amino acids leads to an increased activity in firmer indica varieties.

All rice proteins annotated in UniProtKB/Swiss-Prot are classified by chromosome locus (Ordered locus name starting with "Os") and listed in the file rice.txt. In the future, we plan to manually annotate every rice gene family and to develop semi-automated annotation tools to complete rice proteome annotation.

Release 50.7 of 19-Sep-2006

In search of the origin of HIV-1: the 'missing link' revealed

The origin of Human immunodeficiency virus 1 (HIV-1) has been the subject of hot debate for more than twenty years. In 1999, American, Japanese and French researchers claimed to have discovered an indisputable link between a chimpanzee virus from central West Africa called SIVcpz (Simian Immunodeficiency Virus from chimpanzees) and HIV-1. SIVcpz is 70-90% identical to HIV-1 and does not appear to cause illness in chimpanzees.

However, since SIVcpz was only found in a few chimpanzees held in captivity, the possibility existed that another yet unidentified species could be the natural reservoir of both HIV-1 and SIVcpz.

A recent study (Science 313, 523-526 (2006)) provides for the first time a clear picture of the origin of HIV-1 and the seeds of the AIDS pandemic. New strains of SIVcpz have been identified in wild chimpanzees from Cameroon. These new strains are more closely related to human HIV-1 than to any Simian viruses.

There are three HIV-1 lineages: M (Major), O (Outlier) and N (New). The new SIVcpz isolate MB66 turned out to be more closely related to HIV-1 group M than to any Simian virus (see a similarity search for SIVcpz MB66 gag-pol protein). Moreover, another wild virus, SIVcpz isolate EK505, is very closely related to HIV-1 group N. This suggests that at least two independent SIVcpz transfers from chimpanzee to man occurred in this region. HIV-1 group M presumably crossed species early in the 20th century. HIV-1 group N may have infected humans more recently.

The authors of the study also postulate that "given the extensive genetic diversity and phylogeographical clustering of SIVcpz now recognised and the vast areas of west central Africa not yet sampled, it is quite possible that still other SIVcpz lineages exist that could pose risks for human infection and prove problematic for HIV diagnostics and vaccines."

Proteins from SIVcpz isolates MB66 and EK505 are fully annotated and available from UniProtKB/Swiss-Prot.

Release 50.6 of 05-Sep-2006

A thing of beauty is a joy forever (*)

3D-structure information is now available for over 10'000 proteins in UniProtKB/Swiss-Prot.

Protein structures not only delight the eye, they shed light on protein architecture and provide proof for the existence of a given protein fold. They are indispensable for determining the interactions of a protein with its ligands (substrates, ions, cofactors or regulatory molecules) and provide solid proof for post-translational modifications. Likewise, 3D-structures pinpoint the exact position of residues that cause a genetic disease when mutated (example: Q8NBK3). They help to design experiments and make it possible to attribute a function to so-far hypothetical proteins (Q46856).

UniProtKB aims to be fully synchronized with PDB and provide access to information about protein 3D-structures via cross-references to PDB, and by giving high priority to the annotation of proteins with known 3D-structures. A semi-automated mapping procedure was established in collaboration with the Macromolecular Structure Database (MSD), so that the whole PDB archive could be mapped to UniProtKB.

3D-structures are now available for 10'006 entries in UniProtKB/Swiss-Prot, corresponding to 36'671 individual cross-references to PDB. These entries can be retrieved by a search with the keyword '3D-structure'.

(*) From John Keats' epic poem, Endymion, 1818

Release 50.5 of 22-Aug-2006

10'000 species in UniProtKB/Swiss-Prot!

We now have 10'000 different species represented in UniProtKB/Swiss-Prot for which protein entries are stored in the knowledgebase. Ten times more species are stored in UniProtKB/TrEMBL. Each species present in UniProtKB/Swiss-Prot is curated: the curation consists of verifying the validity of the scientific name, the consistency of the lineage and the existence of a common name and/or synonym. You think the taxonomy is indigestible? Have a look at the following recipe ;-)

Pizza recipe

Pizza is not a new program, it is really a delicious and tasteful recipe!

Pizza crust:    Toppings:    Homemade tomato sauce:
(*) Lactobacillus helveticus is used for the manufacture of these 2 cheeses

Add fresh Saccharomyces cerevisiae to the water and stir until dissolved. Add Beta vulgaris sugar, Olea europaea oil, salt and Triticum aestivum powder. On lightly floured board, knead dough until smooth and elastic. Place in a bowl and let rise in a warm place until volume has doubled.

Heat Olea europaea oil in a wide frying pan over medium heat; add Allium cepa and cook for about 10 minutes until softened, stirring often. Turn the heat up to high and add Allium sativum, herbs (Ocimum basilicum, Origanum vulgare and Petroselinum crispum) and Lycopersicon esculentum paste. Add Capsicum annuum powder and season to taste with salt.

Let simmer for at least 30 minutes.

Roll dough into a large circle, place on greased baking sheet, press around edges to form 2 cm rim. Cover with homemade tomato sauce. Layer toppings on dough in order listed. Bake at 240°C for 13 minutes until nicely coloured. You can top the pizza with a few leaves of Diplotaxis tenuifolia (it tastes hotter than Eruca sativa).

You uncovered 18 species in our recipe, but 9'982 other species are now in UniProtKB/Swiss-Prot.

ENJOY :)

Release 50.4 of 25-Jul-2006

Happy anniversary, Swiss-Prot!

On July 21st 1986, the first Swiss-Prot release was created. It contained close to 4'000 protein sequence entries and was produced by a single graduate student, Amos Bairoch, at the University of Geneva. In 1996, while Swiss-Prot was rapidly growing (60'000 entries) and was used worldwide, the granting agencies could not find a solution to finance it. Without the support of thousands of users, Swiss-Prot would not be celebrating its 20th anniversary today! This financial crisis was solved by the creation of the Swiss Institute of Bioinformatics (SIB), and additional resources were provided by license fees paid by commercial users, Swiss-Prot remaining freely accessible to the academic community.

The first Swiss-Prot annotators used to annotate protein sequences concomitant with the submission of the nucleotide coding sequences to the EMBL database. However, the increase in submissions made it impossible to keep pace. In collaboration with the European Bioinformatics Institute (EBI), a solution was found in 1996 with the creation of TrEMBL, a computer-annotated supplement to Swiss-Prot, which contained roughly 60'000 entries in its first release.

In 2006, a staff of 60 annotators at the SIB and the EBI, supported by a dedicated programming team, is maintaining Swiss-Prot. Close to 250'000 entries are currently in the knowledgebase. Interestingly, 10 years were necessary to reach the first 50'000 protein sequence entries, while 50'000 proteins can now be manually annotated in about 18 months. In parallel, TrEMBL's exponential growth has resulted in a database containing close to 3 million entries.

Since 2002, both databases have been at the heart of the UniProt project and together they constitute the UniProt Knowledgebase (UniProtKB), one of 3 UniProt components. UniProt is produced by a collaboration between 3 institutes, SIB, EBI and PIR (Protein Information Resource). This single, centralized, authoritative resource for protein sequences and functional information aims to make protein data available, to facilitate their retrieval and to provide new tools to help in their analysis. Since Swiss-Prot became UniProtKB/Swiss-Prot, access to the knowledgebase is again free for commercial users. Currently 160 persons are involved in the UniProt services to the scientific community.

The means have changed, but the 20-year-old key idea of a graduate student to share knowledge is still vivid, more than ever.

Release 50.3 of 11-Jul-2006

Of mice and men: more than 10'000 orthologous sequence pairs in UniProtKB/Swiss-Prot

Comparisons of orthologous proteins between mammalian species contribute greatly to understanding the biological basis underlying disease susceptibility or responsiveness to drugs, or simply to understanding what makes us human and not simply another great ape.

Human protein sequences and those of all available mammalian orthologous sequences are annotated and compared in the frame of the UniProtKB/Swiss-Prot HPI annotation program (Human Proteomics Initiative). During the annotation process, sequence length, alternative splicing isoforms or even polymorphisms can be validated. In order to provide our users with a coherent view of mammalian proteomes, similar isoforms are shown for orthologous proteins from all mammalian species whenever possible.

The laboratory mouse is a widely used model organism and thus many murine sequences are available for annotation. It is currently the most highly represented non-human mammal with more than 11'000 entries, and 91% of these entries are orthologous to human proteins. Human-mouse orthologous pairs share 85% identity on average. About 36% of these pairs have identical sequence length and share 94% identity. The most highly conserved proteins are involved in core biological processes such as mRNA processing and transport, translation and ubiquitin-dependent protein degradation. In contrast, fast evolving proteins generally play roles in immunity, reproduction and signal transduction.

The percentage of identity between orthologous protein pairs in the most highly represented mammals in UniProtKB/Swiss-Prot is shown in the table below:

            Orangutan   Bovine   Mouse     Rat
Human           97.43    87.37   85.46   85.80
Orangutan                89.34   87.44   87.20
Bovine                           83.99   84.80
Mouse                                    93.48

UniProtKB/Swiss-Prot entries for orthologous proteins usually share the same protein mnemonic code in the ID line and thus can be easily identified.
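
As a rough illustration of how the shared mnemonic can be used, the sketch below groups entry names by the part before the underscore. The entry names are examples that appear elsewhere in these headlines; the helper name `group_by_mnemonic` is hypothetical, and the sketch assumes that every name has the form MNEMONIC_SPECIES.

```python
from collections import defaultdict


def group_by_mnemonic(entry_names):
    """Group entry names such as 'ACTA_HUMAN' by their protein mnemonic."""
    groups = defaultdict(list)
    for name in entry_names:
        # Assumed layout: MNEMONIC_SPECIES, split on the first underscore.
        mnemonic, _, species = name.partition("_")
        groups[mnemonic].append(species)
    return dict(groups)


names = ["ACTA_HUMAN", "ACTA_MOUSE", "ACTA_RAT", "ACTA_BOVIN", "NRAM_IAHO3"]
groups = group_by_mnemonic(names)
print(groups["ACTA"])
```

Entries that share a mnemonic are candidate orthologs only; the annotation in each entry remains the reference.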

Release 50.2 of 27-June-2006

Looking for Titin

"I am looking for Titine" Charlie Chaplin sang in Modern Times. While for many people Titin brings back memories about this song, for the scientific community the meaning is completely different. Titin is a giant sarcomeric protein of roughly 35'000 aa. Protein analysis programs used to crash when encountering huge proteins, and the size limit of a protein to be integrated into UniProtKB/Swiss-Prot used to be under 10'000 aa long. Modern times finally arrived and bioinformatics has improved by leaps and bounds. Programs are now able to deal with huge proteins and titin has finally been integrated into UniProtKB/Swiss-Prot.

Titin is a long (up to 1 micron), slender and flexible strand, frequently with a large globule at one end. It has a complex modular structure that varies depending on the splicing events. In its longest form it may contain up to 132 fibronectin type-III domains, 152 Ig-like domains, 9 Kelch, 17 RCC1, 14 TPR, 15 WD and 31 PEVK repeats and 1 protein kinase domain. Titin functions as a mechanical sensor through its interaction with many other proteins, such as myomesins, tropomyosins, myosins, actins, myopalladin, etc. By providing connections at the level of individual microfilaments, it contributes to the fine balance of forces between the two halves of the sarcomere and thus to muscle extensibility. In non-muscle cells, it seems to play a role in chromosome condensation and segregation during mitosis.

Needless to say, the titin-seeking of Charlie Chaplin was a legitimate demand, because all human beings need titin in their life.

Release 50.1 of 13-June-2006

Man gave names to all the... proteins

We have spent many years curating all kinds of proteins from all kinds of species. One recurring challenge is to offer an easily searchable and consistent knowledgebase dealing, in particular, with many ambiguities and discrepancies regarding protein names. Nomenclature is not only indispensable for communication, but also for literature search and entry retrieval. We feel that our experience in this field can be valuable, and that we can play a role in helping the standardization of protein nomenclature.

To take up this challenge, we created a new document which describes guidelines used by UniProtKB/Swiss-Prot annotators to give each entry the most appropriate name, called the "Recommended name" (RN). In short, an RN should follow the approved nomenclature, if it exists, and should be unique and attributed to all orthologs. Other rules deal mostly with the syntax of submitted protein names in order to have consistent and reproducible RNs in spite of the variability observed in various submissions. If our RN differs from the submitted one, the latter is kept as "alternative name". In this way we enhance the searchability, as well as the consistency, of our database.

We sincerely hope that researchers will adhere as much as possible to these guidelines for naming new proteins when publishing or submitting their data. This will make their results easily searchable, allow tracking of a given protein across related organisms and help us in our continuing effort to standardize nomenclature.

Release 50.0 of 30-May-2006

New major release is available (50.0)

Release 50.0 of 30-May-2006 of UniProtKB/Swiss-Prot contains 222'289 sequence entries, comprising 81'585'146 amino acids abstracted from 142'438 references.

15'220 sequences have been added since release 49.0, the sequence data of 953 existing entries has been updated and the annotations of 190'604 entries have been revised. This represents an increase of 8%.

Many improvements were carried out in the last 3 months.

All the recent changes to the UniProt Knowledgebase format are described in detail in the continuously updated document:

http://www.expasy.org/sprot/relnotes/sp_news.html

UniProt Knowledgebase release 8.0 includes Swiss-Prot release 50.0 and TrEMBL release 33.0. For more information you can also read the release notes for the UniProt Knowledgebase, i.e. Swiss-Prot and TrEMBL.

Release 49.7 of 16-May-2006

Venomous animals, their toxins and UniProtKB/Swiss-Prot

In order to provide the scientific community with a summary of the current knowledge on animal protein toxins, the Swiss-Prot group initiated the Tox-Prot annotation program. The aim of this program is the annotation of all toxin proteins produced by venomous animals, such as snakes, scorpions, spiders, jellyfish, insects, cone snails, sea anemones, lizards, some fish, and platypus.

Toxins are small (usually less than 100 amino acids) and extremely stable. They undergo numerous post-translational modifications and they have very specific targets. The 3D-structure of about 15% of the known toxins has been unravelled, which provides clues to the understanding of their specificity. Many toxins, such as those synthesized by cone snails (conotoxins), can be used as drugs and some are currently being tested in clinical trials.

At the level of annotation and ultimately for the sequence retrieval by our users, the lack of a systematic nomenclature represents a real problem. This issue is currently being addressed by the scientific community and an official nomenclature has been developed for potassium channel scorpion toxins: Tytgat et al. (1999) and Rodriguez de la Vega and Possani (2004). With the help of Dr. Ricardo C. Rodriguez de la Vega and Prof. Lourival D. Possani, we have created a document which provides links between the official nomenclature and the associated UniProtKB/Swiss-Prot entries.

Finally, we would like to draw your attention to the fact that many toxin sequences are not submitted to any database and have to be manually retrieved by Swiss-Prot annotators. This step slows down the annotation process itself and may be error-prone. We would thus like to encourage researchers to share their data by submitting them to public databases: EMBL/GenBank/DDBJ for nucleic acid sequences and UniProtKB/Swiss-Prot for protein sequences. Experimental data can also be directly submitted by email. For any comments or suggestions concerning the Tox-Prot annotation program, don't hesitate to contact us.

Release 49.6 of 02-May-2006

5'000 rat entries in UniProtKB/Swiss-Prot

While in imperial China the rat was associated with creativity, honesty and generosity, western culture tends to see rats as vicious, unclean, parasitic animals that steal food and spread disease. Whatever your feelings towards this small rodent are, for modern biologists rats have proved to be a good animal model for many human diseases, such as diabetes, arthritis and cardiovascular diseases. Despite the rat's importance for medical research, and although its genome sequence was published in 2004, the amount of sequence data available is still much smaller than for the other two best studied mammals, namely human and mouse. The number of rat ESTs at the NCBI is only 11% of that of human and 18% of that of mouse. Not many high-throughput cDNA sequencing projects have been initiated and, in the NIH Mammalian Gene Collection, rat sequences represent less than one quarter of human ones. As a result, this trend is also observed in UniProtKB/Swiss-Prot, which is highly dependent upon submissions of sequence data to the public DNA sequence databases EMBL/GenBank/DDBJ. In UniProtKB/Swiss-Prot, rat is the third best represented mammal, after human and mouse. With more than 5'000 entries, it is still underrepresented compared to human and mouse. However, new rat entries are continuously integrated in order to represent all mammalian orthologs of human proteins. Our final aim is to provide our users with a complete set of rat proteins.

Release 49.5 of 18-Apr-2006

ComX, a pheromone involved in a major quorum sensing system in bacilli, is post-translationally modified by strain-specific prenylation

In order to acquire genetic competence once the cell density has reached a critical threshold, bacteria have developed a sophisticated quorum-sensing system. This system proceeds through the release of a pheromone, comX, that activates a two-component system, which eventually propagates the signal into the cell. Both components of this system have been identified: they are the sensor histidine kinase, comP, and the response regulator, the transcription factor comA.

The crucial pheromone ComX is produced as an inactive precursor which is activated by 2 post-translational modifications (PTM): the prenylation of a conserved tryptophan residue and a proteolytic cleavage. The protein comQ is thought to catalyze the maturation of comX. comX and comQ sequences show striking variability among different strains, as do the prenyl derivatives. The mass of the prenyl groups linked to comX has been determined by mass spectrometry. Surprisingly, three different masses were observed: 120Da, 136Da and 205Da, depending on the strain studied. The 136 and 205Da forms are thought to consist of farnesyl and geranyl groups, respectively. The structure of the 205Da prenyl group in strain W23 / RO-E-2 was recently obtained (Q8VL79). It consists of a cyclic tryptophan bound to a geranyl group. Interestingly enough, the nature of the prenyl group was shown to depend on the comX sequence itself rather than on the origin of the modifying enzyme comQ.

To our knowledge, this is the first report of a post-translational prenylation catalyzed by a bacterial enzyme, despite the universal availability of the necessary isoprenoid substrates and the existence of various other lipid-modified proteins in this kingdom. The exact function of comX prenylation is not known, but, by analogy with the situation in eukaryotes, it may provide anchoring to membrane structures.

Of note, this PTM, like all other PTMs in UniProtKB/Swiss-Prot, is annotated using controlled vocabulary.

Release 49.4 of 04-Apr-2006

A fly's eye view of UniProt - the kaleidoscope that is the proteome

The keyword Complete proteome is now added to UniProtKB Drosophila melanogaster entries. This is the second metazoan to have the keyword added, the other being Caenorhabditis elegans. Eleven other eukaryotes have the keyword: ten complete fungal genomes and Plasmodium yoelii yoelii. The presence of this keyword allows easy retrieval of a complete non-redundant set of proteins from the Drosophila melanogaster genome (nuclear and mitochondrial) across the Swiss-Prot and TrEMBL sections of the UniProt Knowledgebase. To add the keyword, all fruit fly UniProtKB/Swiss-Prot entries have been updated to include the genome project reference (Adams et al., 2000, Science 287:2185-2195), along with other relevant updates as appropriate.

UniProtKB Release 7.4 has 2361 Drosophila melanogaster entries in UniProtKB/Swiss-Prot and 25453 entries in UniProtKB/TrEMBL. Addition of the keyword 'Complete proteome' allows the retrieval of the complete non-redundant proteome consisting of 16229 entries, 2329 from UniProtKB/Swiss-Prot and 13900 from UniProtKB/TrEMBL. The proteome can be downloaded from ftp://ftp.expasy.org/databases/complete_proteomes/entries/eukaryota/DROME.dat or from Integr8.

Release 49.3 of 21-Mar-2006

Annotation of Chikungunya virus in UniProtKB/Swiss-Prot

The Chikungunya virus has caused a severe outbreak on the French island of Réunion and also in Mauritius, Seychelles, Mayotte and Madagascar, all located off the southeast coast of Africa. The virus is not deadly, but causes severe fever, rash, arthritis and joint pain. These symptoms are at the origin of the name Chikungunya, which in Swahili means "that which bends up".

The virus belongs to the large family of Togaviridae, genus Alphavirus. This family includes exotic viruses like O'nyong nyong and Igba Oro, which are very closely related to Chikungunya and actually induce the same disease in humans. Interestingly, the name O'nyong nyong comes from the Nilotic language of Uganda and Sudan and means "weakening of the joints".

Chikungunya is transmitted by mosquitoes, in which it infects the salivary glands, but the natural host reservoir consists of different types of monkeys.

The molecular strategy used by Alphaviruses to replicate and hijack cellular defense is very surprising:

After virus entry into the target cell, the mRNA(+) genome is translated into a nonstructural polyprotein, which starts discreetly to replicate the genome in the cytoplasm. After this early phase, in which the virus avoids cellular defense by restraining its activity, the nonstructural polyprotein is processed into four proteins. The virus then goes at full strength, replicating large amounts of its genome and, innocently, also transcribing a subgenomic 26S RNA. This replication has a drawback: it creates dsRNA by genome and antigenome hybridization.

No eukaryotic cell can accept such an offence: dsRNA is a signature of viral infection. The host cell reacts violently: human PKR is strongly activated by the dsRNA, resulting in a complete shutoff of cellular translation through inactivation of the early translation initiation factor EIF2A.

But this powerful cellular defense was expected by the virus. The 26S mRNA possesses a unique feature in biology: an enhancer element which allows the mRNA to be translated independently of EIF2A. This 26S RNA codes for the structural proteins, which are now the only proteins synthesized by the cell! These structural proteins form new virions which bud from the doomed cell to find new targets.

The following are examples of new Alphavirus entries:

Chikungunya virus (strain S27-African prototype): Nonstructural polyprotein: Q8JUX6, Structural polyprotein: Q8JUX5
O'nyong-nyong virus (strain Gulu): Nonstructural polyprotein: P13886, Structural polyprotein: P22056
Togavirus prototype - Semliki forest virus: Nonstructural polyprotein: P08411, Structural polyprotein: P03315

Release 49.2 of 07-Mar-2006

Human coagulation factor IX: the most frequently updated entry in UniProtKB/Swiss-Prot

The most obvious way to quantify the work done by UniProtKB/Swiss-Prot is to count the increase in new entries. However, the integration of new entries is only part of the annotation work. Providing the scientific community with high quality data also - and maybe mostly - involves time-consuming updates of older entries. Thanks to the introduction of sequence and annotation version numbers and to the creation of the UniProtKB Sequence/Annotation Version Database, it is now possible to know when and how a UniProtKB/Swiss-Prot entry has been updated.

Sequences shown at the bottom of each entry are relatively stable. In 83% of entries, the sequence has not been updated since its integration into the knowledgebase; in 15% of entries, it has been updated once. So far, the maximum number of sequence updates is 6, observed in only 8 entries. By contrast, the annotation has to be constantly reviewed.

Currently the average UniProtKB/Swiss-Prot entry has been updated more than 50 times. The most frequently updated entry is human coagulation factor IX, which has been reviewed 103 times while its sequence has been updated only once. The human coagulation factor IX entry was created in the first UniProtKB/Swiss-Prot release in July 1986. Since then, over 50 references have been added, mostly dealing with polymorphisms and disease-causing mutations. This is also reflected at the level of the feature table, where the number of described variants has risen from 1 to 145, most of them associated with hemophilia. While the presence of gamma-carboxyglutamate was already well-established 20 years ago and the sites of N-glycosylation suspected, other post-translational modifications, such as sites of O-glycosylation, phosphorylation and sulfation, were described, and thus annotated, later. Information on secondary structure was added to the entry in 1994, as well as the first link to the 3D structure submitted to PDB. Today 7 cross-references to PDB are provided.

Science is going forward and we, at Swiss-Prot, are doing our best to keep pace. Nevertheless, we need the user community to help us in this task. All our entries are equipped with a "Submit update" button and we greatly encourage you to use it every time your favourite protein is not up-to-date in UniProtKB/Swiss-Prot, or if it is not yet integrated.

Release 49.1 of 21-Feb-2006

Over 25'000 protein polymorphisms annotated in UniProtKB/Swiss-Prot

UniProtKB/Swiss-Prot pays particular attention to the annotation of protein polymorphisms as well as disease mutations, due to their importance for the understanding of genetic diseases. Information on genetic variations and diseases is annotated in the comment section under the topic CC DISEASE and in the feature table using the key FT VARIANT. Literature reports used for data extraction are also cited in the entry, and data are manually checked prior to integration into the knowledgebase. Links to OMIM are provided whenever possible. As UniProtKB/Swiss-Prot is a 'proteocentric' resource, we do not annotate frameshifts or nonsense mutations as their deleterious effect on the protein is usually obvious. We therefore concentrate on tracking and storing data on amino acid substitutions, small deletions or insertions.

We have currently reached a total of 25'255 variants in 4'196 human sequences: 98% of the variants are single amino acid polymorphisms (SAP). Association of a variant with a disease is annotated according to literature reports. In the current release, 13'581 SAPs are disease-associated, 9'451 are neutral polymorphisms and 1'816 are unclassified.
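
The figures above can be cross-checked with a few lines of arithmetic. This is only a sanity check on the numbers quoted in this headline, assuming that the three SAP categories are exhaustive and non-overlapping; the variable names are illustrative.

```python
# Counts quoted in the paragraph above.
total_variants = 25_255
sap_counts = {
    "disease-associated": 13_581,
    "neutral polymorphism": 9_451,
    "unclassified": 1_816,
}

# Assumption: the three categories together account for all SAPs.
total_saps = sum(sap_counts.values())
sap_share = 100 * total_saps / total_variants

print(f"{total_saps} SAPs = {sap_share:.1f}% of {total_variants} variants")
```

Under that assumption the three categories add up to 24'848 SAPs, which is consistent with the quoted share of 98%.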

Release 49.0 of 07-Feb-2006

New major release is available (49.0)
Release 49.0 of 07-Feb-2006 of UniProtKB/Swiss-Prot contains 207'132 sequence entries, comprising 75'438'310 amino acids abstracted from 139'151 references. 12'815 sequences have been added since release 48, the sequence data of 991 existing entries has been updated and the annotations of all entries have been revised. This represents an increase of 7%.

Many improvements were carried out in the last 5 months. In particular, we have changed from showing only the dates corresponding to full UniProtKB releases in the DT lines to displaying the date of the biweekly release at which an entry is integrated or updated. We dropped the information concerning the release number and introduced entry and sequence version numbers in the DT lines.

Cross-references to several databases have been added, and we have changed our copyright statement.

All the recent changes to the UniProt Knowledgebase format are described in detail in the continuously updated document:

http://www.expasy.org/sprot/relnotes/sp_news.html

UniProt Knowledgebase release 7.0 includes Swiss-Prot release 49.0 and TrEMBL release 32.0. For more information you can also read the release notes for the UniProt Knowledgebase, i.e. Swiss-Prot and TrEMBL.

Release 48.9 of 24-Jan-2006

Mammalian, Xenopus and Zebrafish Gene Collections: a goldmine for high-quality sequences
High-quality nucleotide sequences derived from high-throughput sequencing projects, such as those generated by the NIH Gene Collection (GC) initiatives, are extremely valuable for a protein sequence database like UniProtKB/Swiss-Prot. More than 99.98% of the UniProtKB/Swiss-Prot sequences are generated by translation of nucleotide sequences rather than by direct protein sequencing. In this context, high-quality nucleotide sequences provide a rapid and easy way to control the accuracy of the sequences. Differences between sequences may point to the existence of polymorphisms, and many alternative splicing isoforms have been introduced thanks to these projects.

Launched in 1999, the Mammalian Gene Collection (MGC) is an NIH multi-institutional initiative. Its goal is to identify and sequence cDNA clones containing a full-length open reading frame. Initially aimed at human and mouse sequences, it was further expanded to rat and bovine clones. Two additional projects enriched the first initiative; these deal with Xenopus (XGC) and Zebrafish (ZGC). The sequences obtained by these projects are submitted to the EMBL/GenBank/DDBJ databases, and the submitted CDS are translated and automatically integrated into UniProtKB/TrEMBL. The UniProtKB/TrEMBL entries can then be manually annotated and integrated into UniProtKB/Swiss-Prot. Following the principle of non-redundancy, sequences derived from the same gene in the same species are merged into one UniProtKB/Swiss-Prot entry. This is reflected at the level of cross-references. For instance, currently, the average number of distinct nucleotide sequence cross-references per human entry is close to 5. This implies that each human sequence has been confirmed, on average, by 5 independent submitted sequences, and thus the accuracy of the sequences shown in UniProtKB/Swiss-Prot entries is quite high.

Currently, close to 16'000 UniProtKB/Swiss-Prot entries contain data from GC submissions. Considering the various species involved, it means that MGC data are found in more than 60% of the human entries, more than 50% of mouse entries, 25% of rat entries, but only 2% of bovine entries. ZGC data can be found in close to 55% of zebrafish entries and XGC in close to 20% of Xenopus laevis entries and 85% of Xenopus tropicalis entries.

Release 48.8 of 10-Jan-2006

Major update of influenza A viruses: H5N1 pathogenicity
Influenza A viruses are named depending on their surface protein subtype, H for hemagglutinin and N for neuraminidase. There are 16 known H subtypes and 9 known N subtypes for influenza A virus. All of them infect birds; a few, such as H1N1, H1N2 and H3N2, can infect humans.

'Avian influenza' is used to name viruses commonly restricted to birds, such as the H5, H7 and H9 subtypes. Most avian influenza subtypes cause very mild diseases, but the H5 and H7 subtypes can cause outbreaks involving massive deaths in domestic poultry. During these outbreaks, sporadic transmission to humans has been reported. Fortunately humans are dead end hosts for these viruses, i.e. infected humans do not transmit the virus. Although a few human cases of H7N7 and H9N2 have been documented, the major threat remains the H5N1 subtype.

H5N1 is not a new virus: it was isolated from birds in Scotland back in 1959 (hemagglutinin: P09345). It became famous after the first big outbreak in 1997 in Hong Kong, where 1.5 million poultry were affected and destroyed, and 18 human cases occurred, six of whom died (hemagglutinin: O56140). This was the first time that direct transmission of an avian influenza A virus from birds to humans had been found.

In 2003 two cases of H5N1 occurred in Hong Kong, one fatal. How or where these two family members were infected was not determined. In 2004 and 2005, severe outbreaks happened in Thailand, Vietnam, Cambodia and Indonesia, for a total of 130 human cases, 70 of whom died. Most of these cases occurred as a result of people having direct or close contact with infected poultry; however, a few cases of human-to-human spread of H5N1 have occurred.

Why is H5N1 so deadly in poultry and humans? Presumably because of small sequence variations in hemagglutinin.

Hemagglutinin is present at the virion surface, and its function is both to bind the cellular receptor and to induce fusion of the viral and target cell membranes. In order to be able to promote fusion, the protein must be cleaved. In common influenza A viruses, the cleavage site is specific to proteases present in the respiratory tract. Hence influenza infection is restricted to this organ.

H5 and H7 have a completely different cleavage site, rich in arginine and lysine residues (RRRKKR in Hong Kong 1997: O56140), which can be processed by ubiquitous proteases: furins. This results in an infection of almost all host organs, and an acute pathology which can be quickly fatal.
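
To make the idea of an arginine- and lysine-rich cleavage site concrete, the sketch below scans a sequence for runs of consecutive R/K residues. It is a toy illustration, not a validated pathogenicity predictor: the function name `basic_runs`, the minimum run length of 4 and the flanking residues in the example fragments are assumptions made for this example; only the RRRKKR motif comes from the text above.

```python
import re


def basic_runs(sequence, min_length=4):
    """Return (start, run) for each stretch of consecutive R/K residues.

    min_length is an arbitrary illustrative threshold, not a published cutoff.
    """
    pattern = re.compile(r"[RK]{%d,}" % min_length)
    return [(m.start(), m.group()) for m in pattern.finditer(sequence)]


# Illustrative fragments: flanking residues are made up for the example.
polybasic_fragment = "PQRERRRKKRGLF"
control_fragment = "PQRETRGLF"

print(basic_runs(polybasic_fragment))
print(basic_runs(control_fragment))
```

Positions are 0-based offsets into the fragment passed in, not positions in the full-length hemagglutinin sequence.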

Few antiviral drugs are effective against influenza. Zanamivir (Relenza) and oseltamivir (Tamiflu) are inhibitors of the neuraminidase (e.g. Q9W7Y7); amantadine and rimantadine are inhibitors of the ion channel M2 protein (e.g. O70632). Unfortunately drug resistance evolves rapidly, and a case of H5N1 resistant to Tamiflu has already been reported (Nature 437:1108-1108(2005)).

The following are examples of updated influenza entries:

H5N1 isolated from human, in Hong Kong 1997:
Hemagglutinin: HEMA_IAHO3 (O56140)
Neuraminidase: NRAM_IAHO3 (Q9W7Y7)
M1: M1_IAHO3 (Q77Y95)
M2: M2_IAHO3 (O70632)
NS1: NS1_IAHO3 (O56264)
NEP: NEP_IAHO3 (O56263)
PB1-F2: PB1F2_IAHO3 (P0C0U0)
Nucleoprotein: NCAP_IAHO3 (O92784)
PA: PA_IAHO3 (O89752)
PB1: RDRP_IAHO3 (Q9WLS3)
PB2: PB2_IAHO3 (O56266)

H5N1 isolated from chicken in 1959:
Hemagglutinin: HEMA_IACKS (P09345)

Release 48.7 of 20-Dec-2005

"De-merge" of multi-species in UniProtKB/Swiss-Prot

UniProtKB/Swiss-Prot, as a non-redundant protein database, used to "merge" entries originating from different species if they were 100% conserved. In merged entries, the source organisms were listed in the OS (Organism Species) lines, e.g. actin, P03996 (ACTA_HUMAN):

OS   Homo sapiens (Human), Mus musculus (Mouse), Rattus norvegicus (Rat),
OS   Bos taurus (Bovine), and Oryctolagus cuniculus (Rabbit).
However, the OC (Organism Classification) lines only contained the taxonomy of the first listed species, and the "species part" of the entry name was built on the first organism in the list ("_HUMAN").
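
The OS line layout quoted above can be split into individual organisms with a short parser. This is a minimal sketch that assumes the layout of the example only: a two-letter line code, binomial scientific names and one common name in parentheses per organism. The function name `parse_os_lines` is hypothetical, and real entries may contain forms that this pattern does not cover.

```python
import re


def parse_os_lines(lines):
    """Return (scientific name, common name) pairs from the OS lines of an entry."""
    # Drop the 'OS' line code, join continuation lines, remove the final period.
    text = " ".join(line[2:].strip() for line in lines if line.startswith("OS"))
    text = text.rstrip(".")
    # Assumed form of each organism: 'Genus species (Common name)'.
    return re.findall(r"([A-Z][a-z]+ [a-z]+) \(([^)]+)\)", text)


os_lines = [
    "OS   Homo sapiens (Human), Mus musculus (Mouse), Rattus norvegicus (Rat),",
    "OS   Bos taurus (Bovine), and Oryctolagus cuniculus (Rabbit).",
]
organisms = parse_os_lines(os_lines)
print(organisms[0], len(organisms))
```

The first pair returned corresponds to the organism on which the "species part" of the merged entry name was built.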

As the type of information on proteins has greatly evolved, and more and more species-specific data have been documented, Swiss-Prot had to adapt and change its merging policy. While it may seem to contradict the principle of non-redundancy on the sequence level to create two or more entries for an identical sequence, this does make sense from the annotation point of view. The new policy makes it possible to clarify which information item has been proven for which organism. Even if a protein has the same sequence in two or more different organisms, there may be evidence for different post-translational modifications, sequence variants, alternative splicing, protein-protein interactions, tissue specificity, and implication in diseases. Moreover, since some organism-specific scientific communities use different gene name nomenclatures, it is important to reflect such species-specific nomenclature usage.

With this release, we have completed the de-merging of all the UniProtKB/Swiss-Prot entries (almost 6'000) that contained information relative to two or more distinct species.

The primary accession number of a formerly merged entry has been retained as a secondary accession number in all of the resulting de-merged entries. A new primary accession number has been attributed to all de-merged entries.

In the example above: ACTA_HUMAN (old primary AC: P03996, old secondary AC: P04108) has been de-merged into:

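----------  --------------  --------------------
Entry name  New primary AC  Secondary ACs
----------  --------------  --------------------
ACTA_BOVIN  P62739          P03996 P04108 Q862W5
ACTA_HUMAN  P62736          P03996 P04108
ACTA_MOUSE  P62737          P03996 P04108
ACTA_RABIT  P62740          P03996 P04108
ACTA_RAT    P62738          P03996 P04108 P70476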
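
Because the old primary accession number is retained as a secondary accession number in every de-merged entry, a lookup by the old number now resolves to several entries rather than one. The following Python sketch illustrates this using only the accession numbers listed above; the dictionary layout and the function name are illustrative assumptions, not part of any UniProt tool.

```python
from collections import defaultdict

# De-merged actin entries from the list above:
# entry name -> (new primary AC, secondary ACs)
DEMERGED = {
    "ACTA_BOVIN": ("P62739", ["P03996", "P04108", "Q862W5"]),
    "ACTA_HUMAN": ("P62736", ["P03996", "P04108"]),
    "ACTA_MOUSE": ("P62737", ["P03996", "P04108"]),
    "ACTA_RABIT": ("P62740", ["P03996", "P04108"]),
    "ACTA_RAT": ("P62738", ["P03996", "P04108", "P70476"]),
}


def index_by_secondary(entries):
    """Map each secondary AC to the (entry name, primary AC) pairs carrying it."""
    index = defaultdict(list)
    for name, (primary, secondaries) in entries.items():
        for accession in secondaries:
            index[accession].append((name, primary))
    return dict(index)


index = index_by_secondary(DEMERGED)

# The old primary AC of the merged entry now points to all five species.
for name, primary in index["P03996"]:
    print(name, primary)
```

A program that still stores P03996 therefore has to choose the entry for the species it is interested in; the secondary accession number alone no longer identifies a single entry.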

Release 48.6 of 06-Dec-2005

200'000 entries in UniProtKB/Swiss-Prot

The Swiss-Prot group is happy to announce that a total number of 200'000 manually annotated entries has been reached in UniProtKB/Swiss-Prot. It took 15 years and 2 months to reach the first 100'000 entries in September 2001, but only 4 years and 2 months to reach 200'000 entries. The first (P99999) and the 200'000th (Q52V10) entries deal with human and common squirrel monkey cytochrome c, respectively. Both were created by Amos Bairoch, who founded the knowledgebase, the first in July 1986 and the last in November 2005.

We would like to acknowledge our users, who provide continuous support by suggesting entry updates, by sharing their expertise in order to increase annotation quality or simply by using UniProtKB/Swiss-Prot.

Many thanks also to all annotators, programmers, system administrators, administrative support persons, members of the UniProt Consortium, who have contributed to this major achievement.

Release 48.5 of 22-Nov-2005

Keyword hierarchies and categories

We have changed the structure of the UniProtKB keyword list, and would like to take this opportunity to describe some concepts behind the use of the keywords in UniProtKB/Swiss-Prot.

UniProtKB/Swiss-Prot entries are tagged with keywords. Keywords help summarize the contents of individual entries, simplify retrieval of sets of entries, and allow entries to be grouped easily according to different aspects such as biological processes, molecular function, subcellular location, domains, ligands, sequence modifications and diseases.
The keywords are described in the keywlist.txt file using the following format:


---------  ---------------------------     ----------------------
Line code  Content                         Occurrence in an entry
---------  ---------------------------     ----------------------
ID         Identifier (keyword)            Once; starts an entry
AC         Accession (KW-xxxx)             Once
DE         Definition                      Once or more
SY         Synonyms                        Optional; Once or more
GO         Gene ontology (GO) mapping      Optional; Once or more
HI         Hierarchy                       Optional; Once or more
CA         Category                        Once
//         Terminator                      Once; ends an entry

Example of a complete keyword description:

ID   Calcium channel.
AC   KW-0107
DE   Cell membrane glycoprotein forming a channel in a biological membrane
DE   selectively permeable to calcium ions. Calcium is essential for a
DE   variety of bodily functions, such as neurotransmission, muscle
DE   contraction and proper heart function.
GO   GO:0005262; calcium channel activity
HI   Molecular function: Ionic channel; Calcium channel.
HI   Biological process: Transport; Ion transport; Calcium transport; Calcium channel.
HI   Ligand: Calcium; Calcium channel.
CA   Molecular function.
//
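
As a rough illustration of how this format can be read by a program, the Python sketch below groups the lines of each record by their two-letter line code. It assumes that the value starts at column 6, as in the example above, and that records end with "//"; the function name is hypothetical and the DE text is abridged.

```python
def parse_keyword_records(text):
    """Split keywlist-style text into records keyed by two-letter line code."""
    records, current = [], {}
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("//"):
            # Terminator: close the current record.
            if current:
                records.append(current)
            current = {}
            continue
        # Assumed layout: 2-character code, 3 spaces, then the value.
        code, value = line[:2], line[5:].strip()
        current.setdefault(code, []).append(value)
    if current:  # tolerate a final record without a terminator
        records.append(current)
    return records


EXAMPLE = """\
ID   Calcium channel.
AC   KW-0107
DE   Cell membrane glycoprotein forming a channel in a biological membrane
DE   selectively permeable to calcium ions.
GO   GO:0005262; calcium channel activity
HI   Molecular function: Ionic channel; Calcium channel.
CA   Molecular function.
//
"""

record = parse_keyword_records(EXAMPLE)[0]
print(record["AC"][0])
print(" ".join(record["DE"]))
```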
Some keywords are by definition supersets or subsets of others. Such hierarchical relationships are stated in HI lines:
HI   Category: Keyword(1); ...; Keyword(n); Described keyword.
From the previous example we can infer that a UniProtKB/Swiss-Prot entry that is tagged with the keyword "Calcium channel" will at least have the following additional keywords appear in the KW line:
KW   Calcium; Calcium transport; Ion transport; Ionic channel; Transport.
This formalization of the relationships between keywords enables our curators (assisted by automated procedures) to ensure coherence, and to increase the coverage of UniProtKB/Swiss-Prot entries with keywords describing both specific and more general concepts. This in turn facilitates the retrieval of complete and coherent entry sets by keyword. The current UniProtKB/Swiss-Prot release contains close to one million keywords in almost 200'000 entries.
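
The inference described above can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure used by the curators: it assumes HI lines shaped exactly like those in the "Calcium channel" example, drops the leading category (which never appears in entries) and the described keyword itself, and returns what remains. The function name is hypothetical.

```python
def implied_keywords(hi_lines):
    """Return the broader keywords implied by a keyword's HI lines."""
    implied = set()
    for line in hi_lines:
        # Drop the "HI" line code and the trailing period.
        body = line[2:].strip().rstrip(".")
        # Everything before the first ": " is the category name.
        _category, _, path = body.partition(": ")
        terms = [term.strip() for term in path.split(";")]
        # The last term is the described keyword itself.
        implied.update(terms[:-1])
    return sorted(implied)


HI_LINES = [
    "HI   Molecular function: Ionic channel; Calcium channel.",
    "HI   Biological process: Transport; Ion transport; Calcium transport; Calcium channel.",
    "HI   Ligand: Calcium; Calcium channel.",
]

keywords = implied_keywords(HI_LINES)
print("KW   " + "; ".join(keywords) + ".")
```

Applied to the three HI lines of the example, this yields the same five keywords as the KW line shown above.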

A "Category" is a top-level keyword that never appears directly in UniProtKB/Swiss-Prot entries. Categories are described along with the other keywords, but are introduced by an IC rather than an ID line using the following format:

---------  ---------------------------     ----------------------
Line code  Content                         Occurrence in an entry
---------  ---------------------------     ----------------------
IC         Identifier (category)           Once; starts a category entry
AC         Accession (KW-xxxx)             Once
DE         Definition                      Once or more

Example of a category description:

IC   PTM.
AC   KW-9991
DE   Keywords assigned to proteins because their sequences can differ from
DE   the mere translation of their corresponding genes, due to some post-
DE   translational modification.

Release 48.4 of 08-Nov-2005

A city-sized crowd of authors

UniProtKB/Swiss-Prot is a manually annotated protein knowledgebase. This involves not only sequence curation, but also a critical review of the scientific literature. All references used to create an entry are always cited, whether they are complex publications or simple submissions. Currently, there are more than 1'600 journals referenced in UniProtKB/Swiss-Prot, and more than 210'000 distinct authors [author index].

Bringing all these authors together for a meeting would result in a gathering of a size similar to that of Geneva, the city where UniProtKB/Swiss-Prot was created and is still based... A small, Swiss city with an international vocation, a little like UniProtKB/Swiss-Prot, which began as a local project of a graduate student with quite an ambitious aim: providing the scientific community throughout the world with a central hub for sharing biological knowledge. To achieve this goal, the Swiss-Prot group very quickly developed a strong and fruitful collaboration with the European Bioinformatics Institute (EBI) and more recently with the Protein Information Resource (PIR) of the Georgetown University Medical Center in the USA (http://www.uniprot.org/). The Swiss-Prot group is not only international but also interdisciplinary. Various educational backgrounds are mixed: biologists, biochemists, programmers, mathematicians, wet lab experts or students, etc. This team work is what ensures the quality of the knowledgebase.

Release 48.3 of 25-Oct-2005

Over 100'000 prokaryotic entries in UniProtKB/Swiss-Prot

We have reached 100'000 prokaryotic (bacterial and archaeal) entries in UniProtKB/Swiss-Prot. Of these, just over 10'000 are archaeal entries. To deal with the enormous increase in the amount of available prokaryotic protein sequences, the Swiss-Prot group started the HAMAP project, which aims to automatically annotate, at high throughput but with no decrease in quality, proteins from complete microbial genomes that belong to a family (well-defined or uncharacterized).

The HAMAP annotation system is based on manually curated family rules, which contain the information, derived from searches of the available literature, that can be safely propagated to all members of the family. Profiles are generated from an alignment of seed members; these profiles, in turn, are used to scan all available sequences in Swiss-Prot and TrEMBL to identify family members. The goal of the system is not to search for distant sequence similarities, but to annotate only the proteins that can be conservatively assigned to a HAMAP family. Cases and conditions are included in most rules so that warnings are generated if some conserved features, such as active sites or metal-binding amino acids, are not present in a given protein sequence, or if there are other problems, such as size or taxonomic range. All entries that contain warnings are subjected to manual verification. In fact, since the implementation of the system and for the time being, ALL the proteins that have been annotated using the HAMAP family rules have been manually verified to check the reliability of the HAMAP annotation module.

More than 1'200 family rules are available, and almost 70'000 prokaryotic entries belong to one (or more) HAMAP families. From the HAMAP website it is possible to scan protein sequences for matches against HAMAP families, and it is also possible to submit a whole proteome, by confidential ftp, to be scanned against the collection of HAMAP family rules.

Release 48.2 of 11-Oct-2005

Albumin: the most popular entry in UniProtKB/Swiss-Prot

With 1'997 clicks in September 2005 from 702 different sites, albumin can be considered as the most popular protein in UniProtKB/Swiss-Prot. Albumin is also the most abundant plasma protein with 35-50 g/l (75% of protein molecules in plasma), a concentration higher by a factor of 10^10 than that of cytokines, such as interleukin-6. This broad range of protein concentration in plasma makes identification of low abundance proteins quite a difficult task, like finding an individual human being by searching through the population of the entire world. Albumin is a multifunctional protein with ligand-binding and transport properties, antioxidant functions and enzymatic activities. Physiologically, it is responsible for maintaining colloid osmotic pressure and may influence microvascular integrity and aspects of the inflammatory pathway, including neutrophil adhesion and the activity of cell signaling moieties (for a review, see PubMed: 15915465).

In our "Hit Parade", albumin is followed by p53 (P04637) (1'862 clicks from 430 different sites), the EGF receptor (P00533) (1'414 clicks from 316 different sites) and insulin (P01308) (1'162 clicks from 361 different sites).

We regularly analyze our server access logs in order to determine annotation priorities. In particular, the most frequently requested entries from UniProtKB/TrEMBL are queued for manual annotation and "promotion" into UniProtKB/Swiss-Prot.

Release 48.1 of 27-Sep-2005

10'000 mouse entries

The threshold of 10'000 mouse entries is about to be reached in UniProtKB/Swiss-Prot. 35% of them contain information about isoforms generated by alternative splicing. When these are taken into account, the total number of mouse sequences in UniProtKB/Swiss-Prot is close to 13'000. The second most represented rodent in UniProtKB/Swiss-Prot is rat with about 4'600 entries.

The majority of the mouse entries contain sequence data generated by one of the two high-throughput cDNA sequencing projects: the NIH Mammalian Gene Collection (MGC) and the RIKEN (Rikagaku Kenkyusho, Institute of Physical and Chemical Research) mouse full-length cDNA encyclopedia. 97% of the entries also include cross-references to the Mouse Genome Informatics (MGI) database which provides access to additional data on the genetics, genomics and biology of the laboratory mouse.

Our aim is to annotate all mouse proteins along with the orthologous sequences in other mammals, especially Homo sapiens, in order to provide a complete and comprehensive view of mammalian proteomes.

Release 48.0 of 13-Sep-2005

New major release is available (48.0)
Release 48.0 of 13-Sep-2005 of UniProtKB/Swiss-Prot contains 194'317 sequence entries, comprising 70'391'852 amino acids abstracted from 133'723 references. 11'963 sequences have been added since release 47, the sequence data of 1'095 existing entries has been updated and the annotations of 93'692 entries have been revised. This represents an increase of 7%.

Many improvements were carried out in the last 4 months. In particular, we have expanded our system of feature identifiers (FTIds): feature keys concerning protein processing (CHAIN, PEPTIDE, PROPEP) have been tagged with a new feature identifier with the prefix PRO. We also changed the format of the OG Chloroplast and Cyanelle lines, to be able to indicate more precisely the kind of plastid organelle.

All the recent changes to the UniProt Knowledgebase format are described in detail in the continuously updated document:

http://www.expasy.org/sprot/relnotes/sp_news.html

UniProt Knowledgebase release 6.0 includes Swiss-Prot release 48.0 and TrEMBL release 31.0. For more information you can also read the release notes for the UniProt Knowledgebase, i.e. Swiss-Prot and TrEMBL.

Release 47.8 of 30-Aug-2005

Escherichia coli inner membrane proteome

We have integrated into UniProtKB/Swiss-Prot the results obtained by Gunnar von Heijne's group on the inner membrane proteome of Escherichia coli (see Daley D.O. et al., Science 308:1321-3, 2005; PubMed: 15919996).

von Heijne's group has applied the PhoA/GFP fusion approach to derive topology models for almost the entire E. coli inner membrane proteome. More than 500 entries concerning membrane proteins had their subcellular location updated and the topology added.

Integral membrane proteins account for the coding capacity of 20 to 30% of the genes in typical organisms and are critically important for many cellular functions. However, owing to their hydrophobic and amphiphilic nature, membrane proteins are difficult to study, and they account for less than 1% of the known high-resolution protein structures. Overexpression, purification, biochemical analysis, and structure determination are all far more challenging than for soluble proteins, and membrane proteins have rarely been considered in proteomics or structural genomics contexts to date.

Release 47.7 of 16-Aug-2005

Integration of data from an enzyme genomics project

We have integrated into UniProtKB/Swiss-Prot the results of Aled Edwards' and Alexander Yakunin's group which were summarized in FEMS Microbiology Reviews 29:263-279 (2005) (PubMed: 15808744). Using general enzymatic assays to screen individually purified proteins for enzymatic activity, they have identified activity for 36 previously uncharacterized proteins of E. coli, T. maritima, T. acidophilum, M. jannaschii and P. aeruginosa.

The sequencing of complete genomes produces an increasing number of CDSs which are annotated as "hypothetical proteins". Approximately 40% of the protein sequences deposited in databases do not have any characterized function. This hinders progress and research in many areas ranging from genome annotation to metabolic engineering. It is therefore of fundamental importance to carry on with experimental verification of the function of these proteins, and, equally important, to integrate the results into the database. One of the major priorities in Swiss-Prot is to be up to date with respect to new findings of this kind, and we strive to integrate new characterizations as quickly as possible. We urge all groups obtaining such results to submit update requests to us. We will treat these requests with the highest priority.

Release 47.6 of 2-Aug-2005

The dramatic outbreak of Severe Acute Respiratory Syndrome virus may be due to two mutations in the virus spike protein

The SARS coronavirus is a new human pathogen that emerged in Asia in 2002-2003. The animal reservoir of the virus is presumably palm civet, whose meat is a delicacy in Southern China.

The virus induces acute respiratory distress in humans and is deadly in 10% of all cases. It enters pulmonary cells through binding of the viral spike protein (human isolate Tor2 and palm civet isolate SZ3: P59594) to angiotensin-converting enzyme 2 (ACE2) (palm civet: Q56NL1; human: Q9BYF1). These proteins have recently been annotated or updated in UniProtKB/Swiss-Prot.

It has recently been shown that two amino acid mutations in the palm civet SARS spike protein, at Lys-479 and Ser-487, are sufficient for the virus to acquire the ability to bind human ACE2 efficiently (see EMBO J. 24:1634-1643(2005); PubMed: 15791205).

The severity of the 2002-2003 epidemic was presumably due to those two amino acid mutations, which gave an animal virus the opportunity to cause a major infection in the human species.

Release 47.5 of 19-Jul-2005

Orangutan, the most represented non-human primate in UniProtKB/Swiss-Prot

With more than 500 entries, Pongo pygmaeus (Orangutan) is now the most represented non-human primate. Most of these entries are built around sequence data generated by a cDNA sequencing project launched by the German cDNA Consortium in 2003. Almost 4'000 entries submitted by the consortium are still in UniProtKB/TrEMBL. It should be noted however that some of these sequences are fragments and are thus not a priority for UniProtKB/Swiss-Prot annotation. We plan to manually annotate as many Orangutan entries as possible, starting with sequences orthologous to human ones already described in Swiss-Prot.

There are currently almost 15'500 primate entries in Swiss-Prot, 82% of which describe human proteins.

Release 47.3 of 21-Jun-2005

Hydrogenosomal genome encoded proteins

It was recently found (see Nature 434:74-79(2005); PubMed: 15744302) that some anaerobic ciliates such as Nyctotherus ovalis (which thrives in the hindgut of cockroaches!) have retained a rudimentary hydrogenosomal genome. Hydrogenosomes are double-membraned subcellular structures that generate hydrogen while making the energy-storage compound ATP. They are found in certain eukaryotic unicellular organisms that inhabit oxygen-deficient environments.

The hydrogenosomal genome of N.ovalis is only 14 kb long and seems to encode 11 different proteins, including 5 subunits of the NADH dehydrogenase (complex I) and 2 ribosomal proteins. The genome and the proteins encoded are highly similar to their mitochondrial genome-encoded counterparts, thus establishing an evolutionary link between mitochondria and hydrogenosomes.

We are in the process of annotating the proteins from the N.ovalis hydrogenosomal genome.

This is linked with the introduction of "Hydrogenosome" in the list of valid values in the OG line.

Release 47.2 of 7-Jun-2005

Uncleaved N-terminal translocation signals

The translocation of a protein to another subcellular compartment requires the existence of at least one translocation signal specific to the relevant trafficking mechanism across the membrane. Proteins destined for secretion, incorporation into the plasma membrane, chloroplast, cyanelle, microbodies or the mitochondrial matrix usually possess an N-terminal transfer signal, which is cleaved during the transfer process. Recently, some proteins have been found to bypass this cleavage step. In some cases, the uncleaved signal peptide even confers important functional properties on the protein (P27169).

In the current Swiss-Prot release, the annotation of 23 protein entries indicates an uncleaved signal sequence (e.g. O95445), and the transit peptide of the mitochondrial 3-ketoacyl-CoA thiolase was reported not to be removed (P42765).

Release 47.1 of 24-May-2005

Average of more than 10 cross-references per UniProtKB/Swiss-Prot entry

As described in the UniProt Knowledgebase user manual, integration with other data resources is one of the priorities of UniProtKB/Swiss-Prot.

This is reflected in the high number of DR lines: UniProtKB/Swiss-Prot currently contains more than 1.8 million explicit cross-references, which translates to just over 10 links per Swiss-Prot entry. 68 external databases are referenced in this manner, in addition to the 32 databases to which we link via implicit links, created on the fly by the ExPASy server. All these resources are listed in the List of databases cross-referenced in Swiss-Prot.

The most represented types of cross-reference are those to the family and domain classification databases, i.e. InterPro and its member databases, as well as, obviously, those to the nucleotide sequence database (DR EMBL), our main source for sequence data.

Release 47.0 of 10-May-2005

New major release is available (47.0)
Release 47.0 of UniProtKB/Swiss-Prot contains 181'571 sequence entries, comprising 65'742'349 amino acids abstracted from 128'438 references. 11'531 sequences have been added since release 46, the sequence data of 841 existing entries has been updated and the annotations of 166'572 entries have been revised. This represents an increase of 6%.

Many improvements were carried out in the last 3 months. In particular, we have introduced 5 new feature keys in order to better describe different types of regions in a protein sequence, and we added an additional qualifier to our cross-references to nucleotide sequence databases, the molecule type.

All the recent changes to the Swiss-Prot format are described in detail in the continuously updated document:

http://www.expasy.org/sprot/relnotes/sp_news.html

UniProt Knowledgebase release 5.0 includes Swiss-Prot release 47.0 and TrEMBL release 30.0. For more information you can also read the release notes for the UniProt Knowledgebase, i.e. Swiss-Prot and TrEMBL.

Release 46.6 of 26-Apr-2005

1 million cysteine residues in the Swiss-Prot section of UniProtKB
The total number of cysteine residues in Swiss-Prot has reached the 1 million mark. While there is nothing special about this number, we thought it was interesting in the context of the natural bias in the amino acid composition of proteins. As shown in the biweekly release statistics, there is more than an 8-fold difference between the frequency of the rarest amino acid (tryptophan at 1.15%) and that of the most frequent one (leucine at 9.64%). There are a number of reasons for this compositional bias, one of which is the degeneracy of the genetic code (which allows from 1 to 6 different triplets to code for a specific amino acid), and another one is the prevalence of hydrophobic aliphatic residues such as leucine or isoleucine in transmembrane domains and in signal sequences.
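
The degeneracy mentioned above can be checked with a short Python sketch that counts the codons assigned to each amino acid in the standard genetic code. The table is written in the usual TCAG order with "*" for stop codons; the variable names are illustrative. The counts only illustrate one of the contributing factors and are not a model of the observed frequencies.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

codon_table = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Number of codons per amino acid, ignoring the three stop codons.
degeneracy = Counter(aa for aa in codon_table.values() if aa != "*")

for amino_acid in "WCL":  # tryptophan, cysteine, leucine
    print(amino_acid, degeneracy[amino_acid])
```

Tryptophan, the rarest residue, is encoded by a single codon, whereas leucine, the most frequent one, is encoded by six.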

Release 46.4 of 29-Mar-2005

Adding the keyword 'Complete proteome' to fungal entries
The keyword 'Complete proteome' is added to UniProtKB entries which originate from an organism whose genome has been completely sequenced. Until recently this keyword was only used for proteins from complete bacterial or archaeal genomes. We want to gradually increase the scope of this keyword to other groups of species. As a first step, we have now added this keyword to entries originating from 8 complete fungal genomes, namely: Ashbya gossypii, Candida glabrata, Debaryomyces hansenii, Encephalitozoon cuniculi, Kluyveromyces lactis, Saccharomyces cerevisiae, Schizosaccharomyces pombe and Yarrowia lipolytica.

The presence of this keyword makes it easy to retrieve a complete non-redundant set of proteins from a specified genome across the Swiss-Prot and TrEMBL sections of the UniProt Knowledgebase.

Release 46.2 of 1-Mar-2005

More than 10'000 additional sequences encoded on splice variants in Swiss-Prot
Swiss-Prot is a non-redundant protein knowledgebase where the protein sequences from the same organism originating from the same gene are merged into one entry. When alternative products are produced by alternative splicing, the number of isoforms and their properties are indicated in the comment lines under the "Alternative products" topic, and the "Alternative splicing" keyword is added to the entry. The sequences of the alternative forms, if known, are described in the feature table under the key name "VARSPLIC". E.g. PRKN2_HUMAN (O60260).

More often than not, it is the longest isoform that is shown in a Swiss-Prot entry. The additional isoforms can be reconstructed from the annotation. We assign a unique identifier to each isoform (IsoId), and a unique feature identifier (FTId) to each "VARSPLIC" feature described in the feature table. Each IsoId comes with a list of FTIds which serve as "instructions" to be applied in order to reconstruct the isoform's sequence.

The number of these additional Swiss-Prot recreated splice variants has recently reached the 10'000 mark. More than half are human isoforms. More than 80% are from mammals.
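
The idea of applying a list of features as "instructions" can be sketched as follows in Python. This is a simplified illustration, not the Swiss-Prot implementation: each feature is assumed to be a (start, end, replacement) tuple with 1-based inclusive coordinates on the displayed sequence, an empty replacement stands for a missing segment, and the sequence and all identifiers are made up.

```python
def build_isoform(canonical, features):
    """Apply VARSPLIC-like edits to the displayed (canonical) sequence."""
    ordered = sorted(features, key=lambda feature: feature[0])
    # Refuse overlapping edits, which would be ambiguous.
    for (_, prev_end, _), (next_start, _, _) in zip(ordered, ordered[1:]):
        if next_start <= prev_end:
            raise ValueError("features overlap")
    sequence = canonical
    # Work from the C-terminus backwards so that the coordinates of
    # earlier features are not shifted by edits already applied.
    for start, end, replacement in reversed(ordered):
        sequence = sequence[:start - 1] + replacement + sequence[end:]
    return sequence


CANONICAL = "MKTAYIAKQR"  # made-up 10-residue sequence

# Hypothetical feature identifiers -> (start, end, replacement)
FEATURES = {
    "FT_A": (3, 5, ""),    # residues 3-5 are missing
    "FT_B": (8, 9, "GG"),  # residues 8-9 are replaced by GG
}

# Hypothetical isoform identifier -> list of feature identifiers
ISOFORMS = {"ISO_2": ["FT_A", "FT_B"]}

isoform = build_isoform(CANONICAL, [FEATURES[ft] for ft in ISOFORMS["ISO_2"]])
print(isoform)
```

Applying the edits from the end of the sequence towards the start is a design choice that keeps every coordinate expressed relative to the displayed sequence.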

A fasta-formatted file containing all splice variants annotated in Swiss-Prot and TrEMBL can be downloaded for use with similarity search programs. Most sequence analysis and proteomic tools on ExPASy, e.g. BLAST or Aldente, have been adapted to take into account, in addition to all Swiss-Prot and TrEMBL entries, all other annotated splice isoforms.

Release 46.1 of 15-Feb-2005

Massive number of changes to entry names
As mentioned in the 'recent changes' document we recently allowed entry names to consist of up to 11 characters instead of 10. An entry name consists of two parts, a prefix which is a mnemonic code representing the protein name and a suffix which is a mnemonic species identification code. Example: RECA_BACSU is the entry name for the recA protein of Bacillus subtilis. The increase from 10 to 11 characters allows the protein name mnemonic to increase from 4 to 5 characters.
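
The two-part structure of an entry name can be illustrated with a small Python helper. It is only a sketch: the function name is hypothetical, and the limit of 5 characters for the species code is inferred from the figures above (11 characters in total, including the underscore and a protein mnemonic of up to 5 characters).

```python
def split_entry_name(entry_name):
    """Split an entry name into (protein mnemonic, species code)."""
    protein, separator, species = entry_name.partition("_")
    if not separator or not protein or not species:
        raise ValueError(f"not of the form PROTEIN_SPECIES: {entry_name!r}")
    # Assumed limits: up to 5 characters on each side of the underscore.
    if len(protein) > 5 or len(species) > 5:
        raise ValueError(f"mnemonic or species code too long: {entry_name!r}")
    return protein, species


print(split_entry_name("RECA_BACSU"))
```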

Thanks to this change, we are now able to assign more meaningful entry names to a significant number of entries. In the past month we have gone through almost all of the Swiss-Prot entries and have checked to see if they could benefit from an entry name update. As a consequence of this process we updated more than 35'000 entry names (about 20% of all the entries in Swiss-Prot). In about 33'000 cases we created ID prefixes consisting of 5 characters and in the rest of the cases we changed existing prefixes of 3 to 4 characters to more meaningful and consistent prefixes of the same length.

Due to these massive changes in entry names, it is probable that the names of some protein entries that you are using have changed. It is therefore useful to remind our users that we provide a tool, the IDtracker, which makes it possible to trace the identifiers (ID) of protein entries. You can use this tool to enquire about the whereabouts of one or more entry names and to obtain the newly assigned names and primary accession numbers.

It can seem paradoxical that we insist on warning users that they should always use accession numbers when citing an entry, yet we strive to provide meaningful entry names. The reason for this dichotomy is simple: if you want to refer to specific entries in a publication or in any document, you need to ensure that your reference is stable, unique and unambiguous. Such a mechanism is provided by the accession numbers. Accession numbers are stable identifiers and you can be sure that you will always be able to track down a specific entry if you use an accession number. But when you want to access an entry from a web server or a sequence analysis program, then it is much easier to remember an entry name than an accession number. The human mind is structured in such a way that it is generally easier for most of us to remember something like "APOA5_HUMAN" rather than something like "Q6Q788".

Release 46.0 of 01-Feb-2005

New major release is available (46.0)

Release 46.0 of Swiss-Prot contains 168'297 sequence entries, comprising 61'443'278 amino acids abstracted from 124'910 references. 4'537 sequences have been added since release 45, the sequence data of 866 existing entries has been updated and the annotations of 77'494 entries have been revised. This represents an increase of 3%.

Many improvements were carried out in the last 3 months. In particular, we have extended the format for ID lines in both Swiss-Prot and TrEMBL. All the recent changes to Swiss-Prot format are described in detail in the continuously updated document:

http://www.expasy.org/sprot/relnotes/sp_news.html

UniProt Knowledgebase release 4.0 includes Swiss-Prot release 46.0 and TrEMBL release 29.0. For more information you can also read the release notes for the UniProt Knowledgebase, i.e. Swiss-Prot and TrEMBL.

Release 45.4 of 21-Dec-2004

Annotation of TRPA1 and TRPM8 transient receptors involved in sensing cold
Mammals detect temperature with specialized neurons in the peripheral nervous system. Two transient receptor-class channels, TRPA1 and TRPM8, have been implicated in sensing cold.

The first of these channels, TRPA1, is activated by temperatures below 17 degrees Celsius, which corresponds to the noxious cold threshold. The second channel, TRPM8, plays a role in sensing less extreme cool temperatures, being activated below 25 degrees Celsius.

Interestingly, the TRPM8 channel is also activated by products such as eucalyptol or menthol, which may provide a ready explanation for the sensation of coolness triggered by these flavors.

Release 45.2 of 23-Nov-2004

Major update of C.elegans entries
We have recently finished a major update of Caenorhabditis elegans entries in Swiss-Prot. The following tasks were carried out:

Release 45.0 of 25-Oct-2004

New major release is available (45.0)

Release 45.0 of Swiss-Prot contains 163'235 sequence entries, comprising 59'631'787 amino acids abstracted from 120'520 references. 6'183 sequences have been added since release 44, the sequence data of 2'851 existing entries has been updated and the annotations of 71'220 entries have been revised. This represents an increase of 4%.

Many improvements were carried out in the last 3 months at the level of the DR, CC, KW and FT lines. All the recent changes to Swiss-Prot format are described in detail in the continuously updated document:

http://www.expasy.org/sprot/relnotes/sp_news.html

UniProt Knowledgebase release 3.0 includes Swiss-Prot release 45.0 and TrEMBL release 28.0. For more information you can also read the release notes, which for the first time cover the UniProt Knowledgebase, i.e. Swiss-Prot and TrEMBL.

Release 44.4 of 31-Aug-2004

1'500 cited journals
It is interesting to note that information relevant to the scope of Swiss-Prot is found in a continuously increasing number of scientific journals. Currently Swiss-Prot cites 1'500 different journals. Only 5 years ago, this number was slightly less than 1'000. Out of those 1'500 journals, 157 are either no longer published or have changed their names. It is also noteworthy that about 50% of these 1'500 journals are cited fewer than four times in the knowledgebase. At the other extreme, only 106 journals are cited more than 100 times.

Release 44.1 of 19-Jul-2004

Annotation of HERV protein sequences
The human genome contains a number of human endogenous retroviruses (HERVs). These proviruses (the integrated form of retroviral DNA) are retroviral sequences that are transmitted vertically as part of the host germ line. A number of HERV 'families' have been identified, each derived from an independent colonisation event.

Some proviruses display open reading frames with coding capacity for a variety of viral-like proteins (Env, Gag, Pol, Pro, etc.). We have already annotated in Swiss-Prot a significant number of HERV proteins. We only include such potential proteins if they meet one of these three criteria: i) there is evidence of their expression by the host, ii) the derived sequence encodes a full-length protein, iii) the protein has a potential cellular function.

Release 44.0 of 05-Jul-2004

New major release is available (44.0)

Release 44.0 of Swiss-Prot contains 153'825 sequence entries, comprising 56'599'343 amino acids abstracted from 117'387 references. 6'633 sequences have been added since release 43, the sequence data of 582 existing entries has been updated and the annotations of 139'855 entries have been revised. This represents an increase of 4%.

Many improvements were carried out in the last 3 months at the level of the GN, RX, CC and FT lines. All the recent changes to Swiss-Prot format are described in detail in the continuously updated document:

http://www.expasy.org/sprot/relnotes/sp_news.html

For more information you can also read the release notes

Release 43.6 of 21-Jun-2004

Noah's ark or biodiversity in Swiss-Prot
While the crux of Swiss-Prot annotation is targeted toward a number of model organisms (human, Arabidopsis, Drosophila, E. coli, etc.), there is a continual increase in the number of species that are represented in the knowledgebase. Currently Swiss-Prot contains sequences originating from about 8'550 different species. For ~50% of these species there is only one associated entry in Swiss-Prot. This is often a protein whose gene is used for building phylogenetic trees, such as RuBisCO, cytochrome b or hemoglobin. On the other end of the spectrum, the 20 most represented species cover about 40% of the database (60'000 sequences). The most represented species is of course ourselves, with Homo sapiens filling up 7% of Swiss-Prot.

Release 43.5 of 7-Jun-2004

Fungi and Swiss-Prot
As we are reaching the 10'000 entries mark for fungi in Swiss-Prot, we believe it is useful to inform our users that we are actively working on speeding up the annotation and re-annotation of fungal protein sequences, most notably those originating from the two model organisms Saccharomyces cerevisiae and Schizosaccharomyces pombe. We are currently building up a fungal annotation group which will soon consist of four annotators, three in Geneva and one in Hinxton.

Release 43.3 of 10-May-2004

Swiss-Prot reaches 150'000 entries
With this release the number of Swiss-Prot entries has reached the 150'000 mark. It took about 9.5 years to reach the 50'000 entries mark (January 1996), almost 6 more years to reach 100'000 entries (September 2001) and about 2.5 more years to reach the current 150'000 entries.

The continuous increase in the speed of annotation is due to a number of factors, among them the increase in the number of annotators working for Swiss-Prot at SIB and EBI, the increase in the productivity of these annotators, the implementation and improvement of software tools that help to automate some annotation tasks, facilitated access to many third party resources, and the gradual rise in the quality of the underlying DNA sequences as well as of genomic and cDNA annotation.
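
As a rough illustration of the acceleration described above, the three milestone intervals can be converted into average annotation rates. This is only a back-of-the-envelope sketch: the durations are the rounded figures quoted in the text (with "almost 6" taken as 6.0 years), and the function name is hypothetical.

```python
def entries_per_year(entries_added, years):
    """Average number of entries added per year over an interval."""
    return entries_added / years

# (interval label, entries added, approximate duration in years)
# Durations are the rounded values quoted in the text, not exact dates.
milestones = [
    ("0 to 50'000", 50_000, 9.5),
    ("50'000 to 100'000", 50_000, 6.0),
    ("100'000 to 150'000", 50_000, 2.5),
]

rates = [round(entries_per_year(n, y)) for _, n, y in milestones]

for (label, _, _), rate in zip(milestones, rates):
    print(f"{label}: about {rate} entries per year")
```

With these rounded inputs, the average rate over the most recent interval comes out at nearly four times that of the first.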

Release 43.2 of 26-Apr-2004

Two new completely annotated microbial proteomes
In the framework of the HAMAP project we not only annotate specified microbial protein families, but we also aim to completely annotate all the proteins from a number of selected microbial genomes.

We maintain pages that list complete bacterial and archaeal proteomes and which report the status of completion of the annotations in Swiss-Prot.

We have now completed the annotation of two more microbial genomes, namely those of Buchnera aphidicola (subsp. Baizongia pistaciae) and Methanococcus jannaschii. The total number of microbial genomes where all proteins are annotated in Swiss-Prot is now 8 and more are yet to come.

Release 43.1 of 13-Apr-2004

Extinct organisms and Swiss-Prot...
Did you know that Swiss-Prot contains proteins originating from extinct organisms? Since the beginning of the 90s, various groups have sequenced gene fragments from a variety of extinct organisms. Most of the time, the resulting sequences are too small or too fragmentary to be translated into protein sequences. But this is not always the case, and we harbor a few complete or partial sequences originating from species that existed on earth in various periods of time.

For example we have a RuBisCO large subunit from a fossil leaf of a Miocene (17-20 Myr old) Magnolia, P30828.

Much more recent is a complete cytochrome b sequence from a Siberian mammoth, P92658.

Of even greater interest to those curious about the longevity of proteins is the complete sequence of an osteocalcin from a steppe bison. This sequence was obtained by mass spectrometry directly from permafrost fossilized bones, about 55-56 Kyr old, P83489.

Release 43.0 of 29-Mar-2004

New major release is available (43.0)

Release 43.0 of Swiss-Prot contains 146'720 sequence entries, comprising 54'093'154 amino acids abstracted from 113'719 references. 10'760 sequences have been added since release 42, the sequence data of 663 existing entries has been updated and the annotations of 44'948 entries have been revised. This represents an increase of 8%.

Many improvements were carried out in the last 6 months at the level of the CC and FT lines. All the recent changes to Swiss-Prot format are described in detail in the continuously updated document:

http://www.expasy.org/sprot/relnotes/sp_news.html

For more information you can also read the release notes

Release 42.11 of 1-Mar-2004

More than 500'000 comment blocks in Swiss-Prot
One of the important aspects of the annotation process is to provide, for each protein, a description of a number of meaningful biological elements such as the function or role of a protein, its subcellular location, its membership in a specific family, etc. All of this information is stored in the comments field (CC). Comments are organized by topics, 24 types of which are currently defined. A specific comment can consist of several sentences or other textual elements, which are grouped into what we term a comment block.

The total number of comment blocks has now reached the 500'000 mark, which corresponds to an average of 3.5 blocks per Swiss-Prot entry (cf. the release statistics).

Release 42.9 of 2-Feb-2004

SPIN - the new web tool for sequence submission to Swiss-Prot
A new web-based tool, SPIN, is available for submitting directly sequenced protein sequences and their biological annotations to the Swiss-Prot Protein Knowledgebase. SPIN guides you through a sequence of WWW forms allowing interactive submission. The information required to create a database entry will be collected during this process.

Annotation updates for existing Swiss-Prot entries are highly appreciated and should be submitted via the "Submit Update" button at the top of any entry in NiceProt view. User update requests are treated with a high priority by our annotators.

Release 42.8 of 16-Jan-2004

10'000 different citations for JBC in Swiss-Prot
The Journal of Biological Chemistry (generally known as JBC) has always been a gold mine for publications directly relevant to the scope of Swiss-Prot. Starting with the first release in 1986 and up to now it has always been the most cited journal in Swiss-Prot. We are now citing about 10'000 different JBC papers in Swiss-Prot. This is almost twice the value for the next most cited journal, PNAS (Proceedings of the National Academy of Sciences of the U.S.A.).

It is also noteworthy that JBC was the first major life science journal to be available as full text on the WWW. It is therefore a good opportunity to thank the JBC editorial board and its staff for the great service they are providing to the Life Sciences community.

Release 42.7 of 15-Dec-2003

First release of UniProt
Release 42.7 of Swiss-Prot is integrated in the first release of UniProt, the Universal Protein Resource. Swiss-Prot and TrEMBL are the two sections of the UniProt Knowledgebase.

Please go to www.uniprot.org for more details on UniProt and its different components.

Release 42.5 of 21-Nov-2003

Monkey business!
Comparing the human genome with those of apes such as chimpanzees, gibbons, gorillas and orangutans has long been a wish of many life scientists. It is becoming a reality thanks to various sequencing initiatives targeted toward the elucidation of primate genomic sequences. However, it will take some time before a significant number of high quality complete protein sequences are available. In the meantime we are trying to ensure that whenever an ape sequence that corresponds to a cognate human protein is available, that sequence gets annotated very quickly.

For example, in the last two weeks, the number of annotated chimpanzee protein sequences in Swiss-Prot has doubled.

Release 42.3 of 07-Nov-2003

More than 10'000 human proteins have been annotated
In the framework of the HPI project, we have annotated more than 10'000 proteins (almost 10'300). The number of genes represented is not exactly equal to the number of proteins for at least four reasons:

But even taking the above factors into account, we do have more than 10'000 protein-encoding genes represented in Swiss-Prot.

Release 42.0 of 10-Oct-2003

New major release is available (42.0)
Release 42.0 of Swiss-Prot contains 135'850 sequence entries, comprising 50'046'799 amino acids abstracted from 109'694 references. 13'374 sequences have been added since release 41, the sequence data of 1'298 existing entries has been updated and the annotations of 45'617 entries have been revised. This represents an increase of 11%.

Many improvements were carried out in the last 6 months at the level of the CC and FT lines. All the recent changes to Swiss-Prot format are described in detail in the continuously updated document:

http://www.expasy.org/sprot/relnotes/sp_news.html

For more information you can also read the release notes

Release 41.18 of 25-Jul-2003

Annotation of microbial H(+)-translocating pyrophosphatases
We have annotated the microbial H(+)-translocating pyrophosphatases present in the acidocalcisome, the first eukaryotic organelle to be found in bacteria.

Acidocalcisomes are organelles that have an acidic nature and a high electron density, and that contain high concentrations of calcium, magnesium, pyrophosphate and polyP. They were originally found in unicellular eukaryotes, such as Toxoplasma gondii and trypanosomatids. It has been postulated that acidocalcisomes may have an important role as an energy source and in the regulation of intracellular pH, calcium concentration and osmotic conditions.

Now the group of Roberto Docampo has found them in the bacterium Agrobacterium tumefaciens. This is the first organelle to be found in bacteria that has a direct counterpart in eukaryotes. The typical characteristic of the acidocalcisome is the presence of a number of pumps and exchangers: one of them is the H(+)-translocating pyrophosphatase (H+-PPase). This pump generates a proton motive force and may be responsible for the synthesis of pyrophosphate. H+-PPases are found in several bacteria and archaea, and at present it is unknown whether any of these is also localized in acidocalcisomes. As these pumps are present in some pathogenic bacteria but not in humans, drugs that target them might be effective against these infections.

Release 41.8 of 16-May-2003

Complete update of PDB cross-references
We have completely updated our cross-references to PDB. Thanks to work done by the EBI and Geneva Swiss-Prot groups in collaboration with the EBI MSD (Macromolecular Structure Database) group, we have mapped PDB structural data, at the atom level, to the relevant Swiss-Prot and TrEMBL entries. This work has led to the introduction of cross-references to PDB in TrEMBL and a very significant increase in the number of these cross-references in Swiss-Prot. More than 6'000 cross-references were added and the number of Swiss-Prot entries that are linked to PDB is now above 5'300 (versus about 3'600 before this work was carried out).

Full statistics are available in the document pdbtosp.txt

Release 41.5 of 23-Apr-2003

SARS coronavirus protein sequences are available
We have made a first annotation run of the proteins potentially encoded by the SARS (Severe Acute Respiratory Syndrome) coronavirus. The following entries are available:

Nucleocapsid protein (P59595)
E1 glycoprotein (P59596)
E2 glycoprotein (P59594)
Envelope protein (P59637)
Replicase polyprotein 1ab (P59641)
Hypothetical protein X1 (P59632)
Hypothetical protein X2 (P59633)
Hypothetical protein X3 (P59634)
Hypothetical protein X4 (P59635)
Hypothetical protein 5 (P59636)